generic plans and "initial" pruning
Executing generic plans involving partitions is known to become slower
as partition count grows due to a number of bottlenecks, with
AcquireExecutorLocks() showing at the top in profiles.
A previous attempt at solving that problem was made by David Rowley [1],
where he proposed delaying locking of *all* partitions appearing under
an Append/MergeAppend until "initial" pruning is done during the
executor initialization phase. A problem with that approach that he
has described in [2] is that leaving partitions unlocked can lead to
race conditions where the Plan node belonging to a partition can be
invalidated when a concurrent session successfully alters the
partition between AcquireExecutorLocks() saying the plan is okay to
execute and then actually executing it.
However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.
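To make the idea concrete, here is a toy model (Python for brevity; Scan, Append, and collect_lock_relids are illustrative stand-ins, not PostgreSQL structures): relations under subplans discarded by the "initial" pruning predicate never make it into the set of relations to lock.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    relid: int                  # relation to lock if this node survives

@dataclass
class Append:
    subplans: list              # child plan nodes
    initial_prune: object = None  # callable(index) -> bool, or None

def collect_lock_relids(node):
    """Collect relations to lock, honoring initial pruning steps."""
    if isinstance(node, Scan):
        return {node.relid}
    relids = set()
    for i, sub in enumerate(node.subplans):
        if node.initial_prune is not None and not node.initial_prune(i):
            continue            # pruned: skip locking this whole subtree
        relids |= collect_lock_relids(sub)
    return relids

# Append over 4 partitions; pruning keeps only subplan 2 (say, key = $1
# routes to the third partition).
plan = Append([Scan(100 + i) for i in range(4)],
              initial_prune=lambda i: i == 2)
```

With pruning, only relation 102 needs a lock; without the pruning step all four partitions would be locked, which is what AcquireExecutorLocks() does today.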
Note that "initial" pruning steps are now performed twice when
executing generic plans: once in AcquireExecutorLocks() to find
partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
determine the set of partition sub-nodes to be initialized for
execution, though I wasn't able to come up with a good idea to avoid
this duplication.
Using the following benchmark setup:
pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared
and plan_cache_mode = force_generic_plan, I get the following numbers:
HEAD:
32 tps = 20561.776403 (without initial connection time)
64 tps = 12553.131423 (without initial connection time)
128 tps = 13330.365696 (without initial connection time)
256 tps = 8605.723120 (without initial connection time)
512 tps = 4435.951139 (without initial connection time)
1024 tps = 2346.902973 (without initial connection time)
2048 tps = 1334.680971 (without initial connection time)
Patched:
32 tps = 27554.156077 (without initial connection time)
64 tps = 27531.161310 (without initial connection time)
128 tps = 27138.305677 (without initial connection time)
256 tps = 25825.467724 (without initial connection time)
512 tps = 19864.386305 (without initial connection time)
1024 tps = 18742.668944 (without initial connection time)
2048 tps = 16312.412704 (without initial connection time)
--
Amit Langote
EDB: http://www.enterprisedb.com
[1]: /messages/by-id/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com
[2]: /messages/by-id/CAKJS1f99JNe+sw5E3qWmS+HeLMFaAhehKO67J1Ym3pXv0XBsxw@mail.gmail.com
Attachments:
v1-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
Executing generic plans involving partitions is known to become slower
as partition count grows due to a number of bottlenecks, with
AcquireExecutorLocks() showing at the top in profiles.

Previous attempt at solving that problem was by David Rowley [1],
where he proposed delaying locking of *all* partitions appearing under
an Append/MergeAppend until "initial" pruning is done during the
executor initialization phase. A problem with that approach that he
has described in [2] is that leaving partitions unlocked can lead to
race conditions where the Plan node belonging to a partition can be
invalidated when a concurrent session successfully alters the
partition between AcquireExecutorLocks() saying the plan is okay to
execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.
In which cases will we have "pre-execution" pruning instructions that
can be used to skip locking partitions? Can you please give a few
examples where this approach will be useful?
The benchmark is showing good results, indeed.
--
Best Wishes,
Ashutosh Bapat
On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com>
wrote:

Executing generic plans involving partitions is known to become slower
as partition count grows due to a number of bottlenecks, with
AcquireExecutorLocks() showing at the top in profiles.

Previous attempt at solving that problem was by David Rowley [1],
where he proposed delaying locking of *all* partitions appearing under
an Append/MergeAppend until "initial" pruning is done during the
executor initialization phase. A problem with that approach that he
has described in [2] is that leaving partitions unlocked can lead to
race conditions where the Plan node belonging to a partition can be
invalidated when a concurrent session successfully alters the
partition between AcquireExecutorLocks() saying the plan is okay to
execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.

In which cases, we will have "pre-execution" pruning instructions that
can be used to skip locking partitions? Can you please give a few
examples where this approach will be useful?
This is mainly to be useful for prepared queries, so something like:
prepare q as select * from partitioned_table where key = $1;
And that too when execute q(…) uses a generic plan. Generic plans are
problematic because they must contain nodes for all partitions (without any
plan time pruning), which means CheckCachedPlan() has to spend time
proportional to the number of partitions to determine that the plan is
still usable / has not been invalidated; most of that is
AcquireExecutorLocks().
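For example (hypothetical session against a pgbench database with 32 partitions; plan output abbreviated and approximate):

```sql
SET plan_cache_mode = force_generic_plan;
PREPARE q AS SELECT * FROM pgbench_accounts WHERE aid = $1;

-- The generic plan contains an Append subplan for every partition;
-- "initial" pruning at executor startup discards all but the match:
EXPLAIN (COSTS OFF) EXECUTE q(1);
--  Append
--    Subplans Removed: 31
--    ->  Index Scan using pgbench_accounts_1_pkey on pgbench_accounts_1
--          Index Cond: (aid = $1)
```

It is exactly those "Subplans Removed" partitions whose locks the patch proposes to skip in AcquireExecutorLocks().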
Other bottlenecks, not addressed in this patch, pertain to some executor
startup/shutdown subroutines that process the range table of a PlannedStmt
in its entirety, whose length is also proportional to the number of
partitions when the plan is generic.
The benchmark is showing good results, indeed.
Thanks.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, Dec 31, 2021 at 7:56 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
Executing generic plans involving partitions is known to become slower
as partition count grows due to a number of bottlenecks, with
AcquireExecutorLocks() showing at the top in profiles.

Previous attempt at solving that problem was by David Rowley [1],
where he proposed delaying locking of *all* partitions appearing under
an Append/MergeAppend until "initial" pruning is done during the
executor initialization phase. A problem with that approach that he
has described in [2] is that leaving partitions unlocked can lead to
race conditions where the Plan node belonging to a partition can be
invalidated when a concurrent session successfully alters the
partition between AcquireExecutorLocks() saying the plan is okay to
execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.

In which cases, we will have "pre-execution" pruning instructions that
can be used to skip locking partitions? Can you please give a few
examples where this approach will be useful?

This is mainly to be useful for prepared queries, so something like:
prepare q as select * from partitioned_table where key = $1;
And that too when execute q(…) uses a generic plan. Generic plans are problematic because it must contain nodes for all partitions (without any plan time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that is AcquireExecutorLocks().
Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.
The benchmark is showing good results, indeed.
Indeed.
Here are few comments for v1 patch:
+ /* Caller error if we get here without contains_init_steps */
+ Assert(pruneinfo->contains_init_steps);
- prunedata = prunestate->partprunedata[i];
- pprune = &prunedata->partrelprunedata[0];
- /* Perform pruning without using PARAM_EXEC Params */
- find_matching_subplans_recurse(prunedata, pprune, true, &result);
+ if (parentrelids)
+ *parentrelids = NULL;
You got two blank lines after Assert.
--
+ /* Set up EState if not in the executor proper. */
+ if (estate == NULL)
+ {
+ estate = CreateExecutorState();
+ estate->es_param_list_info = params;
+ free_estate = true;
}
... [Skip]
+ if (free_estate)
+ {
+ FreeExecutorState(estate);
+ estate = NULL;
}
I think this work should be left to the caller.
--
/*
* Stuff that follows matches exactly what ExecCreatePartitionPruneState()
* does, except we don't need a PartitionPruneState here, so don't call
* that function.
*
* XXX some refactoring might be good.
*/
+1, while doing it would be nice if foreach_current_index() is used
instead of the i & j sequence in the respective foreach() block, IMO.
--
+ while ((i = bms_next_member(validsubplans, i)) >= 0)
+ {
+ Plan *subplan = list_nth(subplans, i);
+
+ context->relations =
+ bms_add_members(context->relations,
+ get_plan_scanrelids(subplan));
+ }
I think instead of get_plan_scanrelids() the
GetLockableRelations_worker() can be used; if so, then no need to add
get_plan_scanrelids() function.
--
/* Nodes containing prunable subnodes. */
+ case T_MergeAppend:
+ {
+ PlannedStmt *plannedstmt = context->plannedstmt;
+ List *rtable = plannedstmt->rtable;
+ ParamListInfo params = context->params;
+ PartitionPruneInfo *pruneinfo;
+ Bitmapset *validsubplans;
+ Bitmapset *parentrelids;
...
if (pruneinfo && pruneinfo->contains_init_steps)
{
int i;
...
return false;
}
}
break;
Most of the declarations need to be moved inside the if-block.
Also, initially, I was a bit concerned regarding this code block
inside GetLockableRelations_worker(), what if (pruneinfo &&
pruneinfo->contains_init_steps) evaluated to false? After debugging I
realized that plan_tree_walker() will do the needful -- a bit of
comment would have helped.
--
+ case T_CustomScan:
+ foreach(lc, ((CustomScan *) plan)->custom_plans)
+ {
+ if (walker((Plan *) lfirst(lc), context))
+ return true;
+ }
+ break;
Why not plan_walk_members() call like other nodes?
Regards,
Amul
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.
Hmm. The first question that occurs to me is whether this is fully safe.
Currently, AcquireExecutorLocks calls LockRelationOid for every
relation involved in the query. That means we will probably lock at
least one relation on which we previously had no lock and thus
AcceptInvalidationMessages(). That will end up marking the query as no
longer valid and CheckCachedPlan() will realize this and tell the
caller to replan. In the corner case where we already hold all the
required locks, we will not accept invalidation messages at this
point, but must have done so after acquiring the last of the locks
required, and if that didn't mark the plan invalid, it can't be
invalid now either. Either way, everything is fine.
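The invariant described here can be modeled in miniature (illustrative Python; Session and its methods are made-up stand-ins for the lock manager and sinval behavior): processing invalidations whenever a new lock is taken is safe as long as every plan relation gets locked, but skipping locks can leave a pending invalidation unseen.

```python
class Session:
    def __init__(self, held_locks=(), pending_inval=()):
        self.held_locks = set(held_locks)
        self.pending_inval = list(pending_inval)  # queued DDL messages
        self.plan_valid = True

    def accept_invalidation_messages(self):
        # Any pending message invalidates the cached plan (toy rule).
        if self.pending_inval:
            self.pending_inval.clear()
            self.plan_valid = False

    def lock_relation(self, oid):
        # Like LockRelationOid(): a *new* lock processes invalidations.
        if oid not in self.held_locks:
            self.held_locks.add(oid)
            self.accept_invalidation_messages()

    def acquire_executor_locks(self, plan_relids, skip=frozenset()):
        for oid in sorted(plan_relids):
            if oid not in skip:
                self.lock_relation(oid)
        return self.plan_valid

# Concurrent DDL on partition 42 queued an invalidation; locking any
# relation afresh notices it and the caller replans:
s = Session(pending_inval=[42])
assert s.acquire_executor_locks({41, 42}) is False
```

The hazard case appears when 41 is already locked and 42 is skipped as pruned: no new lock is taken, the message is never processed, and the stale plan is reported as still valid.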
With the proposed patch, we might never lock some of the relations
involved in the query. Therefore, if one of those relations has been
modified in some way that would invalidate the plan, we will
potentially fail to discover this, and will use the plan anyway. For
instance, suppose there's one particular partition that has an extra
index and the plan involves an Index Scan using that index. Now
suppose that the scan of the partition in question is pruned, but
meanwhile, the index has been dropped. Now we're running a plan that
scans a nonexistent index. Admittedly, we're not running that part of
the plan. But is that enough for this to be safe? There are things
(like EXPLAIN or auto_explain) that we might try to do even on a part
of the plan tree that we don't try to run. Those things might break,
because for example we won't be able to look up the name of an index
in the catalogs for EXPLAIN output if the index is gone.
This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example. The point is that to my knowledge we have no existing
case where we try to use a plan that might be only partly valid, so if
we introduce one, there's some risk there. I thought for a while, too,
about whether changes to some object in a part of the plan that we're
not executing could break things for the rest of the plan even if we
never do anything with the plan but execute it. I can't quite see any
actual hazard. For example, I thought about whether we might try to
get the tuple descriptor for the pruned-away object and get a
different tuple descriptor than we were expecting. I think we can't,
because (1) the pruned object has to be a partition, and tuple
descriptors have to match throughout the partitioning hierarchy,
except for column ordering, which currently can't be changed
after-the-fact and (2) IIRC, the tuple descriptor is stored in the
plan and not reconstructed at runtime and (3) if we don't end up
opening the relation because it's pruned, then we certainly can't do
anything with its tuple descriptor. But it might be worth giving more
thought to the question of whether there's any other way we could be
depending on the details of an object that ended up getting pruned.
Note that "initial" pruning steps are now performed twice when
executing generic plans: once in AcquireExecutorLocks() to find
partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
determine the set of partition sub-nodes to be initialized for
execution, though I wasn't able to come up with a good idea to avoid
this duplication.
I think this is something that will need to be fixed somehow. Apart
from the CPU cost, it's scary to imagine that the set of nodes on
which we acquired locks might be different from the set of nodes that
we initialize. If we do the same computation twice, there must be some
non-zero probability of getting a different answer the second time,
even if the circumstances under which it would actually happen are
remote. Consider, for example, a function that is labeled IMMUTABLE
but is really VOLATILE. Now maybe you can get the system to lock one
set of partitions and then initialize a different set of partitions. I
don't think we want to try to reason about what consequences that
might have and prove that somehow it's going to be OK; I think we want
to nail the door shut very tightly to make sure that it can't.
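The hazard can be sketched as a toy model (illustrative Python; all names are made up): a predicate wrongly labeled IMMUTABLE gives different answers in the locking pass and the initialization pass, leaving an initialized subplan whose relation was never locked.

```python
import itertools

# "Immutable"-labeled predicate that is secretly volatile: its answer
# for partition 0 flips between evaluations.
answers_for_part0 = itertools.cycle([False, True])

def mislabeled_immutable_keep(i):
    return next(answers_for_part0) if i == 0 else i == 2

def initial_prune(nparts, keep):
    # Stand-in for performing the "initial" pruning steps.
    return {i for i in range(nparts) if keep(i)}

locked = initial_prune(4, mislabeled_immutable_keep)       # pass 1: locks
initialized = initial_prune(4, mislabeled_immutable_keep)  # pass 2: ExecInit
unlocked_but_initialized = initialized - locked            # the bad set
```

Here pass 1 locks only partition 2, but pass 2 initializes partitions 0 and 2, so partition 0 is executed without ever having been locked.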
--
Robert Haas
EDB: http://www.enterprisedb.com
Thanks for taking the time to look at this.
On Wed, Jan 12, 2022 at 1:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking. The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions. I've attempted
implementing that idea in the attached patch.

Hmm. The first question that occurs to me is whether this is fully safe.
Currently, AcquireExecutorLocks calls LockRelationOid for every
relation involved in the query. That means we will probably lock at
least one relation on which we previously had no lock and thus
AcceptInvalidationMessages(). That will end up marking the query as no
longer valid and CheckCachedPlan() will realize this and tell the
caller to replan. In the corner case where we already hold all the
required locks, we will not accept invalidation messages at this
point, but must have done so after acquiring the last of the locks
required, and if that didn't mark the plan invalid, it can't be
invalid now either. Either way, everything is fine.

With the proposed patch, we might never lock some of the relations
involved in the query. Therefore, if one of those relations has been
modified in some way that would invalidate the plan, we will
potentially fail to discover this, and will use the plan anyway. For
instance, suppose there's one particular partition that has an extra
index and the plan involves an Index Scan using that index. Now
suppose that the scan of the partition in question is pruned, but
meanwhile, the index has been dropped. Now we're running a plan that
scans a nonexistent index. Admittedly, we're not running that part of
the plan. But is that enough for this to be safe? There are things
(like EXPLAIN or auto_explain) that we might try to do even on a part
of the plan tree that we don't try to run. Those things might break,
because for example we won't be able to look up the name of an index
in the catalogs for EXPLAIN output if the index is gone.

This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example. The point is that to my knowledge we have no existing
case where we try to use a plan that might be only partly valid, so if
we introduce one, there's some risk there. I thought for a while, too,
about whether changes to some object in a part of the plan that we're
not executing could break things for the rest of the plan even if we
never do anything with the plan but execute it. I can't quite see any
actual hazard. For example, I thought about whether we might try to
get the tuple descriptor for the pruned-away object and get a
different tuple descriptor than we were expecting. I think we can't,
because (1) the pruned object has to be a partition, and tuple
descriptors have to match throughout the partitioning hierarchy,
except for column ordering, which currently can't be changed
after-the-fact and (2) IIRC, the tuple descriptor is stored in the
plan and not reconstructed at runtime and (3) if we don't end up
opening the relation because it's pruned, then we certainly can't do
anything with its tuple descriptor. But it might be worth giving more
thought to the question of whether there's any other way we could be
depending on the details of an object that ended up getting pruned.
I have pondered on the possible hazards before writing the patch,
mainly because the concerns about a previously discussed proposal were
along similar lines [1].
IIUC, you're saying the plan tree is subject to inspection by non-core
code before ExecutorStart() has initialized a PlanState tree, which
must have discarded pruned portions of the plan tree. I wouldn't
claim to have scanned *all* of the core code that could possibly
access the invalidated portions of the plan tree, but from what I have
seen, I couldn't find any site that does. An ExecutorStart_hook()
gets to do that, but from what I can see it is expected to call
standard_ExecutorStart() before doing its thing and supposedly only
looks at the PlanState tree, which must be valid. Actually, EXPLAIN
also does ExecutorStart() before starting to look at the plan (the
PlanState tree), so must not run into pruned plan tree nodes. All
that said, it does sound like wishful thinking to say that no problems
can possibly occur.
At first, I had tried to implement this such that the
Append/MergeAppend nodes are edited to record the result of initial
pruning, but it felt wrong to be munging the plan tree in plancache.c.
Or, maybe this won't be a concern if performing ExecutorStart() is
made a part of CheckCachedPlan() somehow, which would then take locks
on the relation as the PlanState tree is built capturing any plan
invalidations, instead of AcquireExecutorLocks(). That does sound like
an ambitious undertaking though.
Note that "initial" pruning steps are now performed twice when
executing generic plans: once in AcquireExecutorLocks() to find
partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
determine the set of partition sub-nodes to be initialized for
execution, though I wasn't able to come up with a good idea to avoid
this duplication.

I think this is something that will need to be fixed somehow. Apart
from the CPU cost, it's scary to imagine that the set of nodes on
which we acquired locks might be different from the set of nodes that
we initialize. If we do the same computation twice, there must be some
non-zero probability of getting a different answer the second time,
even if the circumstances under which it would actually happen are
remote. Consider, for example, a function that is labeled IMMUTABLE
but is really VOLATILE. Now maybe you can get the system to lock one
set of partitions and then initialize a different set of partitions. I
don't think we want to try to reason about what consequences that
might have and prove that somehow it's going to be OK; I think we want
to nail the door shut very tightly to make sure that it can't.
Yeah, the premise of the patch is that "initial" pruning steps produce
the same result both times. I assumed that would be true because the
pruning steps are not allowed to contain any VOLATILE expressions.
Regarding the possibility that IMMUTABLE labeling of functions may be
incorrect, I haven't considered if the runtime pruning code can cope
or whether it should try to. If such a case does occur in practice,
the bad outcome would be an Assert failure in
ExecGetRangeTableRelation() or using a partition unlocked in the
non-assert builds, the latter of which feels especially bad.
--
Amit Langote
EDB: http://www.enterprisedb.com
[1]: /messages/by-id/CA+TgmoZN-80143F8OhN8Cn5-uDae5miLYVwMapAuc+7+Z7pyNg@mail.gmail.com
On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
I have pondered on the possible hazards before writing the patch,
mainly because the concerns about a previously discussed proposal were
along similar lines [1].
True. I think that the hazards are narrower with this proposal,
because if you *delay* locking a partition that you eventually need,
then you might end up trying to actually execute a portion of the plan
that's no longer valid. That seems like hopelessly bad news. On the
other hand, with this proposal, you skip locking altogether, but only
for parts of the plan that you don't plan to execute. That's still
kind of scary, but not to nearly the same degree.
IIUC, you're saying the plan tree is subject to inspection by non-core
code before ExecutorStart() has initialized a PlanState tree, which
must have discarded pruned portions of the plan tree. I wouldn't
claim to have scanned *all* of the core code that could possibly
access the invalidated portions of the plan tree, but from what I have
seen, I couldn't find any site that does. An ExecutorStart_hook()
gets to do that, but from what I can see it is expected to call
standard_ExecutorStart() before doing its thing and supposedly only
looks at the PlanState tree, which must be valid. Actually, EXPLAIN
also does ExecutorStart() before starting to look at the plan (the
PlanState tree), so must not run into pruned plan tree nodes. All
that said, it does sound like wishful thinking to say that no problems
can possibly occur.
Yeah. I don't think it's only non-core code we need to worry about
either. What if I just do EXPLAIN ANALYZE on a prepared query that
ends up pruning away some stuff? IIRC, the pruned subplans are not
shown, so we might escape disaster here, but FWIW if I'd committed
that code I would have pushed hard for showing those and saying "(not
executed)" .... so it's not too crazy to imagine a world in which
things work that way.
At first, I had tried to implement this such that the
Append/MergeAppend nodes are edited to record the result of initial
pruning, but it felt wrong to be munging the plan tree in plancache.c.
It is. You can't munge the plan tree: it's required to be strictly
read-only once generated. It can be serialized and deserialized for
transmission to workers, and it can be shared across executions.
Or, maybe this won't be a concern if performing ExecutorStart() is
made a part of CheckCachedPlan() somehow, which would then take locks
on the relation as the PlanState tree is built capturing any plan
invalidations, instead of AcquireExecutorLocks(). That does sound like
an ambitious undertaking though.
On the surface that would seem to involve abstraction violations, but
maybe that could be finessed somehow. The plancache shouldn't know too
much about what the executor is going to do with the plan, but it
could ask the executor to perform a step that has been designed for
use by the plancache. I guess the core problem here is how to pass
around information that is node-specific before we've stood up the
executor state tree. Maybe the executor could have a function that
does the pruning and returns some kind of array of results that can be
used both to decide what to lock and also what to consider as pruned
at the start of execution. (I'm hand-waving about the details because
I don't know.)
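The sketch above might look something like the following toy model (illustrative Python, not a proposed API): pruning runs exactly once, and the saved result feeds both lock collection and Append initialization, so the two sets cannot diverge even under a mislabeled predicate.

```python
def perform_initial_pruning(nsubplans, keep):
    # Hypothetical plancache-facing step: run the initial pruning
    # instructions once and save the surviving subplan indexes.
    return frozenset(i for i in range(nsubplans) if keep(i))

def relids_to_lock(subplan_relids, surviving):
    # AcquireExecutorLocks() consumes the saved result...
    return {subplan_relids[i] for i in surviving}

def exec_init_append(subplan_relids, surviving):
    # ...and ExecInit[Merge]Append() consumes the very same result,
    # instead of re-running the pruning steps.
    return [subplan_relids[i] for i in sorted(surviving)]

relids = [101, 102, 103, 104]
surviving = perform_initial_pruning(len(relids), lambda i: i in (1, 3))
```

Locking and initialization now agree by construction: both derive from `surviving`.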
Yeah, the premise of the patch is that "initial" pruning steps produce
the same result both times. I assumed that would be true because the
pruning steps are not allowed to contain any VOLATILE expressions.
Regarding the possibility that IMMUTABLE labeling of functions may be
incorrect, I haven't considered if the runtime pruning code can cope
or whether it should try to. If such a case does occur in practice,
the bad outcome would be an Assert failure in
ExecGetRangeTableRelation() or using a partition unlocked in the
non-assert builds, the latter of which feels especially bad.
Right. I think it's OK for a query to produce wrong answers under
those kinds of conditions - the user has broken everything and gets to
keep all the pieces - but doing stuff that might violate fundamental
assumptions of the system like "relations can only be accessed when
holding a lock on them" feels quite bad. It's not a stretch to imagine
that failing to follow those invariants could take the whole system
down, which is clearly too severe a consequence for the user's failure
to label things properly.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
Here are few comments for v1 patch:
Thanks Amul. I'm thinking about Robert's latest comments addressing
which may need some rethinking of this whole design, but I decided to
post a v2 taking care of your comments.
+ /* Caller error if we get here without contains_init_steps */
+ Assert(pruneinfo->contains_init_steps);

- prunedata = prunestate->partprunedata[i];
- pprune = &prunedata->partrelprunedata[0];

- /* Perform pruning without using PARAM_EXEC Params */
- find_matching_subplans_recurse(prunedata, pprune, true, &result);

+ if (parentrelids)
+ *parentrelids = NULL;

You got two blank lines after Assert.
Fixed.
--
+ /* Set up EState if not in the executor proper. */
+ if (estate == NULL)
+ {
+ estate = CreateExecutorState();
+ estate->es_param_list_info = params;
+ free_estate = true;
}

... [Skip]

+ if (free_estate)
+ {
+ FreeExecutorState(estate);
+ estate = NULL;
}

I think this work should be left to the caller.
Done. Also see below...
/*
* Stuff that follows matches exactly what ExecCreatePartitionPruneState()
* does, except we don't need a PartitionPruneState here, so don't call
* that function.
*
* XXX some refactoring might be good.
*/

+1, while doing it would be nice if foreach_current_index() is used
instead of the i & j sequence in the respective foreach() block, IMO.
Actually, I rewrote this part quite significantly so that most of the
code remains in its existing place. I decided to let
GetLockableRelations_walker() create a PartitionPruneState and pass
that to ExecFindInitialMatchingSubPlans() that is now left more or
less as is. Instead, ExecCreatePartitionPruneState() is changed to be
callable from outside the executor.
The temporary EState is no longer necessary. ExprContext,
PartitionDirectory, etc. are now managed in the caller,
GetLockableRelations_walker().
--
+ while ((i = bms_next_member(validsubplans, i)) >= 0)
+ {
+ Plan *subplan = list_nth(subplans, i);
+
+ context->relations =
+ bms_add_members(context->relations,
+ get_plan_scanrelids(subplan));
+ }

I think instead of get_plan_scanrelids() the
GetLockableRelations_worker() can be used; if so, then no need to add
get_plan_scanrelids() function.
You're right, done.
--
/* Nodes containing prunable subnodes. */
+ case T_MergeAppend:
+ {
+ PlannedStmt *plannedstmt = context->plannedstmt;
+ List *rtable = plannedstmt->rtable;
+ ParamListInfo params = context->params;
+ PartitionPruneInfo *pruneinfo;
+ Bitmapset *validsubplans;
+ Bitmapset *parentrelids;
...
if (pruneinfo && pruneinfo->contains_init_steps)
{
int i;
...
return false;
}
}
break;

Most of the declarations need to be moved inside the if-block.
Done.
Also, initially, I was a bit concerned regarding this code block
inside GetLockableRelations_worker(), what if (pruneinfo &&
pruneinfo->contains_init_steps) evaluated to false? After debugging I
realized that plan_tree_walker() will do the needful -- a bit of
comment would have helped.
You're right. Added a dummy else {} block with just the comment saying so.
+ case T_CustomScan:
+ foreach(lc, ((CustomScan *) plan)->custom_plans)
+ {
+ if (walker((Plan *) lfirst(lc), context))
+ return true;
+ }
+ break;

Why not plan_walk_members() call like other nodes?
Makes sense, done.
Again, most/all of this patch might need to be thrown away, but here
it is anyway.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
v2-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch (+563/-74)
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
Here are few comments for v1 patch:
Thanks Amul. I'm still thinking about Robert's latest comments,
addressing which may require rethinking this whole design, but I
decided to post a v2 taking care of your comments.
cfbot tells me there is an unused variable warning, which is fixed in
the attached v3.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
v3-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch (+561/-74)
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example.
I haven't followed this in any detail, but this patch and its goal of
reducing the O(N) drag effect on partition execution time is very
important. Locking a long list of objects that then get pruned is very
wasteful, as the results show.
Ideally, we want an O(1) algorithm for single partition access and DDL
is rare. So perhaps that is the starting point for a safe design -
invent a single lock or cache that allows us to check if the partition
hierarchy has changed in any way, and if so, replan, if not, skip
locks.
Please excuse me if this idea falls short, if so, please just note my
comment about how important this is. Thanks.
--
Simon Riggs http://www.EnterpriseDB.com/
Hi Simon,
On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example.
I haven't followed this in any detail, but this patch and its goal of
reducing the O(N) drag effect on partition execution time is very
important. Locking a long list of objects that then get pruned is very
wasteful, as the results show.
Ideally, we want an O(1) algorithm for single partition access and DDL
is rare. So perhaps that is the starting point for a safe design -
invent a single lock or cache that allows us to check if the partition
hierarchy has changed in any way, and if so, replan, if not, skip
locks.
Rearchitecting partition locking to be O(1) seems like a project of
non-trivial complexity as Robert mentioned in a related email thread
a couple of years ago:
/messages/by-id/CA+TgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ@mail.gmail.com
Pursuing that kind of a project would perhaps have been more
worthwhile if the locking issue had affected more than just this
particular case, that is, the case of running prepared statements over
partitioned tables using generic plans. Addressing this by
rearchitecting run-time pruning (and plancache to some degree) seemed
like it might lead to this getting fixed in a bounded timeframe. I
admit that the concerns that Robert has raised about the patch make me
want to reconsider that position, though maybe it's too soon to
conclude.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote:
Hi Simon,
On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example.
I haven't followed this in any detail, but this patch and its goal of
reducing the O(N) drag effect on partition execution time is very
important. Locking a long list of objects that then get pruned is very
wasteful, as the results show.
Ideally, we want an O(1) algorithm for single partition access and DDL
is rare. So perhaps that is the starting point for a safe design -
invent a single lock or cache that allows us to check if the partition
hierarchy has changed in any way, and if so, replan, if not, skip
locks.
Rearchitecting partition locking to be O(1) seems like a project of
non-trivial complexity as Robert mentioned in a related email thread
a couple of years ago:
/messages/by-id/CA+TgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ@mail.gmail.com
I agree, completely redesigning locking is a major project. But that
isn't what I suggested, which was to find an O(1) algorithm to solve
the safety issue. I'm sure there is an easy way to check one lock,
maybe a new one/new kind, rather than N.
Why does the safety issue exist? Why is it important to be able to
concurrently access parts of the hierarchy with DDL? Those are not
critical points.
If we asked them, most users would trade a 10x performance gain for
some restrictions on DDL. If anyone cares, make it an option, but most
people will use it.
Maybe force all DDL, or just DDL that would cause safety issues, to
update a hierarchy version number, so queries can tell whether they
need to replan. Don't know, just looking for an O(1) solution.
--
Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
Pursuing that kind of a project would perhaps have been more
worthwhile if the locking issue had affected more than just this
particular case, that is, the case of running prepared statements over
partitioned tables using generic plans. Addressing this by
rearchitecting run-time pruning (and plancache to some degree) seemed
like it might lead to this getting fixed in a bounded timeframe. I
admit that the concerns that Robert has raised about the patch make me
want to reconsider that position, though maybe it's too soon to
conclude.
I wasn't trying to say that your approach was dead in the water. It
does create a situation that can't happen today, and such things are
scary and need careful thought. But redesigning the locking mechanism
would need careful thought, too ... maybe even more of it than sorting
this out.
I do also agree with Simon that this is an important problem to which
we need to find some solution.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 7:28 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote:
On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
I haven't followed this in any detail, but this patch and its goal of
reducing the O(N) drag effect on partition execution time is very
important. Locking a long list of objects that then get pruned is very
wasteful, as the results show.
Ideally, we want an O(1) algorithm for single partition access and DDL
is rare. So perhaps that is the starting point for a safe design -
invent a single lock or cache that allows us to check if the partition
hierarchy has changed in any way, and if so, replan, if not, skip
locks.
Rearchitecting partition locking to be O(1) seems like a project of
non-trivial complexity as Robert mentioned in a related email thread
a couple of years ago:
/messages/by-id/CA+TgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ@mail.gmail.com
I agree, completely redesigning locking is a major project. But that
isn't what I suggested, which was to find an O(1) algorithm to solve
the safety issue. I'm sure there is an easy way to check one lock,
maybe a new one/new kind, rather than N.
I misread your email then, sorry.
Why does the safety issue exist? Why is it important to be able to
concurrently access parts of the hierarchy with DDL? Those are not
critical points.
If we asked them, most users would trade a 10x performance gain for
some restrictions on DDL. If anyone cares, make it an option, but most
people will use it.
Maybe force all DDL, or just DDL that would cause safety issues, to
update a hierarchy version number, so queries can tell whether they
need to replan. Don't know, just looking for an O(1) solution.
Yeah, it would be great if it would suffice to take a single lock on
the partitioned table mentioned in the query, rather than on all
elements of the partition tree added to the plan. AFAICS, ways to get
that are 1) Prevent modifying non-root partition tree elements, 2)
Make it so that locking a partitioned table becomes a proxy for having
locked all of its descendants, 3) Invent a Plan representation for
scanning partitioned tables such that adding the descendant tables
that survive plan-time pruning to the plan doesn't require locking
them too. IIUC, you've mentioned 1 and 2. I think I've seen 3
mentioned in the past discussions on this topic, but I guess the
research on whether that's doable has never been done.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
Pursuing that kind of a project would perhaps have been more
worthwhile if the locking issue had affected more than just this
particular case, that is, the case of running prepared statements over
partitioned tables using generic plans. Addressing this by
rearchitecting run-time pruning (and plancache to some degree) seemed
like it might lead to this getting fixed in a bounded timeframe. I
admit that the concerns that Robert has raised about the patch make me
want to reconsider that position, though maybe it's too soon to
conclude.
I wasn't trying to say that your approach was dead in the water. It
does create a situation that can't happen today, and such things are
scary and need careful thought. But redesigning the locking mechanism
would need careful thought, too ... maybe even more of it than sorting
this out.
Yes, agreed.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Wed, 19 Jan 2022 at 08:31, Amit Langote <amitlangote09@gmail.com> wrote:
Maybe force all DDL, or just DDL that would cause safety issues, to
update a hierarchy version number, so queries can tell whether they
need to replan. Don't know, just looking for an O(1) solution.
Yeah, it would be great if it would suffice to take a single lock on
the partitioned table mentioned in the query, rather than on all
elements of the partition tree added to the plan. AFAICS, ways to get
that are 1) Prevent modifying non-root partition tree elements,
Can we reuse the concept of Strong/Weak locking here?
When a DDL request is in progress (for that partitioned table), take
all required locks for safety. When a DDL request is not in progress,
take minimal locks knowing it is safe.
We can take a single PartitionTreeModificationLock, nowait to prove
that we do not need all locks. DDL would request the lock in exclusive
mode. (Other mechanisms possible).
--
Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jan 13, 2022 at 3:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
Or, maybe this won't be a concern if performing ExecutorStart() is
made a part of CheckCachedPlan() somehow, which would then take locks
on the relation as the PlanState tree is built capturing any plan
invalidations, instead of AcquireExecutorLocks(). That does sound like
an ambitious undertaking though.
On the surface that would seem to involve abstraction violations, but
maybe that could be finessed somehow. The plancache shouldn't know too
much about what the executor is going to do with the plan, but it
could ask the executor to perform a step that has been designed for
use by the plancache. I guess the core problem here is how to pass
around information that is node-specific before we've stood up the
executor state tree. Maybe the executor could have a function that
does the pruning and returns some kind of array of results that can be
used both to decide what to lock and also what to consider as pruned
at the start of execution. (I'm hand-waving about the details because
I don't know.)
The attached patch implements this idea. Sorry for the delay in
getting this out and thanks to Robert for the off-list discussions on
this.
So the new executor "step" you mention is the function ExecutorPrep in
the patch, which calls a recursive function ExecPrepNode on the plan
tree's top node, much as ExecutorStart calls (via InitPlan)
ExecInitNode to construct a PlanState tree for actual execution
paralleling the plan tree.
For now, ExecutorPrep() / ExecPrepNode() does mainly two things if and
as it walks the plan tree: 1) Extract the RT indexes of RTE_RELATION
entries and add them to a bitmapset in the result struct, 2) If the
node contains a PartitionPruneInfo, perform its "initial pruning
steps" and store the result of doing so in a per-plan-node node called
PlanPrepOutput. The bitmapset and the array containing per-plan-node
PlanPrepOutput nodes are returned in a node called ExecPrepOutput,
which is the result of ExecutorPrep, to its calling module (say,
plancache.c), which, after it's done using that information, must pass
it forward to subsequent execution steps. That is done by passing it,
via the module's callers, to CreateQueryDesc() which remembers the
ExecPrepOutput in QueryDesc that is eventually passed to
ExecutorStart().
A bunch of other details are mentioned in the patch's commit message,
which I'm pasting below for anyone reading to spot any obvious flaws
(no-go's) of this approach:
Invent a new executor "prep" phase
The new phase, implemented by execMain.c:ExecutorPrep() and its
recursive underling execProcnode.c:ExecPrepNode(), takes a query's
PlannedStmt and processes the plan tree contained in it to produce
an ExecPrepOutput node as result.
As the plan tree is walked, each node must add the RT index(es) of
any relation(s) that it directly manipulates to a bitmapset member of
ExecPrepOutput (for example, an IndexScan node must add the Scan's
scanrelid). Also, each node may want to make a PlanPrepOutput node
containing additional information that may be of interest to the
calling module or to the later execution phases, if the node can
provide one (for example, an Append node may perform initial pruning
and add a set of "initially valid subplans" to the PlanPrepOutput).
The PlanPrepOutput nodes of all the plan nodes are added to an array
in the ExecPrepOutput, which is indexed using the individual nodes'
plan_node_id; a NULL is stored in the array slots of nodes that
don't have anything interesting to add to the PlanPrepOutput.
The ExecPrepOutput thus produced is passed to CreateQueryDesc()
and subsequently to ExecutorStart() via QueryDesc, which then makes
it available to the executor routines via the query's EState.
The main goal of adding this new phase is, for now, to allow cached
generic plans containing scans of partitioned tables using
Append/MergeAppend to be executed more efficiently by the prep phase
doing any initial pruning, instead of deferring that to
ExecutorStart(). That may allow AcquireExecutorLocks() on the plan
to lock only the minimal set of relations/partitions, that is
those whose subplans survive the initial pruning.
Implementation notes:
* To allow initial pruning to be done as part of the pre-execution
prep phase as opposed to as part of ExecutorStart(), this refactors
ExecCreatePartitionPruneState() and ExecFindInitialMatchingSubPlans()
to pass the information needed to do initial pruning directly as
parameters instead of getting that from the EState and the PlanState
of the parent Append/MergeAppend, both of which would not be
available in ExecutorPrep(). Another, sort of non-essential-to-this-
goal, refactoring this does is moving the partition pruning
initialization stanzas in ExecInitAppend() and ExecInitMergeAppend()
both of which contain the same code, into a new function
ExecInitPartitionPruning().
* To pass the ExecPrepOutput(s) created by the plancache module's
invocation of ExecutorPrep() to the callers of the module, which in
turn would pass them down to ExecutorStart(), CachedPlan gets a new
List field that stores those ExecPrepOutputs, containing one element
for each PlannedStmt also contained in the CachedPlan. The new list
is stored in a child context of the context containing the
PlannedStmts, though unlike the latter, it is reset on every
invocation of CheckCachedPlan(), which in turn calls ExecutorPrep()
with a new set of bound Params.
* AcquireExecutorLocks() is now made to loop over a bitmapset of RT
indexes, those of relations returned in ExecPrepOutput, instead of
over the whole range table. With initial pruning that is also done
as part of ExecutorPrep(), only relations from non-pruned nodes of
the plan tree would get locked as a result of this new arrangement.
* PlannedStmt gets a new field usesPrepExecPruning that indicates
whether any of the nodes of the plan tree contain "initial" (or
"pre-execution") pruning steps, which saves ExecutorPrep() the
trouble of walking the plan tree only to find out whether that's
the case.
* PartitionPruneInfo nodes now explicitly store whether the steps
contained in any of the individual PartitionedRelPruneInfos embedded
in it contain initial pruning steps (those that can be performed
during ExecutorPrep) and execution pruning steps (those that can only
be performed during ExecutorRun), as flags contains_initial_steps and
contains_exec_steps, respectively. In fact, the aforementioned
PlannedStmt field's value is a logical OR of the values of the former
across all PartitionPruneInfo nodes embedded in the plan tree.
* PlannedStmt also gets a bitmapset field to store the RT indexes of
all relation RTEs referenced in the query that is populated when
constructing the flat range table in setrefs.c, which effectively
contains all the relations that the planner must have locked. In the
case of a cached plan, AcquireExecutorLocks() must lock all of those
relations, except those whose subnodes get pruned as a result of
ExecutorPrep().
* PlannedStmt gets yet another field numPlanNodes that records the
highest plan_node_id assigned to any of the nodes contained in the
tree, which serves as the size to use when allocating the
PlanPrepOutput array.
Maybe this should be more than one patch? Say:
0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility
Thoughts?
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
v4-0001-Invent-a-new-executor-prep-phase.patch (+1866/-286)
On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote:
Maybe this should be more than one patch? Say:
0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility
Could be, not sure. I agree that if it's possible to split this in a
meaningful way, it would facilitate review. I notice that there is
some straight code movement e.g. the creation of
ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do
pure code movement in a preparatory patch so that the main patch is
just adding the new stuff we need and not moving stuff around.
David Rowley recently proposed a patch for some parallel-safety
debugging cross checks which added a plan tree walker. I'm not sure
whether he's going to press that patch forward to commit, but I think
we should get something like that into the tree and start using it,
rather than adding more bespoke code. Maybe you/we should steal that
part of his patch and commit it separately. What I'm imagining is that
plan_tree_walker() would know which nodes have subnodes and how to
recurse over the tree structure, and you'd have a walker function to
use with it that would know which executor nodes have ExecPrep
functions and call them, and just do nothing for the others. That
would spare you adding stub functions for nodes that don't need to do
anything, or don't need to do anything other than recurse. Admittedly
it would look a bit different from the existing executor phases, but
I'd argue that it's a better coding model.
Actually, you might've had this in the patch at some point, because
you have a declaration for plan_tree_walker but no implementation. I
guess one thing that's a bit awkward about this idea is that in some
cases you want to recurse to some subnodes but not other subnodes. But
maybe it would work to put the recursion in the walker function in
that case, and then just return true; but if you want to walk all
children, return false.
+ bool contains_init_steps;
+ bool contains_exec_steps;
s/steps/pruning/? maybe with contains -> needs or performs or requires as well?
+ * Returned information includes the set of RT indexes of relations referenced
+ * in the plan, and a PlanPrepOutput node for each node in the planTree if the
+ * node type supports producing one.
Aren't all RT indexes referenced in the plan?
+ * This may lock relations whose information may be used to produce the
+ * PlanPrepOutput nodes. For example, a partitioned table before perusing its
+ * PartitionPruneInfo contained in an Append node to do the pruning the result
+ * of which is used to populate the Append node's PlanPrepOutput.
"may lock" feels awfully fuzzy to me. How am I supposed to rely on
something that "may" happen? And don't we need to have tight logic
around locking, with specific guarantees about what is locked at which
points in the code and what is not?
+ * At least one of 'planstate' or 'econtext' must be passed to be able to
+ * successfully evaluate any non-Const expressions contained in the
+ * steps.
This also seems fuzzy. If I'm thinking of calling this function, I
don't know how I'd know whether this criterion is met.
I don't love PlanPrepOutput the way you have it. I think one of the
basic design issues for this patch is: should we think of the prep
phase as specifically pruning, or is it general prep and pruning is
the first thing for which we're going to use it? If it's really a
pre-pruning phase, we could name it that way instead of calling it
"prep". If it's really a general prep phase, then why does
PlanPrepOutput contain initially_valid_subnodes as a field? One could
imagine letting each prep function decide what kind of prep node it
would like to return, with partition pruning being just one of the
options. But is that a useful generalization of the basic concept, or
just pretending that a special-purpose mechanism is more general than
it really is?
+ return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */
It seems to me that we should do what the XXX suggests. It doesn't
seem nice if the parallel workers could theoretically decide to prune
a different set of nodes than the leader.
+ * known at executor startup (excludeing expressions containing
Extra e.
+ * into subplan indexes, is also returned for use during subsquent
Missing e.
Somewhere, we're going to need to document the idea that this may
permit us to execute a plan that isn't actually fully valid, but that
we expect to survive because we'll never do anything with the parts of
it that aren't. Maybe that should be added to the executor README, or
maybe there's some better place, but I don't think that should remain
something that's just implicit.
This is not a full review, just some initial thoughts looking through this.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-02-10 17:13:52 +0900, Amit Langote wrote:
The attached patch implements this idea. Sorry for the delay in
getting this out and thanks to Robert for the off-list discussions on
this.
I did not follow this thread at all. And I only skimmed the patch. So I'm
probably wrong.
I'm wary of this increasing executor overhead even in cases it won't
help. Without this patch, for simple queries, I see small allocations
noticeably in profiles. This adds a bunch more, even if
!context->stmt->usesPreExecPruning:
- makeNode(ExecPrepContext)
- makeNode(ExecPrepOutput)
- palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes)
- stmt_execprep_list = lappend(stmt_execprep_list, execprep);
- AllocSetContextCreate(CurrentMemoryContext,
"CachedPlan execprep list", ...
- ...
That's a lot of extra for something that's already a bottleneck.
Greetings,
Andres Freund
(just catching up on this thread)
On Thu, 13 Jan 2022 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:
Yeah. I don't think it's only non-core code we need to worry about
either. What if I just do EXPLAIN ANALYZE on a prepared query that
ends up pruning away some stuff? IIRC, the pruned subplans are not
shown, so we might escape disaster here, but FWIW if I'd committed
that code I would have pushed hard for showing those and saying "(not
executed)" .... so it's not too crazy to imagine a world in which
things work that way.
FWIW, that would remove the whole point in init run-time pruning. The
reason I made two phases of run-time pruning was so that we could get
away from having the init plan overhead of nodes we'll never need to
scan. If we wanted to show the (never executed) scans in EXPLAIN then
we'd need to do the init plan part and allocate all that memory
needlessly.
Imagine a hash partitioned table on "id" with 1000 partitions. The user does:
PREPARE q1 (INT) AS SELECT * FROM parttab WHERE id = $1;
EXECUTE q1(123);
Assuming a generic plan, if we didn't have init pruning then we have
to build a plan containing the scans for all 1000 partitions. There's
significant overhead to that compared to just locking the partitions,
and initialising 1 scan.
If it worked this way then we'd be even further from Amit's goal of
reducing the overhead of starting plan with run-time pruning nodes.
I understood at the time it was just the EXPLAIN output that you had
concerns with. I thought that was just around the lack of any display
of the condition we used for pruning.
David