track needed attributes in plan nodes for executor use
Hi,
I’ve been experimenting with an optimization that reduces executor
overhead by avoiding unnecessary attribute deformation. Specifically,
if the executor knows which attributes are actually needed by a plan
node’s targetlist and qual, it can skip deforming unused columns
entirely.
In a proof-of-concept patch, I initially computed the needed
attributes during ExecInitSeqScan by walking the plan’s qual and
targetlist to support deforming only what’s needed when evaluating
expressions in ExecSeqScan() or the variant thereof (I started with
SeqScan to keep the initial patch minimal). However, adding more work
to ExecInit* adds to executor startup cost, which we should generally
try to reduce. It also makes it harder to apply the optimization
uniformly across plan types.
I’d now like to propose computing the needed attributes at planning
time instead. This can be done at the bottom of create_plan_recurse,
after the plan node has been constructed. A small helper like
record_needed_attrs(plan) can walk the node’s targetlist and qual
using pull_varattnos() and store the result in a new Bitmapset
*attr_used field in the Plan struct. System attributes returned by
pull_varattnos() can be filtered out during this step, since they're
either not relevant to deformation or not performance sensitive.
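For illustration, the helper for the scan-node case might look roughly like
the sketch below (record_needed_attrs and the attr_used field are only
proposed names at this point; everything else is existing API):

#include "postgres.h"

#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "nodes/plannodes.h"
#include "optimizer/optimizer.h"

/*
 * Sketch only: collect the user attributes referenced by a scan node's
 * targetlist and qual into the proposed plan->attr_used field.
 */
static void
record_needed_attrs(Plan *plan, Index scanrelid)
{
    Bitmapset  *attrs = NULL;
    Bitmapset  *user_attrs = NULL;
    int         i = -1;

    /* pull_varattnos() offsets attnos by FirstLowInvalidHeapAttributeNumber */
    pull_varattnos((Node *) plan->targetlist, scanrelid, &attrs);
    pull_varattnos((Node *) plan->qual, scanrelid, &attrs);

    while ((i = bms_next_member(attrs, i)) >= 0)
    {
        AttrNumber  attno = i + FirstLowInvalidHeapAttributeNumber;

        /* drop system attributes (attno < 0) and whole-row references (0) */
        if (attno > 0)
            user_attrs = bms_add_member(user_attrs, attno);
    }

    plan->attr_used = user_attrs;   /* proposed new Plan field */
    bms_free(attrs);
}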
This also lays the groundwork for a related executor-side optimization
that David Rowley suggested to me off-list. The idea is to remember,
in the TupleDesc, either the attribute number or the byte offset of
the first variable-length attribute. Then, if the minimum required
attribute (as provided by attr_used) lies before that, the executor
can safely jump directly to it using the cached offset, rather than
starting deformation from attno 0 as it currently does. That avoids
walking through fixed-length attributes that aren't needed --
specifically, skipping per-attribute alignment, null checking, and
offset tracking for unused columns -- which reduces CPU work and
avoids loading irrelevant tuple bytes into cache.
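To make that concrete, here's a rough sketch of the kind of check the
deforming code could make before its main loop. It leans on the existing
attcacheoff field rather than a new TupleDesc field, and the helper name and
caller wiring are invented for illustration; the real thing may well look
different:

/*
 * Sketch only: decide where slot_deform_heap_tuple() could start, given the
 * first attribute the current plan node actually needs.  Returns a 0-based
 * attribute index and sets *startoff to the matching offset into the tuple
 * data area; falls back to attribute 0 whenever the jump can't be proven
 * safe.
 */
static int
deform_start_attnum(TupleDesc tupleDesc, HeapTuple tuple,
                    AttrNumber first_needed, uint32 *startoff)
{
    Form_pg_attribute att;

    *startoff = 0;

    /*
     * Be conservative: only attempt the jump when the tuple has no nulls at
     * all, so the preceding attributes are certainly present.
     */
    if (first_needed <= 1 || HeapTupleHasNulls(tuple))
        return 0;

    att = TupleDescAttr(tupleDesc, first_needed - 1);

    /*
     * attcacheoff is filled in (lazily, by the regular deforming code) only
     * when every earlier attribute is fixed-width, which is exactly the case
     * where the starting offset is the same for every tuple.
     */
    if (att->attcacheoff >= 0)
    {
        *startoff = (uint32) att->attcacheoff;
        return first_needed - 1;
    }

    return 0;
}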
With both patches in place, heap tuple deforming can skip over unused
attributes entirely. For example, on a 30-column table where the first
15 columns are fixed-width, the query:
select sum(a_1) from foo where a_10 = $1;
which references only two fixed-width columns, ran nearly 2x faster
with the optimization in place (with heap pages prewarmed into
shared_buffers).
In more complex plans, for example those involving a Sort or Join
between the scan and aggregation, the CPU cost of the intermediate
node may dominate, making deforming-related savings at the top less
visible in overall performance. Still, I don't think that's a reason
to avoid enabling this optimization more broadly across plan nodes.
I'll post the PoC patches and performance measurements. I'm posting this
in advance to get feedback on the proposed direction and where best to
place attr_used.
--
Thanks,
Amit Langote
On Fri, 11 Jul 2025 at 17:16, Amit Langote <amitlangote09@gmail.com> wrote:
This also lays the groundwork for a related executor-side optimization
that David Rowley suggested to me off-list. The idea is to remember,
in the TupleDesc, either the attribute number or the byte offset of
the first variable-length attribute. Then, if the minimum required
attribute (as provided by attr_used) lies before that, the executor
can safely jump directly to it using the cached offset, rather than
starting deformation from attno 0 as it currently does.
That's interesting. If I understand correctly, this approach wouldn't work if
the first attribute is variable-length, right?
--
Regards,
Japin Li
On 11/7/2025 10:16, Amit Langote wrote:
Hi,
I’ve been experimenting with an optimization that reduces executor
overhead by avoiding unnecessary attribute deformation. Specifically,
if the executor knows which attributes are actually needed by a plan
node’s targetlist and qual, it can skip deforming unused columns
entirely.
Sounds promising. However, I'm not sure we're on the same page. Do you
mean by the proposal an optimisation of slot_deform_heap_tuple() by
providing it with a bitmapset of requested attributes? In this case,
tuple header requires one additional flag to indicate a not-null, but
unfilled column, to detect potential issues.
In a proof-of-concept patch, I initially computed the needed
attributes during ExecInitSeqScan by walking the plan’s qual and
targetlist to support deforming only what’s needed when evaluating
expressions in ExecSeqScan() or the variant thereof (I started with
SeqScan to keep the initial patch minimal). However, adding more work
to ExecInit* adds to executor startup cost, which we should generally
try to reduce. It also makes it harder to apply the optimization
uniformly across plan types.
I'm not sure if a lot of work will be added. However, cached generic
plan execution should avoid any unnecessary overhead.
I’d now like to propose computing the needed attributes at planning
time instead. This can be done at the bottom of create_plan_recurse,
after the plan node has been constructed. A small helper like
record_needed_attrs(plan) can walk the node’s targetlist and qual
using pull_varattnos() and store the result in a new Bitmapset
*attr_used field in the Plan struct. System attributes returned by
pull_varattnos() can be filtered out during this step, since they're
either not relevant to deformation or not performance sensitive.
Why did you choose the Plan node? It seems to be relevant only to Scan
nodes. Does it mean an extension of the CustomScan API?
With both patches in place, heap tuple deforming can skip over unused
attributes entirely. For example, on a 30-column table where the first
15 columns are fixed-width, the query:
select sum(a_1) from foo where a_10 = $1;
which references only two fixed-width columns, ran nearly 2x faster
with the optimization in place (with heap pages prewarmed into
shared_buffers).
It may be profitable. However, I often encounter cases where a table has
20-40 columns with arbitrarily mixed fixed- and variable-width columns,
and fetching a column by index in a 30-something-column table is painful.
In this area, Postgres may gain more by making order_qual_clauses()
account for the column number in its cost -- in [1] I attempted to
explain how and why that should work.

[1] https://open.substack.com/pub/danolivo/p/on-expressions-reordering-in-postgres
--
regards, Andrei Lepikhov
On Fri, Jul 11, 2025 at 6:58 PM Japin Li <japinli@hotmail.com> wrote:
That's interesting. If I understand correctly, this approach wouldn't work if
the first attribute is variable-length, right?
That is correct.
--
Thanks, Amit Langote
Thanks for the comments.
On Fri, Jul 11, 2025 at 11:09 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 11/7/2025 10:16, Amit Langote wrote:
Hi,
I’ve been experimenting with an optimization that reduces executor
overhead by avoiding unnecessary attribute deformation. Specifically,
if the executor knows which attributes are actually needed by a plan
node’s targetlist and qual, it can skip deforming unused columns
entirely.

Sounds promising. However, I'm not sure we're on the same page. Do you
mean by the proposal an optimisation of slot_deform_heap_tuple() by
providing it with a bitmapset of requested attributes? In this case,
tuple header requires one additional flag to indicate a not-null, but
unfilled column, to detect potential issues.
Not quite -- the optimization doesn’t require changes to the tuple
header or representation. The existing deforming code already stops
once all requested attributes are filled, using tts_nvalid to track
that. What I’m proposing is to additionally allow the slot to skip
ahead to the first needed attribute, rather than always starting
deformation from attno 0. That lets us avoid alignment/null checks for
preceding fixed-width attributes that are guaranteed to be unused.
To support that efficiently, the slot can store a new tts_min_valid
field to indicate the lowest attno that needs deforming.
Alternatively, we could use a per-attribute flag array (with
TupleDesc->natts elements), though that adds some memory and
complexity. The first option seems simpler and should be sufficient in
most cases.
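For illustration, wiring that up at scan-node initialization could be as
simple as the sketch below (tts_min_valid and attr_used are the proposed
fields; nothing else new is needed):

/* e.g. at the end of ExecInitSeqScan(); "node" is the SeqScanState */
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
Bitmapset  *attr_used = node->ss.ps.plan->attr_used;    /* proposed Plan field */

if (!bms_is_empty(attr_used))
    slot->tts_min_valid = bms_next_member(attr_used, -1);  /* lowest needed attno */
else
    slot->tts_min_valid = 1;    /* nothing known; deform from the start as today */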
In a proof-of-concept patch, I initially computed the needed
attributes during ExecInitSeqScan by walking the plan’s qual and
targetlist to support deforming only what’s needed when evaluating
expressions in ExecSeqScan() or the variant thereof (I started with
SeqScan to keep the initial patch minimal). However, adding more work
to ExecInit* adds to executor startup cost, which we should generally
try to reduce. It also makes it harder to apply the optimization
uniformly across plan types.

I'm not sure if a lot of work will be added. However, cached generic
plan execution should avoid any unnecessary overhead.
True, and that's exactly why I moved the computation from ExecInit* to
the planner. Doing it during plan construction ensures we avoid the
cost even in generic plan execution, which wouldn’t benefit if the
work were deferred to executor startup.
I’d now like to propose computing the needed attributes at planning
time instead. This can be done at the bottom of create_plan_recurse,
after the plan node has been constructed. A small helper like
record_needed_attrs(plan) can walk the node’s targetlist and qual
using pull_varattnos() and store the result in a new Bitmapset
*attr_used field in the Plan struct. System attributes returned by
pull_varattnos() can be filtered out during this step, since they're
either not relevant to deformation or not performance sensitive.

Why did you choose the Plan node? It seems to be relevant only to Scan
nodes. Does it mean an extension of the CustomScan API?
It’s true that the biggest win is for Scan nodes, since that’s where
the tuple is fetched from storage and first deformed. But upper nodes
like Agg also deform tuples to evaluate expressions. For example, in a
plan like Agg over Sort over SeqScan, the Agg node will receive
MinimalTuples from the Sort and need to deform them to extract just
the attributes required for aggregation. So the optimization could
help there too.
I wasn’t quite sure what you meant about the CustomScan API, could you
elaborate?
With both patches in place, heap tuple deforming can skip over unused
attributes entirely. For example, on a 30-column table where the first
15 columns are fixed-width, the query:
select sum(a_1) from foo where a_10 = $1;
which references only two fixed-width columns, ran nearly 2x faster
with the optimization in place (with heap pages prewarmed into
shared_buffers).

It may be profitable. However, I often encounter cases where a table has
20-40 columns with arbitrarily mixed fixed- and variable-width columns,
and fetching a column by index in a 30-something-column table is painful.
In this area, Postgres may gain more by making order_qual_clauses()
account for the column number in its cost -- in [1] I attempted to
explain how and why that should work.

[1] https://open.substack.com/pub/danolivo/p/on-expressions-reordering-in-postgres
Thanks, Andrei. Yes, I agree that clause ordering to minimize
deformation cost is a worthwhile idea and I appreciate the pointer to
your post. This patch aims to eliminate unnecessary work mechanically,
without depending on clause order or planner heuristics. It's
motivated by recent discussions I've been following with interest about
improving the CPU characteristics of execution, especially by shaving
off predictable overheads in tight loops like tuple deformation.
--
Thanks, Amit Langote
On 14/7/2025 06:52, Amit Langote wrote:
To support that efficiently, the slot can store a new tts_min_valid
field to indicate the lowest attno that needs deforming.
Alternatively, we could use a per-attribute flag array (with
TupleDesc->natts elements), though that adds some memory and
complexity. The first option seems simpler and should be sufficient in
most cases.
I'm not sure. Typically, people don't optimise the order of columns, and
it seems to me that necessary columns can be found both at the beginning
of the table (like the primary key) and at the end. I believe it's best
to skip any unused columns. However, I haven't seen your patch yet, so I
can't talk about the effect.
I wasn’t quite sure what you meant about the CustomScan API, could you
elaborate?
I was thinking that custom logic might require some columns that are not
detected in the target list or qualifications. Therefore, there should
be a method to provide the core with a list of the necessary columns.
--
regards, Andrei Lepikhov
Amit Langote <amitlangote09@gmail.com> writes:
Not quite -- the optimization doesn’t require changes to the tuple
header or representation. The existing deforming code already stops
once all requested attributes are filled, using tts_nvalid to track
that. What I’m proposing is to additionally allow the slot to skip
ahead to the first needed attribute, rather than always starting
deformation from attno 0. That lets us avoid alignment/null checks for
preceding fixed-width attributes that are guaranteed to be unused.
I'm quite skeptical about this being a net win. You could only skip
deformation for attributes that are both fixed-width and
guaranteed-not-null. Having a lot of those at the start may be true
in our system catalogs (because we have other reasons to lay them out
that way) but I doubt it occurs often in user tables. So I'm afraid
that this would eat more in planning time than it'd usually save in
practice.
I'm also bothered by the assumption that the planner has full
knowledge of which attributes will be used at run-time. I don't
believe that the plan tree contains every Var reference that will
occur during execution. Triggers, CHECK constraints, FK constraints,
etc are all things that aren't in the plan tree.
regards, tom lane
Thanks for the thoughts, Tom.
On Mon, Jul 14, 2025 at 11:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Langote <amitlangote09@gmail.com> writes:
Not quite -- the optimization doesn’t require changes to the tuple
header or representation. The existing deforming code already stops
once all requested attributes are filled, using tts_nvalid to track
that. What I’m proposing is to additionally allow the slot to skip
ahead to the first needed attribute, rather than always starting
deformation from attno 0. That lets us avoid alignment/null checks for
preceding fixed-width attributes that are guaranteed to be unused.I'm quite skeptical about this being a net win. You could only skip
deformation for attributes that are both fixed-width and
guaranteed-not-null. Having a lot of those at the start may be true
in our system catalogs (because we have other reasons to lay them out
that way) but I doubt it occurs often in user tables. So I'm afraid
that this would eat more in planning time than it'd usually save in
practice.
That’s fair, and I agree that a fixed-not-null prefix is not a common
pattern across all user schemas, and our handling of dropped columns
only makes that less likely. Still, I think it’s worth exploring this
optimization in the context of OLAP-style workloads, where the
executor processes large volumes of tuples and per-tuple CPU
efficiency can matter. In practice, users often copy operational data
into separate OLAP tables to gain performance, designing those tables
with specific layouts in mind (for example, wide tables with
fixed-width keys near the front followed by varlena columns). There is
a good deal of public guidance -- including talks, blog posts, and
vendor materials -- that promotes that pattern. Users adopting it, and
even those promoting it, might not realize that tuple deforming
overhead remains a bottleneck despite their schema work. But we have
seen that it can be, and I think we now have a reasonably clean way to
mitigate that.
The example I showed, with 15 fixed-width columns followed by varlena
ones, was meant to demonstrate that deformation cost is mechanically
avoidable in some cases, not because we expect that exact schema to be
common. For instance, in that example, ExecInterpExpr() can account
for 70% of runtime in perf profiles of a backend running SELECT
sum(col_10) FROM foo WHERE col_1 = $1, most of which is spent in
slot_getsomeattrs_int() (62%) -- that is, on HEAD without the patch. With the PoC
patch applied, total time in ExecInterpExpr() drops to 36%, and
slot_getsomeattrs_int() accounts for only 18%.
I'm also bothered by the assumption that the planner has full
knowledge of which attributes will be used at run-time. I don't
believe that the plan tree contains every Var reference that will
occur during execution. Triggers, CHECK constraints, FK constraints,
etc are all things that aren't in the plan tree.
Right, I agree that plan-time knowledge does not cover everything.
This optimization is not aimed at mechanisms like triggers or
constraints, which may access attributes outside the Plan tree. More
importantly, those mechanisms are not part of the hot executor loop I
am trying to optimize as mentioned above.
That said, computing the needed attribute set in the executor might
turn out to be more extensible in practice now that I think about it.
Once a TupleTableSlot has been populated during plan execution,
expressions that read from it -- including those outside the plan tree
-- can potentially benefit. For example, ModifyTable reuses the same
slot populated by its subplan when performing per-row operations like
CHECK constraint evaluation and trigger firing. Planner-side analysis
would miss such uses, but executor-side computation naturally covers
them. So while my current goal is just to improve performance for
plan-node expression evaluation, executor-side analysis could
naturally extend the benefit to other deforming paths without extra
effort. In contrast, planner-side analysis is inherently limited to
the Plan tree.
Thanks again for the feedback.
--
Thanks, Amit Langote
Just a quick historical note:
Back in 2016 [1], Andres had raised similar concerns about executor
overhead from deforming unneeded columns, but at the time
(pre-ExprEvalStep), expression evaluation wasn’t yet structured enough
for that overhead to show up clearly in profiles. I see that Tom
replied then too that the CPU cost of deforming extra columns would
likely be lost in the noise.
That may have been fair back then, but I don’t think it holds anymore.
With today’s step-driven ExecInterpExpr(), perf profiles of even
simple OLAP queries like:
SELECT sum(col1) FROM tbl WHERE col30 = 123;
show substantial time spent in slot_getsomeattrs_int() and related
deforming code. This is with the table fully cached, so we’re talking
pure CPU overhead. Skipping deformation of columns not referenced in
quals or targetlists can materially reduce runtime in such cases.
Also, I forgot to mention in my earlier email that I’m proposing this
work partly based on recent off-list discussions with Andres. I think
the cost-benefit tradeoff is different now and worth reevaluating,
even if only some real-world schemas end up benefiting.
So, I’ll go off and prototype a version where the needed attributes
are collected during expression tree initialization, in a generic
enough way that any expression whose evaluation might involve a
deforming step will benefit.
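Roughly, I'm imagining something like the walker below, run while each
ExprState is built, next to the existing logic that already tracks the
highest attno needed per slot. All names here are placeholders, and the real
version would have to distinguish scan, inner, and outer Vars:

#include "postgres.h"

#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

/* Hypothetical bookkeeping filled in during expression initialization. */
typedef struct NeededAttrsContext
{
    AttrNumber  min_attno;      /* caller initializes to MaxAttrNumber */
    AttrNumber  max_attno;      /* caller initializes to 0 */
} NeededAttrsContext;

static bool
needed_attrs_walker(Node *node, NeededAttrsContext *cxt)
{
    if (node == NULL)
        return false;
    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        /*
         * Ignore system attributes and whole-row references; a real version
         * would also separate scan/inner/outer Vars by looking at varno.
         */
        if (var->varattno > 0)
        {
            cxt->min_attno = Min(cxt->min_attno, var->varattno);
            cxt->max_attno = Max(cxt->max_attno, var->varattno);
        }
        return false;
    }
    return expression_tree_walker(node, needed_attrs_walker, (void *) cxt);
}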
--
Thanks, Amit Langote
[1]: /messages/by-id/20160722015605.hpthk7axm6sx2mur@alap3.anarazel.de
On Mon, Jul 14, 2025 at 5:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 14/7/2025 06:52, Amit Langote wrote:
To support that efficiently, the slot can store a new tts_min_valid
field to indicate the lowest attno that needs deforming.
Alternatively, we could use a per-attribute flag array (with
TupleDesc->natts elements), though that adds some memory and
complexity. The first option seems simpler and should be sufficient in
most cases.

I'm not sure. Typically, people don't optimise the order of columns, and
it seems to me that necessary columns can be found both at the beginning
of the table (like the primary key) and at the end. I believe it's best
to skip any unused columns. However, I haven't seen your patch yet, so I
can't talk about the effect.
Yeah, I agree that skipping arbitrary unused columns would be ideal.
For now though, I’m focusing on the fixed-not-null prefix case since
it’s easy to exploit with minimal runtime overhead -- if the minimum
needed attno is after a block of fixed-width not null columns, we can
skip deforming those with a single offset jump. Supporting arbitrary
column skipping would need something like a per-attribute flags array,
which we might need someday, but I wanted to start with something simpler.
I wasn’t quite sure what you meant about the CustomScan API, could you
elaborate?

I was thinking that custom logic might require some columns that are not
detected in the target list or qualifications. Therefore, there should
be a method to provide the core with a list of the necessary columns.
I think I’m starting to understand the point. It’s not about the core
planner expressions, but about making sure that any expression
evaluated inside the custom scan node also has its needed attributes
marked. So even if the core plan doesn’t reference certain Vars, the
deforming logic still needs to know about them if the custom scan will
be evaluating expressions that access them. I’m still trying to fully
wrap my head around how that fits into the overall expression setup
and deformation path, but I agree that the custom node should have a
way to inform the executor about those needs.
--
Thanks, Amit Langote