Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

Started by James Hunter, about 1 year ago. 26 messages.
#1 James Hunter
james.hunter.pg@gmail.com

I want customers to be able to run large OLAP queries on PostgreSQL,
using as much memory as possible, to avoid spilling — without running
out of memory.

There are other ways to run out of memory, but the fastest and easiest
way, on an OLAP query, is to use a lot of work_mem. (This is true for
any SQL database: SQL operators are “usually” streaming operators...
except for those that use work_mem.) PostgreSQL already supports the
work_mem GUC, and every PostgreSQL operator tries very hard to spill
to disk rather than exceed its work_mem limit. For now, I’m not
concerned about other ways for queries to run out of memory — just
work_mem.

I like the way PostgreSQL operators respect work_mem, but I can’t find
a good way to set the work_mem GUC. Oracle apparently had the same
problem, with their RDBMS, 20 years ago [1]:

“In releases earlier than Oracle Database 10g, the database
administrator controlled the maximum size of SQL work areas by setting
the following parameters: SORT_AREA_SIZE, HASH_AREA_SIZE, ... Setting
these parameters is difficult, because the maximum work area size is
ideally selected from the data input size and the total number of work
areas active in the system. These two factors vary greatly from one
work area to another and from one time to another. Thus, the various
*_AREA_SIZE parameters are difficult to tune under the best of
circumstances.

“For this reason, Oracle strongly recommends that you leave automatic
PGA memory management enabled.”

It’s difficult to tune PostgreSQL’s work_mem and hash_mem_multiplier
GUCs, under the best of circumstances, yeah. The work_mem and
hash_mem_multiplier GUCs apply to all operators of a given type, even
though two operators of the same type, even in the same query, might
need vastly different amounts of work_mem.

I would like a “query_work_mem” GUC, similar to what’s proposed in
[2]: it would be easier to tune than the existing work_mem +
hash_mem_multiplier GUCs;
and it would serve as a milestone on a path to my ultimate goal of
something like Oracle’s “automatic PGA memory management.”

I call it “query_work_mem,” rather than “max_total_backend_memory,”
because (a) for now, I care only about limiting work_mem, I’ll deal
with other types of memory separately; and (b) “query” instead of
“backend” avoids ambiguity over how much memory a recursively-compiled
query gets.

(Re (b), see “crosstab()” [3]. The “sql text” executed by crosstab()
would get its own query_work_mem allocation, separate from the query
that called the crosstab() function.)

The main problem I have with the “max_total_backend_memory” proposal,
however, is that it “enforces” its limit by killing the offending
query. This seems an overreaction to me, especially since PostgreSQL
operators already know how to spill to disk. If a customer’s OLAP
query exceeds its memory limit by 1%, I would rather spill 1% of their
data to disk, instead of cancelling their entire query.

(And if their OLAP query exceeds its memory limit by 1,000x... I still
don’t want PostgreSQL to preemptively cancel it, because either the
customer ends up OK with the overall performance, even with the
spilling; or else they decide the query is taking too long, and cancel
it themselves. I don’t want to be in the business of preemptively
cancelling customer queries.)

So, I want a query_work_mem GUC, and I want PostgreSQL to distribute
that total query_work_mem to the query’s individual SQL operators, so
that each operator will spill rather than exceed its per-operator
limit.

Making query_work_mem a session GUC makes it feasible for a DBA or an
automated system to distribute memory from a global memory pool among
individual queries, e.g. via pg_hint_plan(). So (as mentioned above),
“query_work_mem” is useful to a DBA, and also a step toward a
fully-automated memory-management system.

How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem. Suppose we’re pretty sure that one of the Hash
Joins will use 10 KB of work_mem, while the other will use 1 GB. And
suppose we know that the PostgreSQL instance has 1 GB of memory
available, for use by our OLAP query. (Perhaps we reserve 1 GB for
OLAP queries, and we allow only 1 OLAP query at a time; or perhaps we
have some sort of dynamic memory manager.)

How should we configure PostgreSQL so that our OLAP query spills as
little as possible, without running out of memory?

-- First, we could just say: 2 operators, total available working
memory is 1 GB — give each operator 512 MB. Then the large Hash Join
would spill around 512 MB, while nearly all of the small Hash Join’s
512 MB allotment goes to waste. We’re undersubscribing, to be safe,
but our performance suffers. That’s bad! We’re basically wasting
memory that the query would like to use.

-- Second, we could say, instead: the small Hash Join is *highly
unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
expecting that the small Hash Join won’t use more than 1 MB of its
1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
oversubscribing, betting that the small Hash Join will just stay
within some smaller, “unenforced” memory limit.

In this example, this bet is probably fine — but it won’t work in
general. I don’t want to be in the business of gambling with customer
resources: if the small Hash Join is unlikely to use more than 1 MB,
then let’s just assign it 1 MB of work_mem. That way, if I’m wrong,
the customer’s query will just spill, instead of running out of
memory. I am very strongly opposed to cancelling queries if/when we
can just spill to disk.

-- Third, we could just rewrite the existing “work_mem” logic so that
all of the query’s operators draw, at runtime, from a single,
“query_work_mem” pool. So, an operator won’t spill until all of
“query_work_mem” is exhausted — by the operator itself, or by some
other operator in the same query.

But doing that runs into starvation/fairness problems, where an
unlucky operator, executing later in the query, can’t get any
query_work_mem, because earlier, greedy operators used up all of it.

The solution I propose here is just to distribute the “query_work_mem”
into individual, per-operator, work_mem limits.

**Proposal:**

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

To enable this to-be-described algorithm, I would add an “nbytes”
field to the Path struct, and display this (and related info) in
EXPLAIN PLAN. So the customer will be able to see how much work_mem
the SQL compiler thinks they’ll need, per operator; and so will the
algorithm.

I wouldn’t change the existing planning logic (at least not in the
initial implementation). So the existing planning logic would choose
between different SQL operators, still on the assumption that every
operator that needs working memory will get work_mem [*
hash_mem_multiplier]. This assumption might understate or overstate
the actual working memory we’ll give the operator, at runtime. If it
understates, the planner will be biased in favor of operators that
don’t use much working memory. If it overstates, the planner will be
biased in favor of operators that use too much working memory.

(We could add a feedback loop to the planner, or even something simple
like generating multiple paths, at different “work_mem” limits, but
everything I can think of here adds complexity without much potential
benefit. So I would defer any changes to the planner behavior until
later, if ever.)

The to-be-described algorithm would look at a query’s Paths’ “nbytes”
fields, as well as the session “work_mem” GUC (which would, now, serve
as a hint to the SQL compiler), and decide how much of
“query_work_mem” to assign to the corresponding Plan node.

It would assign that limit to a new “work_mem” field, on the Plan
node. And this limit would also be exposed, of course, in EXPLAIN
ANALYZE, along with the actual work_mem usage, which might very well
exceed the limit. This will let the customer know when a query spills,
and why.

I would write the algorithm to maintain the existing work_mem
behavior, as much as possible. (Backward compatibility is good!) Most
likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
*minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
work_mem [* hash_mem_multiplier], for all operators in the query,
all operators would be assigned at least work_mem, which would make my
proposal a Pareto improvement.
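To make the “minimum plus surplus” idea concrete, here is a rough sketch (Python for brevity; the function name and the proportional-surplus rule are my own illustration, not the to-be-described algorithm): every operator is guaranteed its work_mem floor, and any remaining query_work_mem is split in proportion to how far each operator’s “nbytes” estimate exceeds that floor.

```python
def distribute_query_work_mem(query_work_mem, nbytes, work_mem_floor):
    """Distribute a per-query budget into per-operator limits.

    Every operator is guaranteed at least its work_mem floor (the
    backward-compatibility property); any surplus is split among
    operators in proportion to how far their planner estimates
    ("nbytes") exceed that floor.
    """
    n = len(nbytes)
    limits = [work_mem_floor] * n
    surplus = query_work_mem - n * work_mem_floor
    if surplus <= 0:
        return limits  # floors alone exhaust the budget; operators spill
    extra_demand = [max(0, est - work_mem_floor) for est in nbytes]
    total_extra = sum(extra_demand)
    if total_extra == 0:
        return limits  # no operator wants more than the floor
    for i, e in enumerate(extra_demand):
        limits[i] += min(e, surplus * e // total_extra)
    return limits

# The 2-Hash-Join example: a 1 GB budget, estimates of 10 KB and 1 GB,
# with a 4 MB work_mem floor. The small join is capped at its floor,
# and almost the whole budget goes to the large join.
limits = distribute_query_work_mem(1 << 30, [10 * 1024, 1 << 30], 4 << 20)
```

Under this sketch, as long as query_work_mem is at least n * work_mem, every operator receives at least work_mem, which matches the Pareto-improvement claim above.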

Last, at runtime, each PlanState would check its plan->work_mem
field, rather than the global work_mem GUC. Execution would otherwise
be the same as today.

What do you think?

James

[1]: https://docs.oracle.com/en//database/oracle/oracle-database/23/admin/managing-memory.html#GUID-8D7FC70A-56D8-4CA1-9F74-592F04172EA7
[2]: /messages/by-id/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel@crunchydata.com
[3]: https://www.postgresql.org/docs/current/tablefunc.html

#2 Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#1)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:

How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem.

So we plan first, and then assign available memory afterward? If we do
it that way, then the costing will be inaccurate, because the original
costs are based on the original work_mem.

It may be better than killing the query, but not ideal.

-- Second, we could say, instead: the small Hash Join is *highly
unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
expecting that the small Hash Join won’t use more than 1 MB of its
1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
oversubscribing, betting that the small Hash Join will just stay
within some smaller, “unenforced” memory limit.

In this example, this bet is probably fine — but it won’t work in
general. I don’t want to be in the business of gambling with customer
resources: if the small Hash Join is unlikely to use more than 1 MB,
then let’s just assign it 1 MB of work_mem.

I like this idea. Operators that either know they don't need much
memory, or estimate that they don't need much memory, can constrain
themselves. That would protect against misestimations and advertise to
the higher levels of the planner how much memory the operator actually
wants. Right now, the planner doesn't know which operators need a lot
of memory and which ones don't need any significant amount at all.

The challenge, of course, is what the higher levels of the planner
would do with that information, which goes to the rest of your
proposal. But tracking the information seems very reasonable to me.

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

The description above sounds too "top-down" to me. That may work, but
has the disadvantage that costing has already happened. We should also
consider:

* Reusing the path generation infrastructure so that both "high memory"
and "low memory" paths can be considered, and if a path requires too
much memory in aggregate, then it would be rejected in favor of a path
that uses less memory. This feels like it fits within the planner
architecture the best, but it also might lead to a path explosion, so
we may need additional controls.

* Some kind of negotiation where the top level of the planner finds
that the plan uses too much memory, and replans some or all of it. (I
think this is similar to what you described as the "feedback loop" later in
your email.) I agree that this is complex and may not have enough
benefit to justify.

Regards,
Jeff Davis

#3 Tomas Vondra
tomas@vondra.me
In reply to: Jeff Davis (#2)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On 1/21/25 22:26, Jeff Davis wrote:

On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:

How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem.

So we plan first, and then assign available memory afterward? If we do
it that way, then the costing will be inaccurate, because the original
costs are based on the original work_mem.

It may be better than killing the query, but not ideal.

-- Second, we could say, instead: the small Hash Join is *highly
unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
expecting that the small Hash Join won’t use more than 1 MB of its
1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
oversubscribing, betting that the small Hash Join will just stay
within some smaller, “unenforced” memory limit.

In this example, this bet is probably fine — but it won’t work in
general. I don’t want to be in the business of gambling with customer
resources: if the small Hash Join is unlikely to use more than 1 MB,
then let’s just assign it 1 MB of work_mem.

I like this idea. Operators that either know they don't need much
memory, or estimate that they don't need much memory, can constrain
themselves. That would protect against misestimations and advertise to
the higher levels of the planner how much memory the operator actually
wants. Right now, the planner doesn't know which operators need a lot
of memory and which ones don't need any significant amount at all.

I'm not sure I like the idea that much.

At first restricting the operator to the amount the optimizer predicts
will be needed seems reasonable, because that's generally the best idea
of memory usage we have without running the query.

But these estimates are often pretty fundamentally unreliable - maybe
not for simple examples, but once you put an aggregate on top of a join,
the errors can be pretty wild. And allowing the operator to still use
more work_mem makes this more adaptive. I suspect forcing the operator
to adhere to the estimated work_mem might make this much worse (but I
haven't tried, maybe spilling to temp files is not that bad).

The challenge, of course, is what the higher levels of the planner
would do with that information, which goes to the rest of your
proposal. But tracking the information seems very reasonable to me.

I agree. Tracking additional information seems like a good idea, but
it's not clear to me how the planner would use this. I can imagine
various approaches - e.g. we might do the planning as usual and then
distribute the query_work_mem between the nodes in proportion to the
estimated amount of memory. But it all seems like very ad hoc
heuristics, easy to confuse into making the wrong decision.

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

The description above sounds too "top-down" to me. That may work, but
has the disadvantage that costing has already happened. We should also
consider:

* Reusing the path generation infrastructure so that both "high memory"
and "low memory" paths can be considered, and if a path requires too
much memory in aggregate, then it would be rejected in favor of a path
that uses less memory. This feels like it fits within the planner
architecture the best, but it also might lead to a path explosion, so
we may need additional controls.

* Some kind of negotiation where the top level of the planner finds
that the plan uses too much memory, and replans some or all of it. (I
think this is similar to what you described as the "feedback loop" later in
your email.) I agree that this is complex and may not have enough
benefit to justify.

Right, it seems rather at odds with the bottom-up construction of paths.
The amount of memory an operator may use seems like a pretty fundamental
piece of information, but if it's available only after the whole plan is built,
that seems ... not great.

I don't know if generating (and keeping) low/high-memory paths is quite
feasible. Isn't that really a continuum for many paths? A hash join may
need very little memory (with batching) or a lot of memory (if keeping
everything in memory), so how would this work? Would we generate paths
for a range of work_mem values (with different costs)?

regards

--
Tomas Vondra

#4 Tomas Vondra
tomas@vondra.me
In reply to: James Hunter (#1)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On 1/10/25 19:00, James Hunter wrote:

...

**Proposal:**

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

To enable this to-be-described algorithm, I would add an “nbytes”
field to the Path struct, and display this (and related info) in
EXPLAIN PLAN. So the customer will be able to see how much work_mem
the SQL compiler thinks they’ll need, per operator; and so will the
algorithm.

I wouldn’t change the existing planning logic (at least not in the
initial implementation). So the existing planning logic would choose
between different SQL operators, still on the assumption that every
operator that needs working memory will get work_mem [*
hash_mem_multiplier].

All this seems generally feasible, but it hinges on whether the to-be-described
algorithm can distribute the memory in a sensible way that doesn't
affect the costing too much. If we plan a hash join with nbatch=1, and
then come back and make it use nbatch=1024, maybe we wouldn't have
used hash join at all. Not sure.

The fundamental issue seems to be not having any information about how
much memory might be available to the operator. And in principle we
can't have that during the bottom-up part of the planning, until after
we construct the whole plan. Only at that point do we know how many
operators will need work_mem.

Could we get at least some initial estimate how many such operators the
query *might* end up using? Maybe that'd be just enough a priori
information to set the effective work_mem for the planning part to make
this practical.
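One crude way to get that a priori hint (the function name and the clamping policy below are my own illustration, not something specified in the thread): before planning, divide the per-query budget by a rough count of memory-using operators, e.g. derived from the number of base relations, and plan with the result as the effective work_mem.

```python
def effective_planning_work_mem(query_work_mem, est_mem_operators,
                                min_work_mem=64 * 1024):
    """Guess a planning-time work_mem from the per-query budget and a
    rough a priori count of memory-using operators. Clamped below so a
    wild overestimate of the operator count can't drive the effective
    work_mem down to something useless."""
    if est_mem_operators <= 0:
        return query_work_mem
    return max(min_work_mem, query_work_mem // est_mem_operators)
```

For example, a 1 GB budget with an a priori guess of 4 memory-using operators would plan with a 256 MB effective work_mem.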

This assumption might understate or overstate
the actual working memory we’ll give the operator, at runtime. If it
understates, the planner will be biased in favor of operators that
don’t use much working memory. If it overstates, the planner will be
biased in favor of operators that use too much working memory.

I'm not quite sure I understand this part. Could you elaborate?

(We could add a feedback loop to the planner, or even something simple
like generating multiple paths, at different “work_mem” limits, but
everything I can think of here adds complexity without much potential
benefit. So I would defer any changes to the planner behavior until
later, if ever.)

What would be the feedback? I can imagine improving the estimate of how
much memory a given operator needs during the bottom-up phase, but it
doesn't quite help with knowing what will happen above the current node.

The to-be-described algorithm would look at a query’s Paths’ “nbytes”
fields, as well as the session “work_mem” GUC (which would, now, serve
as a hint to the SQL compiler), and decide how much of
“query_work_mem” to assign to the corresponding Plan node.

It would assign that limit to a new “work_mem” field, on the Plan
node. And this limit would also be exposed, of course, in EXPLAIN
ANALYZE, along with the actual work_mem usage, which might very well
exceed the limit. This will let the customer know when a query spills,
and why.

I would write the algorithm to maintain the existing work_mem
behavior, as much as possible. (Backward compatibility is good!) Most
likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
*minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
work_mem [* hash_mem_multiplier], for all operators in the query,
all operators would be assigned at least work_mem, which would make my
proposal a Pareto improvement.

Last, at runtime, each PlanState would check its plan->work_mem
field, rather than the global work_mem GUC. Execution would otherwise
be the same as today.

What do you think?

I find it a bit hard to discuss an abstract proposal, without knowing
the really crucial ingredient. It might be helpful to implement some
sort of PoC of this approach, I'm sure that'd give us a lot of insights
and means to experiment with it (instead of just speculating about what
might happen).

regards

--
Tomas Vondra

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#3)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Wed, 2025-01-22 at 21:48 +0100, Tomas Vondra wrote:

But these estimates are often pretty fundamentally unreliable - maybe
not for simple examples, but once you put an aggregate on top of a join,
the errors can be pretty wild.

It would be conditional on whether there's some kind of memory
constraint or not. Setting aside the difficulty of implementing a new
memory constraint, if we assume there is one, then it would be good to
know how much memory an operator estimates that it needs.

(Also, if extra memory is available, spill files will be able to use
the OS filesystem cache, which mitigates the spilling cost.)

Another thing that would be good to know is about concurrent memory
usage. That is, if it's a blocking executor node, then it can release
all the memory from child nodes when it completes. Therefore the
concurrent memory usage might be less than just the sum of memory used
by all operators in the plan.
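That observation could be modeled, very roughly, as follows (a deliberately simplified sketch, not anything proposed in the thread; real operators, e.g. a hash join that holds its table while probing, don't release memory this cleanly): a pipelined node is live together with all of its children, while a blocking node is assumed to consume and release its children one at a time.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    mem: int                 # working memory this operator holds
    blocking: bool = False   # e.g. Sort: inputs released once consumed
    children: list = field(default_factory=list)

def peak_concurrent_mem(node):
    """Peak memory under a simplified model: a pipelined node runs
    concurrently with all of its children; a blocking node consumes its
    children sequentially, releasing each child's memory before the
    next one starts."""
    child_peaks = [peak_concurrent_mem(c) for c in node.children]
    if not child_peaks:
        return node.mem
    if node.blocking:
        return node.mem + max(child_peaks)
    return node.mem + sum(child_peaks)
```

Under this model, a blocking node needing 10 units over two subtrees peaking at 50 and 70 units peaks at 10 + 70 = 80 units, rather than the 130 you would get by summing all operators in the plan.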

I don't know if generating (and keeping) low/high-memory paths is quite
feasible. Isn't that really a continuum for many paths? A hash join may
need very little memory (with batching) or a lot of memory (if keeping
everything in memory), so how would this work? Would we generate paths
for a range of work_mem values (with different costs)?

A range might cause too much of an explosion. Let's do something simple
like define "low" to mean 1/16th, or have a separate low_work_mem GUC
(that could be an absolute number or a fraction).

There are a few ways we could pass the information down. We could just
have every operator generate twice as many paths (at least those
operators that want to use as much memory as they can get). Or we could
pass down the query_work_mem by subtracting the current operator's
memory needs and dividing what's left among its input paths.

We may have to track extra information to make sure that high-memory
paths don't dominate low-memory paths that are still useful (similar to
path keys).
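One way to track that, sketched here on toy dicts rather than real Path structs (so the shapes and names are illustrative assumptions): treat memory as a second dominance dimension, the same way pathkeys keep an otherwise-worse path alive.

```python
def dominates(a, b):
    """a dominates b only if it is no more expensive AND needs no more
    memory; a cheap-but-memory-hungry path and a slower-but-lean path
    therefore both survive."""
    return a["cost"] <= b["cost"] and a["mem"] <= b["mem"]

def add_path(paths, new):
    """Keep 'new' unless some existing path dominates it; if kept,
    discard any existing paths that 'new' now dominates."""
    if any(dominates(p, new) for p in paths):
        return paths
    return [p for p in paths if not dominates(new, p)] + [new]
```

So a fast path needing 1000 units of memory and a slow path needing 10 coexist, while a third path that is both slower and hungrier than the lean one is discarded.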

Regards,
Jeff Davis

#6 James Hunter
james.hunter.pg@gmail.com
In reply to: Tomas Vondra (#4)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Wed, Jan 22, 2025 at 1:13 PM Tomas Vondra <tomas@vondra.me> wrote:

On 1/10/25 19:00, James Hunter wrote:

...
I wouldn’t change the existing planning logic (at least not in the
initial implementation). So the existing planning logic would choose
between different SQL operators, still on the assumption that every
operator that needs working memory will get work_mem [*
hash_mem_multiplier].

All this seems generally feasible, but it hinges on whether the to-be-described
algorithm can distribute the memory in a sensible way that doesn't
affect the costing too much. If we plan a hash join with nbatch=1, and
then come back and make it use nbatch=1024, maybe we wouldn't have
used hash join at all. Not sure.

I see two problems:
(a) For a given query plan, minimizing spilling without exceeding
memory limits (and thus crashing);
(b) For a given amount of available memory, choosing the optimal plan.

Both problems exist today, and PostgreSQL offers the same tools to
address both: work_mem and hash_mem_multiplier. I argue that these
tools are inadequate for (a), but I think they work reasonably well
for (b).

I propose to solve problem (a), but not (b).

In your example, the reason PostgreSQL plans a Hash Join, with nbatch
= 1, is because the planner's "nbytes" estimate for working memory is
< (hash_mem_multiplier * work_mem). This implicitly assumes that the
PostgreSQL *instance* has at least that memory available.

If it turns out that the instance doesn't have that much memory
available, then the Hash Join will crash. That's the current behavior.

It would be better if, instead, we used nbatch=1024 for the Hash Join,
so we *didn't* crash. (Note that your example implies that work_mem is
set to 1,024x available memory!) This is problem (a). But then, as you
point out, it might be *even better* if we gave up on Hash Join
altogether, and just went with Nested Loop. This is problem (b).

Today, "work_mem" can be set too high or too low. I argue that there's
no way to avoid one or the other, and:
1. If "work_mem" is set too high -- PostgreSQL currently crashes. With
my proposal, it (a) would not crash, but (b) would possibly execute a
sub-optimal plan.
2. If "work_mem" is set too low -- PostgreSQL currently (a) spills
unnecessarily, or (b) chooses a sub-optimal plan. With my proposal, it
would (a) not spill unnecessarily, but would (b) still execute a
sub-optimal plan (if chosen).

I am not proposing to solve (b), the generation of optimal plans, when
memory constraints are a priori unknown. I see that as a separate,
lower-priority problem -- one that would require multi-pass
compilation, which I would like to avoid.

The fundamental issue seems to be not having any information about how
much memory might be available to the operator. And in principle we
can't have that during the bottom-up part of the planning, until after
we construct the whole plan. Only at that point do we know how many
operators will need work_mem.

It's not just bottom-up vs. top-down: it's multi-pass. I think you
would need something beyond Oracle's Cost-Based Query Transformation
[1], up front, to get an estimate for the total working-memory requested for
each tree. Then we could distribute "query_work_mem" to each tree, and
then compute the costs once we knew how much working memory each path
node on the tree would actually get.

The algorithm described in the previous paragraph is certainly
*possible*, but it's not remotely *feasible*, in general, because it
requires generating way too many states to cost. For one thing,
instead of costing path nodes individually, it has to cost entire
trees; and there are far more trees than there are tree nodes. For
another: today, PostgreSQL's optimizer goes out of its way not to
generate obviously-bad path nodes, but in the above algorithm, there's
no way to know whether a path is bad until after you've generated the
entire tree.

Anyway, to come up with something feasible, we'd have to apply
heuristics to prune trees, etc. And it's not clear to me that, after
all of that code complexity and run-time CPU cost, the result would be
any better than just leaving the optimizer as it is. "Better the devil
you know..." etc.

My proposal doesn't try to solve (b), instead relying on the customer
to provide a reasonable "work_mem" estimate, for use by the optimizer.

Could we get at least some initial estimate how many such operators the
query *might* end up using? Maybe that'd be just enough a priori
information to set the effective work_mem for the planning part to make
this practical.

You could use the # of base relations, as a proxy for # of joins, but
I am not convinced this would improve the optimizer's decision.

This assumption might understate or overstate
the actual working memory we’ll give the operator, at runtime. If it
understates, the planner will be biased in favor of operators that
don’t use much working memory. If it overstates, the planner will be
biased in favor of operators that use too much working memory.

I'm not quite sure I understand this part. Could you elaborate?

I was just trying to express what you more clearly restated: if
work_mem is too high, vs. the actual memory available on the instance,
then the optimizer will choose Hash Join, even though the optimal
choice might be Nested Loop.

(We could add a feedback loop to the planner, or even something simple
like generating multiple paths, at different “work_mem” limits, but
everything I can think of here adds complexity without much potential
benefit. So I would defer any changes to the planner behavior until
later, if ever.)

What would be the feedback? I can imagine improving the estimate of how
much memory a given operator needs during the bottom-up phase, but it
doesn't quite help with knowing what will happen above the current node.

Something like the algorithm I sketched above, in this email, might
work. Of course, it would have to be modified with heuristics, etc.,
to reduce the state space to something manageable...

But my point is just that any feedback loop, or running the optimizer
at different "work_mem" limits, is a bad idea. Leaving the optimizer
as it is seems the least bad choice.

Thanks for your comments,
James

[1]: https://dl.acm.org/doi/10.5555/1182635.1164215

#7 James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#2)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, Jan 21, 2025 at 1:26 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:

How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem.

So we plan first, and then assign available memory afterward? If we do
it that way, then the costing will be inaccurate, because the original
costs are based on the original work_mem.

It may be better than killing the query, but not ideal.

As you point out, the outcome is better, but not ideal. My intuition
is that an "ideal" solution would increase query compilation times
beyond what customers would accept...

But at least the outcome, if not ideal, is better than killing the
query! So it is a net improvement.

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

The description above sounds too "top-down" to me. That may work, but
has the disadvantage that costing has already happened. We should also
consider:

* Reusing the path generation infrastructure so that both "high memory"
and "low memory" paths can be considered, and if a path requires too
much memory in aggregate, then it would be rejected in favor of a path
that uses less memory. This feels like it fits within the planner
architecture the best, but it also might lead to a path explosion, so
we may need additional controls.

* Some kind of negotiation where the top level of the planner finds
that the plan uses too much memory, and replans some or all of it. (I
think this is similar to what you described as the "feedback loop" later in
your email.) I agree that this is complex and may not have enough
benefit to justify.

Generating "high memory" vs. "low memory" paths would be tricky,
because the definition of "high" vs. "low" depends on the entire path
tree, not just on a single path node. So I think it would quickly lead
to a state-space explosion, as you mention.

And I think negotiation has the same problem: it's based on the entire
tree, not just an individual path node. I think the general problem is
not so much "top-down" vs. "bottom-up", as "individual path node" vs.
"entire path tree." Today, PostgreSQL costs each path node
individually, by referring to the static "work_mem" GUC. In any
attempt to improve the optimizer's choice, I think we'd have to cost
the entire path tree. And there are many more trees than there are
tree nodes.

For example, the decision whether to prefer a Nested Loop vs. a Hash
Join that takes 2 MB of working memory, depends on what the query's
other joins are doing.

At any rate, I think we can solve the problem of "killing the query"
now; and then worry, in the future, about the ideal solution of how to
pick the optimal execution plan.

Regards,
Jeff Davis

Thanks for your comments!
James

#8Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#7)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:

Generating "high memory" vs. "low memory" paths would be tricky,
because the definition of "high" vs. "low" depends on the entire path
tree, not just on a single path node. So I think it would quickly lead
to a state-space explosion, as you mention.

At first, it appears to lead to an explosion, but there are a lot of
ways to prune early. Many operators, like an index scan, don't even
need to track memory, so they'd just have the one path. Other operators
can just generate a low memory path because estimates show that it's
unlikely to need more than that. And if there's a blocking operator,
then that resets the memory requirement, pruning the space further.

And I assume you are talking about analytic queries with reasonably
large values of work_mem anyway. That justifies a bit more planning
time -- no need to generate extra paths for cheap queries.

Maybe my idea doesn't work out, but I think it's too early to dismiss
it.

Regards,
Jeff Davis

#9James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#8)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Fri, Jan 24, 2025 at 5:48 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:

Generating "high memory" vs. "low memory" paths would be tricky,
because the definition of "high" vs. "low" depends on the entire path
tree, not just on a single path node. So I think it would quickly lead
to a state-space explosion, as you mention.

At first, it appears to lead to an explosion, but there are a lot of
ways to prune early. ...

Maybe my idea doesn't work out, but I think it's too early to dismiss
it.

I think it makes sense to split the work into two parts: one part that
improves SQL execution, and a second part that improves the optimizer,
to reflect the improvements to execution.

It seems better to me to wait until we have the ability to enforce
memory limits, before worrying about ways to generate different paths
with different memory limits. Then we would be able to tune the
optimizer heuristics based on the actual executor, instead of
extrapolating how the executor would behave under different memory
limits.

James

#10James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#9)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

I hope to have an initial patch-set for a prototype, within the next
couple of weeks. But I wanted to add some design comments to this
thread, first, to solicit feedback, etc. —

First, some bookkeeping: Peter Geoghegan pointed me, offline, to
Oracle’s 2002 paper [1] on how they managed SQL execution memory in
9i. I found it helpful to compare my proposal to what Oracle did. The
main difference I see is that Oracle modified their SQL operators to
“give back” memory, at runtime, when the resource manager reduces the
per-operator memory limit. Doing this causes its own problems, but it
allows Oracle to maintain a single “per-operator” memory limit that
applies to *all* operators; see Figure 6.

I am reluctant to make a similar change to PostgreSQL, because (1) it
would involve a lot of code churn, and (2) it’s not clear to me that
this is a good path to take. Note that the Oracle design allows total
memory to exceed the global memory limit, temporarily, while the
system waits for running operators to give their memory back. So, the
paper describes how Oracle tries to anticipate this situation and
reduce the per-operator memory limit in advance... but I have not had
a good experience with that sort of strategy, in the cloud.

The Oracle design necessarily overprovisions some operators, because
it assigns the same limit to all operators. (See, again, Figure 6,
which makes all of this clearer than anything I could write.) It
relies on detecting when an overprovisioned operator starts to use
more of the memory it was provisioned... and then quickly reducing the
per-operator limit, so that other operators give up their memory for
use by the previously-overprovisioned operator. In this way, the
Oracle design is very fair.

However, while waiting for the other operators to give up their memory
(since they are now oversubscribed), the system temporarily exceeds
the global memory limit. This opens up a can of worms, but it seems
like the Oracle paper deals with this situation by letting the
excessive memory swap to disk (see Figures 10 and 11).

I don’t want to modify PostgreSQL operators so they can give up memory
at runtime. So this forces my solution to do two things: (1) provide
different operators different memory limits, since I can’t take memory
away from an operator after it has started running; and (2) give each
operator (at least) an initial memory reservation, before it starts
running. Hence, the approach I described earlier in this thread.

Second, some motivation: the cloud makes the resource management
problem worse than it is on-premise. I would refer to page 2 of the
Oracle doc (too long to quote here), as justification for moving away
from the “work_mem” GUC, but note that these arguments apply more
strongly to cloud databases, for two reasons. First reason: swap can
be prohibitively expensive in the cloud, and spilling is very
expensive. This is because cloud instances frequently lack attached,
ephemeral storage. Cloud remote storage can be extremely slow [2]: “For
example, a gp2 volume under 1,000 GiB with burst credits available has
... a volume throughput limit of 250 MiB/s.”

Second reason: any cloud provider has an effectively infinite number
of customer instances. I mean that this number is large enough that
the cloud provider cannot afford to manage these instances, except via
automated tools. So, when the Oracle paper says, “Generally, the DBA
tries to avoid over-allocation by assuming the worst workload in order
to avoid paging (with dramatic degradation in performance) or query
failure,” keep in mind that the degradation is even more dramatic in
the cloud, and the cost of under-utilization is higher.

Also: “In most commercial systems the burden has been put on the
DBA to provide an optimal setting for configuration parameters that
are internally used to decide how much memory to allocate to a given
database operator. This is a challenging task for the DBA...” This is
an impossible task for the cloud provider!

Thanks,
James

[1]: https://www.vldb.org/conf/2002/S29P03.pdf
[2]: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-characteristics.html#ebs-io-size-throughput-limits


#11Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#9)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, 2025-02-10 at 19:09 -0800, James Hunter wrote:

I think it makes sense to split the work into two parts: one part that
improves SQL execution, and a second part that improves the optimizer,
to reflect the improvements to execution.

I like the idea to store the value of work_mem in the
path/plan/executor nodes, and use that at execution time rather than
the GUC directly.

IIUC, that would allow an extension to do what you want, right? A
planner hook could just walk the tree and edit those values for
individual nodes, and the executor would enforce them.

Regards,
Jeff Davis

#12James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#11)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, Feb 11, 2025 at 10:00 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2025-02-10 at 19:09 -0800, James Hunter wrote:

I think it makes sense to split the work into two parts: one part that
improves SQL execution, and a second part that improves the optimizer,
to reflect the improvements to execution.

I like the idea to store the value of work_mem in the
path/plan/executor nodes, and use that at execution time rather than
the GUC directly.

IIUC, that would allow an extension to do what you want, right? A
planner hook could just walk the tree and edit those values for
individual nodes, and the executor would enforce them.

Yes, exactly!

* The Path would store "nbytes" (= the optimizer's estimate of how
much working memory a given Path will use), to allow for future
optimizer logic to consider memory usage when choosing the best Path.

* The Plan would store a copy of "nbytes," along with "work_mem," and
the executor would enforce work_mem. A "(work_mem on)" option to the
"EXPLAIN" command would display both "nbytes" and "work_mem", per Plan
node.

* Either built-in logic or an extensibility hook would set "work_mem"
on each individual Plan node, based on whatever heuristic or rule it
chooses.

Right now, my prototype sets "work_mem" inside ExecInitNode().
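To make the idea concrete, here is a minimal, self-contained sketch of how a hook could walk a plan tree and overwrite each memory-using node's limit before node initialization. The struct and field names below are illustrative stand-ins, not PostgreSQL's real Plan structures:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a Plan node; not PostgreSQL's real struct. */
typedef struct DemoPlan
{
    struct DemoPlan *lefttree;
    struct DemoPlan *righttree;
    int         workmem_limit;  /* in kB; 0 means "uses no working memory" */
} DemoPlan;

/*
 * Recursively overwrite the limit on every memory-using node, the way a
 * hook called before ExecInitNode() could. Real code would also have to
 * visit subplans and other special child lists, omitted here.
 */
static void
assign_workmem(DemoPlan *plan, int limit_kb)
{
    if (plan == NULL)
        return;
    if (plan->workmem_limit > 0)
        plan->workmem_limit = limit_kb;
    assign_workmem(plan->lefttree, limit_kb);
    assign_workmem(plan->righttree, limit_kb);
}
```

Nodes whose limit is 0 (those that use no working memory) are left alone; everything else gets the new limit.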

Thanks,
James

#13Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#12)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, 2025-02-11 at 10:39 -0800, James Hunter wrote:

* The Path would store "nbytes" (= the optimizer's estimate of how
much working memory a given Path will use), to allow for future
optimizer logic to consider memory usage when choosing the best Path.

* The Plan would store a copy of "nbytes," along with "work_mem," and
the executor would enforce work_mem. A "(work_mem on)" option to the
"EXPLAIN" command would display both "nbytes" and "work_mem", per Plan
node.

Storing work_mem in each Plan node, and using that to enforce the
memory limit (rather than using the GUC directly), seems
uncontroversial to me. I'd suggest a standalone patch.

Storing the optimizer's estimate of the memory wanted also sounds like
a good idea. Let's pick a better name than "nbytes" though; maybe
"requested_mem" or something? This change would make it a lot easier
for an extension to adjust the per-node-work_mem, and also seems like
good infrastructure for anything we build into the planner later. I
suggest a standalone patch for this, as well.

Can you write a useful extension with just the above two core patches?

Regards,
Jeff Davis

#14James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#13)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, Feb 11, 2025 at 2:04 PM Jeff Davis <pgsql@j-davis.com> wrote:

...

Storing work_mem in each Plan node, and using that to enforce the
memory limit (rather than using the GUC directly), seems
uncontroversial to me. I'd suggest a standalone patch.

I will submit a patch for this, thanks. (This will be "Patch 3".)

Storing the optimizer's estimate of the memory wanted also sounds like
a good idea. Let's pick a better name than "nbytes" though; maybe
"requested_mem" or something? This change would make it a lot easier
for an extension to adjust the per-node-work_mem, and also seems like
good infrastructure for anything we build into the planner later. I
suggest a standalone patch for this, as well.

I will submit a patch for this as well. (This will be "Patch 1".) I
went with "workmem" instead of "nbytes" for the estimate, and
"workmem_limit" for the limit. Omitting the underscore between "work"
and "mem" makes it a bit easier to distinguish between the "work_mem"
GUC, the "workmem" estimate, and the "workmem_limit" limit.

Can you write a useful extension with just the above two core patches?

I think so; I will attach a patch for that as well. (This will be
"Patch 4"; note that "Patch 2" is a prerequisite for "Patch 3".)

Regards,
Jeff Davis

Thanks,
James Hunter

#15James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#14)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

* Patch 1 captures the optimizer’s estimate of how much working memory
a particular Plan node would need, to avoid spilling, and stores this
on the Plan, next to cost, etc. It also adds a new “work_mem on”
option to the EXPLAIN command, to display this working-memory
estimate. This “work_mem on” estimate gives the customer a sense of
how much working memory a particular query will actually use, and also
enables an extension (e.g., Patch 4), to assign working-memory limits,
per exec node, intelligently.

Patch 1 doesn't change any user-visible behavior, except for
displaying workmem estimates via EXPLAIN, when the new "work_mem on"
option is specified.

* Patch 2 is a prerequisite for Patches 3 and 4. It maintains a
subPlan list on the Plan node, next to the existing initPlan list, to
store (pointers to) regular SubPlans.

The existing initPlan list is needed, because otherwise there’s no way
to find the particular SubPlan; but this new subPlan list hasn’t been
needed before now, because every SubPlan on the list appears somewhere
inside the Plan node’s expressions. The subPlan list is needed now,
however, because a SubPlan can use working memory (if it maintains one
or two hash tables). So, we need a way to find this SubPlan, so we can
set its working-memory limit; and it doesn’t make sense to walk
through all the Plan node’s expressions, a second time, after we’ve
finalized the plan.

Instead, Patch 2 copies regular SubPlans to this new list, inside
setrefs.c, so we can find them and assign working memory to them,
later.

Patch 2 doesn't change any user-visible behavior -- it just adds some
internal bookkeeping.

* Patch 3 modifies all existing exec nodes to read their working-memory
limit off their Plan, rather than off the GUC. It adds a new function,
ExecAssignWorkMem(), which gets called from InitPlan(), immediately
before we start calling ExecInitNode(). This way, the executor could
assign different working-memory limits, based on global memory
conditions; but this patch just preserves existing behavior, and
copies these limits from the GUCs.

Patch 3 also extends the new “work_mem on” EXPLAIN option, further, to
show the working-memory limit. This is the limit already imposed by
PostgreSQL's work_mem and hash_mem_multiplier GUCs. Patch 3 copies
this limit from these GUCs, onto a new field stored on the Plan
object. It then modifies "EXPLAIN (work_mem on)" to read this limit
off the Plan object and display it.

Other than this change to EXPLAIN, Patch 3 doesn't change any
user-visible behavior.

* Patch 4, finally!, adds a hook to allow extensions to override
ExecAssignWorkMem(). It also adds an extension, “workmem,” that
implements this hook and assigns working memory to individual
execution nodes, based on a new workmem.query_work_mem GUC. This
extension prevents queries from exceeding workmem.query_work_mem,
while also handing out extra memory in the case where the query limit,
from Patch 3, is < workmem.query_work_mem.

In this way, Patch 4 avoids either undersubscribing or oversubscribing
working memory for queries, which is the goal of my proposal.
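As a rough sketch of the kind of distribution such an extension might perform (purely illustrative; the actual "workmem" extension's algorithm is in the patch), one simple policy is to scale every node's limit by the ratio of the query budget to the sum of per-node demands, so that the limits sum to exactly the budget, neither oversubscribing nor undersubscribing it:

```c
#include <assert.h>

/*
 * Illustrative sketch: scale per-node working-memory limits so that they
 * sum to exactly the query budget. If the nodes collectively want less
 * than the budget, each gets proportionally more; if they want more,
 * each gets proportionally less (and will spill sooner).
 */
static void
distribute_workmem(const double *demand_kb, double *limit_kb, int nnodes,
                   double query_work_mem_kb)
{
    double      total_demand = 0.0;

    for (int i = 0; i < nnodes; i++)
        total_demand += demand_kb[i];

    if (total_demand <= 0.0)
        return;                 /* no memory-using nodes */

    for (int i = 0; i < nnodes; i++)
        limit_kb[i] = demand_kb[i] * (query_work_mem_kb / total_demand);
}
```

For example, two nodes estimated at 100 kB and 300 kB, under a 200 kB query budget, would receive 50 kB and 150 kB respectively; under an 800 kB budget, 200 kB and 600 kB.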

Discussion follows--

A few operators currently do not honor their working-memory limits by
spilling; these operators use tuple hash tables — which don’t spill —
without implementing their own “spill” logic. I would address these
operators in a subsequent release. Note that Hash Agg and Hash Join
both spill, as expected, so the major cases already work.

I store the working-memory estimate on both Path and Plan objects.
Keeping with PostgreSQL convention that a Path is an abstraction of
one or more Plan nodes, the Path’s working-memory estimate is “total,”
while the Plan’s is “per data structure.” So, if a SQL operator
requires 2 sort buffers, the Path’s working-memory estimate will be 2x
the Plan’s.

The Plan’s estimate is “per data structure,” because it will be used
to determine the data structure’s working-memory limit. Note that
every operator (save one) currently treats work_mem [*
hash_mem_multiplier] as a per-structure limit, rather than a
per-operator limit. (The exception is Hash Agg, where all of the
node’s hash tables share the same memory limit; and so I have
preserved this behavior in the Hash Agg’s workmem and workmem_limit
fields.)

The Plan’s workmem estimate logically belongs on the Plan object (just
as the Path’s workmem logically belongs on the Path), while the
workmem_limit logically belongs on the PlanState. This is why
workmem_limit is set inside InitPlan() — it’s an execution-time limit,
not a plan-time limit.

However, the workmem_limit is stored, physically, on the Plan object,
not the PlanState. This is to avoid a chicken-and-egg problem: (a) The
PlanState is not created until ExecInitNode(); but, (b) ExecInitNode()
also typically creates the node’s data structures, sized to
workmem_limit.

So we need a way to set workmem_limit after the Plan has been
finalized, but before any exec nodes are initialized. Accordingly, we
set this field on the Plan object, with the understanding that it
doesn’t “really” belong there.

A nice consequence of storing workmem_limit on the Plan object, rather
than the PlanState, is that the limit automatically gets
serialized+deserialized to parallel workers. This simplifies Patch 3 a
little bit, since we can avoid executing ExecAssignWorkMem() on parallel
workers; but it really benefits Patch 4, because it allows the
ExecAssignWorkMem_hook to set a memory limit on the query, regardless
of the number of parallel workers that get spawned at runtime.

Notes about individual SQL operators follow--

Patch 1 reuses existing optimizer logic, as much as possible, to
calculate “workmem” — rounded up to the nearest KB, and with a minimum
of 64 KB. (The 64 KB minimum is because that’s the smallest a customer
can set the work_mem GUC, so it seems possible that some SQL operators
rely on the assumption that they’ll always get >= 64 KB of working
memory.)
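The rounding rule can be stated as a one-liner (a hypothetical helper that mirrors the rule described above, not code from the patch):

```c
#include <assert.h>

/*
 * Hypothetical helper, mirroring the rule above: convert a byte estimate
 * to whole kilobytes, rounding up, and clamp to the 64 KB minimum that
 * the work_mem GUC itself allows.
 */
static long
normalize_workmem_kb(double nbytes)
{
    long        kb = (long) ((nbytes + 1023.0) / 1024.0);   /* ceil to KB */

    return (kb < 64) ? 64 : kb;
}
```

So an estimate of 1 byte is reported as 64 kB, and 64 KB + 1 byte rounds up to 65 kB.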

The PostgreSQL operators that use working memory can be placed into
three categories:

1. Operators that use working memory, and which also cost the
possibility of spilling. For these operators, Patch 1 just reports the
“nbytes” estimate that the optimizer already produces.
1a. Sort and IncrementalSort (via cost_tuplesort()).
1b. HashJoin (via ExecChooseHashTableSize()).
1c. Material (via cost_material()).
1d. Unique (via either cost_sort() or cost_agg()).
1e. Grouping Sets (via create_groupingsets_path(), which calls
cost_sort() and cost_agg()).

NOTE: Grouping Sets can end up creating a chain of Agg plan nodes,
each of which gets its own working-memory budget. Discussed below.

1f. Agg (via cost_agg()).

NOTE: Discussed below.

1g. SetOp (via create_setop_path()).

NOTE: A SetOp never spills; however, existing logic disables the SetOp
“if it doesn't look like the hashtable will fit into hash_mem.” It
assumes the hash entry size is: MAXALIGN(leftpath->pathtarget->width)
+ MAXALIGN(SizeofMinimalTupleHeader).

2. Operators that use working memory, but which do not currently cost
the possibility of spilling, because the existing estimate is assumed
to be unreliable. For these operators, Patch 1 just reports an
“unreliable” estimate.
2a. FunctionScan.
2b. TableFuncScan.

3. Remaining operators that use working memory, but for whatever
reason do not currently cost the possibility of spilling. For these
operators, Patch 1 just computes and reports an estimate, based on
logic appearing elsewhere in the code.

3a. RecursiveUnion. (Uses two Tuplestores, and possibly a
TupleHashTable.) Patch 1 uses nrterm to estimate one of the
Tuplestores; rterm to estimate the second Tuplestore; and (if
relevant) numGroups to estimate # of hash buckets.
3b. CteScan (but *not* WorkTableScan); relies on cost_ctescan().
Patch 1 just uses rows * width, since the output is materialized into
a Tuplestore.
3c. Memoize. Patch 1 uses ndistinct to estimate # of hash buckets.
3d. WindowAgg. Patch 1 uses startup_tuples to estimate # of tuples
materialized in the Tuplestore.
3e. BitmapHeapScan. Although the TID bitmaps created by the
bitmapqual’s BitmapIndexScan nodes are limited to work_mem, these
bitmaps lossify rather than spill. Patch 1 applies the inverse of
tbm_calculate_entries() to the expected number of heap rows, produced
by the optimizer.
3f. SubPlan, if it requires a hash table (and possibly a hash-NULL
table). Patch 1 uses rows and rows / 16, respectively, copying the
existing logic in nodeSubplan.c and subselect.c.

NOTE: Since we don’t display SubPlans directly, in EXPLAIN, Patch 1
includes this working-memory estimate along with the SubPlan’s parent
Plan node.

Final comments --

I think the attached patch-set is useful, by itself; but it also
serves as a necessary building block for future work to manage query
working-memory dynamically. For example, the optimizer could be
enhanced to trade off between a high-memory + low cost plan, and a
low-memory + high cost plan. The execution-time extension could be
enhanced to adjust its query-working-memory limit based on current,
global memory usage.

And individual exec nodes could be enhanced to request additional
working-memory, via hook, if they discover they need to spill
unexpectedly. (For example, this would be useful for serial Hash
Joins.)

Question / comments / suggestions / issues / complaints?
Thanks,
James Hunter

[1]: /messages/by-id/CAJVSvF5kMi1-fwBDSv-9bvUjm83zUNEnL95B0s+i08sKDL5-mA@mail.gmail.com

Attachments:

v01_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch
From 099366618d3f15f69bd9542d7d31f82148889a11 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 24 Jan 2025 20:48:39 +0000
Subject: [PATCH 1/4] EXPLAIN now takes "work_mem" option, to display estimated
 working memory

This commit adds option "WORK_MEM" to the existing EXPLAIN command. When
set to ON, the EXPLAIN output will include text of the form "(work_mem=
5.67 kB)" on every plan node that uses working memory.

The output is an *estimate*, typically based on the estimated number of
input rows for that plan node.

Normalize "working-memory" estimates to a minimum of 64 KB

The minimum possible value of the "work_mem" GUC is 64 KB. This commit
changes the tracking + output for "EXPLAIN (WORK_MEM ON)" so that it
reports a minimum of 64 KB for every node or subcomponent that requires
working memory.

It also rounds "nbytes" up to the nearest whole KB (= ceil()), and
changes the EXPLAIN output to report a whole integer, rather than to
two decimal places. Note that 1 KB = 1.6 percent of the 64 KB
minimum.

To allow for future optimizers to make decisions at Path time, this commit
aggregates the Path's total working memory onto the Path's "workmem" field.
To allow the executor to restrict memory usage by individual data
structure, it then breaks that total working memory into per-data structure
working memory, on the Plan.

Also adds a "Total Working Memory" line at the bottom of the
plan output.
---
 src/backend/commands/explain.c          | 207 ++++++++
 src/backend/executor/nodeHash.c         |  15 +-
 src/backend/nodes/tidbitmap.c           |  18 +
 src/backend/optimizer/path/costsize.c   | 387 ++++++++++++++-
 src/backend/optimizer/plan/createplan.c | 215 +++++++-
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 +-
 src/include/commands/explain.h          |   3 +
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/pathnodes.h           |  11 +
 src/include/nodes/plannodes.h           |  11 +
 src/include/nodes/primnodes.h           |   2 +
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  12 +-
 src/include/optimizer/planmain.h        |   2 +-
 src/test/regress/expected/workmem.out   | 631 ++++++++++++++++++++++++
 src/test/regress/parallel_schedule      |   2 +-
 src/test/regress/sql/workmem.sql        | 303 ++++++++++++
 18 files changed, 1828 insertions(+), 60 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c0d614866a9..e09d7f868c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,6 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+static void compute_subplan_workmem(List *plans, double *workmem);
+static void compute_agg_workmem(Agg *agg, double *workmem);
 
 
 
@@ -235,6 +237,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -835,6 +839,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory", "kB",
+							 es->total_workmem, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1970,6 +1980,77 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_workmem = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				break;
+			case T_FunctionScan:
+				{
+					FunctionScan *fscan = (FunctionScan *) plan;
+
+					plan_workmem += (double) plan->workmem *
+						list_length(fscan->functions);
+					break;
+				}
+			case T_IncrementalSort:
+
+				/*
+				 * IncrementalSort creates two Tuplestores, each of
+				 * (estimated) size workmem.
+				 */
+				plan_workmem = (double) plan->workmem * 2;
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					/*
+					 * RecursiveUnion creates two Tuplestores, each of
+					 * (estimated) size workmem, plus (possibly) a hash table
+					 * of size hashWorkMem.
+					 */
+					plan_workmem += (double) plan->workmem * 2 +
+						runion->hashWorkMem;
+					break;
+				}
+			default:
+				if (plan->workmem > 0)
+					plan_workmem += plan->workmem;
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_workmem *= (1 + es->num_workers);
+
+		es->total_workmem += plan_workmem;
+
+		if (plan_workmem > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
+								 plan_workmem);
+			else
+				ExplainPropertyFloat("Working Memory", "kB",
+									 plan_workmem, 0, es);
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2536,6 +2617,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlan-s, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = ((Gather *) plan)->num_workers;
+		else
+			es->num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2592,6 +2687,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -5952,3 +6053,109 @@ GetSerializationMetrics(DestReceiver *dest)
 
 	return empty;
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a SubPlan object
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments the workmem counter to include the SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *workmem)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem > 0)
+			*workmem += sp->hashtab_workmem;
+
+		if (sp->hashnul_workmem > 0)
+			*workmem += sp->hashnul_workmem;
+	}
+}
+
+/* Accumulator for computing an Agg's working-memory estimate. */
+typedef struct AggWorkMem
+{
+	double		input_sort_workmem;
+
+	double		output_hash_workmem;
+
+	int			num_sort_nodes;
+	double		max_output_sort_workmem;
+}			AggWorkMem;
+
+static void
+compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
+{
+	/* Record memory used for input sort buffers. */
+	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+
+	/* Record memory used for output data structures. */
+	switch (agg->aggstrategy)
+	{
+		case AGG_SORTED:
+
+			/* Track the largest buffer; at most two are alive at a time. */
+			mem->max_output_sort_workmem =
+				Max(mem->max_output_sort_workmem, agg->plan.workmem);
+
+			++mem->num_sort_nodes;
+			break;
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * All hash tables created by "hash" phases are kept for the
+			 * lifetime of the Agg.
+			 */
+			mem->output_hash_workmem += agg->plan.workmem;
+			break;
+		default:
+
+			/*
+			 * "Plain" phases don't use working memory (they output a single
+			 * aggregated tuple).
+			 */
+			break;
+	}
+}
+
+/*
+ * compute_agg_workmem - compute total workmem for an Agg node
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node. However, to make life
+ * easier for the executor, we store the estimated working memory ("workmem")
+ * on each individual Agg node.
+ *
+ * This function computes the combined workmem, so that we can display this
+ * value on the main Agg node.
+ */
+static void
+compute_agg_workmem(Agg *agg, double *workmem)
+{
+	AggWorkMem	mem;
+	ListCell   *lc;
+
+	memset(&mem, 0, sizeof(mem));
+
+	compute_agg_workmem_node(agg, &mem);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach(lc, agg->chain)
+	{
+		Agg		   *aggnode = (Agg *) lfirst(lc);
+
+		compute_agg_workmem_node(aggnode, &mem);
+	}
+
+	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+
+	/* We'll have at most two sort buffers alive, at any time. */
+	*workmem += mem.num_sort_nodes > 2 ?
+		mem.max_output_sort_workmem * 2.0 :
+		mem.max_output_sort_workmem;
+}
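To make the chained-sort accounting above concrete, here is a hedged, standalone sketch (not part of the patch) of the rule compute_agg_workmem() applies to AGG_SORTED nodes; values are in KB:

```c
#include <assert.h>

/*
 * Illustrative stand-in for compute_agg_workmem()'s rule for chained
 * AGG_SORTED nodes: since at most two output sort buffers are alive at any
 * time, charge the largest buffer twice only when more than two sorted
 * nodes exist.
 */
static double
sorted_chain_workmem(int num_sort_nodes, double max_output_sort_workmem)
{
	return num_sort_nodes > 2 ? max_output_sort_workmem * 2.0 :
		max_output_sort_workmem;
}
```

With one or two sorted nodes we charge the largest buffer once; with three or more, twice.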
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..d54cfe5fdbe 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -452,6 +453,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -477,7 +479,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -661,7 +663,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -792,6 +795,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	*workmem = normalize_workmem(inner_rel_bytes + bucket_bytes);
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -811,7 +817,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
@@ -929,7 +936,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
 		*space_allowed = (*space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 66b3c387d53..43df31cdb21 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1558,6 +1558,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
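As a sanity check, tbm_calculate_bytes() above reduces to this standalone sketch (not part of the patch; the 40-byte and 8-byte sizes are illustrative stand-ins for sizeof(PagetableEntry) and sizeof(Pointer)):

```c
#include <assert.h>

/* Bytes grow linearly in the entry count, after clamping. */
static double
tbm_bytes_sketch(double maxentries)
{
	if (maxentries > 2147483646.0)	/* INT_MAX - 1, safety limit */
		maxentries = 2147483646.0;
	if (maxentries < 16.0)		/* sanity limit */
		maxentries = 16.0;

	return maxentries * (40.0 + 8.0 + 8.0);
}
```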
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..7c1fdde842b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -104,6 +104,7 @@
 #include "optimizer/plancat.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +201,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1118,18 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1605,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1672,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1778,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1864,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1891,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(nrterm->rows,
+												 nrterm->pathtarget->width),
+							  relation_byte_size(rterm->rows,
+												 rterm->pathtarget->width))
+						  * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_workmem(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1961,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1981,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2039,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2069,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2144,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2185,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full"), and one that sorts only suffix columns ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_workmem(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2217,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2230,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2522,6 +2591,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2592,6 +2662,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_workmem(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2866,6 +2939,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSorts > 0)
+	{
+		path->workmem += (double) aggcosts->numSorts *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3100,7 +3186,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3182,6 +3268,11 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_workmem(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3336,6 +3427,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3799,6 +3891,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4170,6 +4270,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4227,7 +4328,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4258,6 +4360,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4266,8 +4369,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4284,6 +4387,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4510,6 +4614,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4532,6 +4637,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4541,6 +4649,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Estimate working memory needed for the hashtable (and hashnulls, if
+		 * needed). The logic below MUST match the logic in buildSubPlanHash()
+		 * and ExecInitSubPlan().
+		 */
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem =
+			normalize_workmem((double) nbuckets * hashentrysize);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem =
+				normalize_workmem((double) nbuckets * hashentrysize);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
@@ -6424,7 +6563,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6603,3 +6742,197 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed for a sort buffer on an aggregate's input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they must sort their input
+ * before they can process it. We need one sort buffer per such aggregate, and
+ * this function returns that sort buffer's estimated size (in KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	/* Account for size of one buffer needed to sort the input. */
+	return normalize_workmem(input_tuples * input_width);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_workmem(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_workmem(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the "workmem" or "rows" fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total working memory needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_workmem
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes to a minimum of 64 (KB), rounding up to the nearest whole KB.
+ */
+int
+normalize_workmem(double nbytes)
+{
+	double		workmem;
+
+	/*
+	 * We'll assign working memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nbytes / 1024.0);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..973b86371ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -1853,6 +1854,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 groupCollations,
 								 NIL,
 								 NIL,
+								 0, /* numSorts */
 								 best_path->path.rows,
 								 0,
 								 subplan);
@@ -1911,6 +1913,15 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 	/* Copy cost data from Path to Plan */
 	copy_generic_path_info(plan, &best_path->path);
 
+	if (IsA(plan, Unique))
+	{
+		/*
+		 * We assigned "workmem" to the Sort subplan. Clear it from the
+		 * top-level Unique node, to avoid double-counting.
+		 */
+		plan->workmem = 0;
+	}
+
 	return plan;
 }
 
@@ -2228,6 +2239,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	/*
+	 * IncrementalSort creates two sort buffers, which the Path's "workmem"
+	 * estimate combined into a single value. Split it into two now.
+	 */
+	plan->sort.plan.workmem = Max(64, best_path->spath.path.workmem / 2);
+
 	return plan;
 }
 
@@ -2333,12 +2351,29 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 												subplan->targetlist),
 					NIL,
 					NIL,
+					best_path->numSorts,
 					best_path->numGroups,
 					best_path->transitionSpace,
 					subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the overall workmem estimate that we copied from the Path with
+	 * finer-grained estimates.
+	 */
+	plan->plan.workmem =
+		compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+								   plan->transitionSpace, subplan->plan_rows,
+								   subplan->plan_width, false /* cost_sort */ );
+
+	/* Also include estimated memory needed to sort the input: */
+	if (plan->numSorts > 0)
+	{
+		plan->sortWorkMem = compute_agg_input_workmem(subplan->plan_rows,
+													  subplan->plan_width);
+	}
+
 	return plan;
 }
 
@@ -2457,8 +2492,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2480,19 +2516,20 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								best_path->numSorts,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,7 +2540,36 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
-			chain = lappend(chain, agg_plan);
+			/*
+			 * If we're an AGG_SORTED, but not the last, we need to cost
+			 * working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/*
+			 * Although this side node doesn't need accurate cost estimates,
+			 * it does need an accurate *memory* estimate, since we'll use
+			 * that estimate to distribute working memory to this side node,
+			 * at runtime.
+			 */
+
+			/* Estimated memory needed to hold the output: */
+			agg_plan->plan.workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width, cost_sort);
+
+			/* Also include estimated memory needed to sort the input: */
+			if (agg_plan->numSorts > 0)
+			{
+				agg_plan->sortWorkMem =
+					compute_agg_input_workmem(subplan->plan_rows,
+											  subplan->plan_width);
+			}
+
+			chain = lappend(chain, (Plan *) agg_plan);
 		}
 	}
 
@@ -2514,6 +2580,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2529,12 +2596,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 						extract_grouping_collations(rollup->groupClause, subplan->targetlist),
 						rollup->gsets,
 						chain,
+						best_path->numSorts,
 						rollup->numGroups,
 						best_path->transitionSpace,
 						subplan);
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * If we're an AGG_SORTED, but not the last, we need to account for
+		 * the working memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates.
+		 */
+		plan->plan.workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows, subplan->plan_width,
+									   cost_sort);
+
+		/* Also include estimated memory needed to sort the input: */
+		if (plan->numSorts > 0)
+		{
+			plan->sortWorkMem =
+				compute_agg_input_workmem(subplan->plan_rows,
+										  subplan->plan_width);
+		}
 	}
 
 	return (Plan *) plan;
@@ -2783,6 +2875,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(leftplan->plan_rows,
+												 leftplan->plan_width),
+							  relation_byte_size(rightplan->plan_rows,
+												 rightplan->plan_width)));
+
+	if (plan->numCols > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		entrysize;
+
+		entrysize = sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMem =
+			normalize_workmem(plan->numGroups * entrysize);
+	}
+
 	return plan;
 }
 
@@ -3223,6 +3347,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3309,6 +3434,12 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&scan_plan->scan.plan, &best_path->path);
 
+	/*
+	 * We assigned "workmem" to the "bitmapqualplan" subplan. Clear it from
+	 * the top-level BitmapHeapScan node, to avoid double-counting.
+	 */
+	scan_plan->scan.plan.workmem = 0;
+
 	return scan_plan;
 }
 
@@ -3334,9 +3465,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3362,6 +3508,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3373,8 +3521,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3409,6 +3556,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3437,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3484,8 +3632,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->workmem =
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
@@ -3796,6 +3945,14 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem =
+		normalize_workmem(relation_byte_size(scan_plan->scan.plan.plan_rows,
+											 scan_plan->scan.plan.plan_width));
+
 	return scan_plan;
 }
 
@@ -4615,6 +4772,9 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
+		matplan->workmem =
+			normalize_workmem(relation_byte_size(matplan->plan_rows,
+												 matplan->plan_width));
 
 		inner_plan = matplan;
 	}
@@ -4961,6 +5121,10 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Display "workmem" on the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem = join_plan->join.plan.workmem;
+	join_plan->join.plan.workmem = 0;
+
 	return join_plan;
 }
 
@@ -5458,6 +5622,7 @@ copy_generic_path_info(Plan *dest, Path *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = (int) Min(src->workmem, (double) MAX_KILOBYTES);
 	dest->plan_rows = src->rows;
 	dest->plan_width = src->pathtarget->width;
 	dest->parallel_aware = src->parallel_aware;
@@ -5474,6 +5639,7 @@ copy_plan_costsize(Plan *dest, Plan *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = src->workmem;
 	dest->plan_rows = src->plan_rows;
 	dest->plan_width = src->plan_width;
 	/* Assume the inserted node is not parallel-aware. */
@@ -5509,6 +5675,7 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 			  limit_tuples);
 	plan->plan.startup_cost = sort_path.startup_cost;
 	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.workmem = (int) Min(sort_path.workmem, (double) MAX_KILOBYTES);
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5540,6 +5707,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 						  limit_tuples);
 	plan->sort.plan.startup_cost = sort_path.startup_cost;
 	plan->sort.plan.total_cost = sort_path.total_cost;
+	plan->sort.plan.workmem = (int) Min(sort_path.workmem,
+										(double) MAX_KILOBYTES);
 	plan->sort.plan.plan_rows = lefttree->plan_rows;
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
@@ -6673,7 +6842,7 @@ Agg *
 make_agg(List *tlist, List *qual,
 		 AggStrategy aggstrategy, AggSplit aggsplit,
 		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-		 List *groupingSets, List *chain, double dNumGroups,
+		 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 		 Size transitionSpace, Plan *lefttree)
 {
 	Agg		   *node = makeNode(Agg);
@@ -6689,6 +6858,8 @@ make_agg(List *tlist, List *qual,
 	node->grpColIdx = grpColIdx;
 	node->grpOperators = grpOperators;
 	node->grpCollations = grpCollations;
+	node->numSorts = numSorts;
+	node->sortWorkMem = 0;		/* caller will fill this */
 	node->numGroups = numGroups;
 	node->transitionSpace = transitionSpace;
 	node->aggParams = NULL;		/* SS_finalize_plan() will fill this */
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..3eba364484d 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSorts;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..c533bfb9a58 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem at hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_workmem(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But the working memory used is 0, since the worktable scan doesn't
+	 * create a tuplestore -- it just reuses a tuplestore already created by
+	 * the recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSorts = aggcosts ? aggcosts->numSorts : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSorts = agg_costs ? agg_costs->numSorts : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_workmem(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 570e7cad1fa..50454952eb2 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,8 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes planned to use */
+	double		total_workmem;	/* total working memory estimate (in bytes) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..fc5b20994dd 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -62,7 +62,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..17eb6b52579 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSorts;		/* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -1697,6 +1698,13 @@ typedef struct Path
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
 
+	/*
+	 * NOTE: The Path's workmem is a double, rather than an int, because it
+	 * sometimes combines multiple working-memory estimates (e.g., for
+	 * GroupingSetsPath).
+	 */
+	Cost		workmem;		/* estimated work_mem (in KB) */
+
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
 } Path;
@@ -2290,6 +2298,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSorts;		/* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2331,6 +2340,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSorts;		/* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3374,6 +3384,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..67da7f091b5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -168,6 +168,8 @@ typedef struct Plan
 	/* total cost (assuming all tuples fetched) */
 	Cost		total_cost;
 
+	int			workmem;		/* estimated work_mem (in KB) */
+
 	/*
 	 * planner's estimate of result size of this plan step
 	 */
@@ -426,6 +428,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* estimated work_mem for hash table (in KB) */
+	int			hashWorkMem;
 } RecursiveUnion;
 
 /* ----------------
@@ -1145,6 +1150,12 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
+
+	/* estimated work_mem needed to sort each input (in KB) */
+	int			sortWorkMem;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 839e71d52f4..b7d6b0fe7dc 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,8 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
+	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
 } SubPlan;
 
 /*
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index a6ffeac90be..df8e7de9dc2 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -85,6 +85,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..737c553a409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,17 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_workmem(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..cf3694a744f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
 extern Agg *make_agg(List *tlist, List *qual,
 					 AggStrategy aggstrategy, AggSplit aggsplit,
 					 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-					 List *groupingSets, List *chain, double dNumGroups,
+					 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 					 Size transitionSpace, Plan *lefttree);
 extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
 						 LimitOption limitOption, int uniqNumCols,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..215180808f4
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,631 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(10 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+             workmem_filter              
+-----------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory: N kB
+(13 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                       workmem_filter                        
+-------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory: N kB
+(14 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+(9 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+(17 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+         QUERY PLAN         
+----------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory: 0 kB
+(3 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                      workmem_filter                       
+-----------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                     workmem_filter                      
+---------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                      workmem_filter                      
+----------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+ Total Working Memory: N kB
+(4 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory: 0 kB
+(5 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory: N kB
+(5 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                       workmem_filter                       
+------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory: N kB
+(14 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB)
+   ->  CTE Scan on q1  (work_mem=N kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory: N kB
+(11 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(6 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory: N kB
+(19 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+       workmem_filter       
+----------------------------
+ Result  (work_mem=N kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+(6 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+(9 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..5878f2aa4c4
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,303 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

Attachment: v01_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch (application/octet-stream)
From ea57eb88096287fe55251903081adced4d1f3bc4 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Thu, 20 Feb 2025 17:33:48 +0000
Subject: [PATCH 2/4] Store non-init-plan SubPlan objects in Plan list

We currently track SubPlan objects on Plans either via the plan->initPlan
list (for init plans) or via whatever expression contains the SubPlan (for
regular subplans).

A SubPlan object can itself use working memory, if it uses a hash table.
That hash table is associated with the SubPlan itself, not with the Plan
to which the SubPlan points.

To allow us to assign working memory to an individual SubPlan, this commit
stores a link to each regular SubPlan in a new plan->subPlan list, added
when we finalize the (parent) Plan whose expression contains that SubPlan.

Unlike the existing plan->initPlan list, the new plan->subPlan list will
not be used to initialize SubPlan nodes -- that must still be done when we
initialize the expression that contains the SubPlan. Instead, we will use
it, during InitPlan() but before ExecInitNode(), to assign a working-memory
limit to each SubPlan.
---
 src/backend/optimizer/plan/setrefs.c | 284 +++++++++++++++++----------
 src/include/nodes/plannodes.h        |   2 +
 2 files changed, 177 insertions(+), 109 deletions(-)

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 999a5a8ab5a..8a4e77baa90 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -58,6 +58,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	int			rtoffset;
 	double		num_exec;
 } fix_scan_expr_context;
@@ -65,6 +66,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *outer_itlist;
 	indexed_tlist *inner_itlist;
 	Index		acceptable_rel;
@@ -76,6 +78,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *subplan_itlist;
 	int			newvarno;
 	int			rtoffset;
@@ -127,8 +130,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, plan, lst, rtoffset, num_exec) \
+	((List *) fix_scan_expr(root, plan, (Node *) (lst), rtoffset, num_exec))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -157,7 +160,7 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 										int rtoffset);
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
-static Node *fix_scan_expr(PlannerInfo *root, Node *node,
+static Node *fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node,
 						   int rtoffset, double num_exec);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
@@ -183,7 +186,7 @@ static Var *search_indexed_tlist_for_sortgroupref(Expr *node,
 												  Index sortgroupref,
 												  indexed_tlist *itlist,
 												  int newvarno);
-static List *fix_join_expr(PlannerInfo *root,
+static List *fix_join_expr(PlannerInfo *root, Plan *plan,
 						   List *clauses,
 						   indexed_tlist *outer_itlist,
 						   indexed_tlist *inner_itlist,
@@ -193,7 +196,7 @@ static List *fix_join_expr(PlannerInfo *root,
 						   double num_exec);
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
-static Node *fix_upper_expr(PlannerInfo *root,
+static Node *fix_upper_expr(PlannerInfo *root, Plan *plan,
 							Node *node,
 							indexed_tlist *subplan_itlist,
 							int newvarno,
@@ -202,7 +205,7 @@ static Node *fix_upper_expr(PlannerInfo *root,
 							double num_exec);
 static Node *fix_upper_expr_mutator(Node *node,
 									fix_upper_expr_context *context);
-static List *set_returning_clause_references(PlannerInfo *root,
+static List *set_returning_clause_references(PlannerInfo *root, Plan *plan,
 											 List *rlist,
 											 Plan *topplan,
 											 Index resultRelation,
@@ -633,10 +636,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -646,13 +649,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablesample = (TableSampleClause *)
-					fix_scan_expr(root, (Node *) splan->tablesample,
+					fix_scan_expr(root, plan, (Node *) splan->tablesample,
 								  rtoffset, 1);
 			}
 			break;
@@ -662,22 +665,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual,
+					fix_scan_list(root, plan, splan->indexqual,
 								  rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexorderby =
-					fix_scan_list(root, splan->indexorderby,
+					fix_scan_list(root, plan, splan->indexorderby,
 								  rtoffset, 1);
 				splan->indexorderbyorig =
-					fix_scan_list(root, splan->indexorderbyorig,
+					fix_scan_list(root, plan, splan->indexorderbyorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -697,9 +700,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, plan, splan->indexqual, rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -709,13 +712,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->bitmapqualorig =
-					fix_scan_list(root, splan->bitmapqualorig,
+					fix_scan_list(root, plan, splan->bitmapqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -725,13 +728,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidquals =
-					fix_scan_list(root, splan->tidquals,
+					fix_scan_list(root, plan, splan->tidquals,
 								  rtoffset, 1);
 			}
 			break;
@@ -741,13 +744,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidrangequals =
-					fix_scan_list(root, splan->tidrangequals,
+					fix_scan_list(root, plan, splan->tidrangequals,
 								  rtoffset, 1);
 			}
 			break;
@@ -762,13 +765,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, plan, splan->functions, rtoffset, 1);
 			}
 			break;
 		case T_TableFuncScan:
@@ -777,13 +780,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablefunc = (TableFunc *)
-					fix_scan_expr(root, (Node *) splan->tablefunc,
+					fix_scan_expr(root, plan, (Node *) splan->tablefunc,
 								  rtoffset, 1);
 			}
 			break;
@@ -793,13 +796,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->values_lists =
-					fix_scan_list(root, splan->values_lists,
+					fix_scan_list(root, plan, splan->values_lists,
 								  rtoffset, 1);
 			}
 			break;
@@ -809,10 +812,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -822,10 +825,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -835,10 +838,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -877,7 +880,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 */
 				set_dummy_tlist_references(plan, rtoffset);
 
-				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
+				mplan->param_exprs = fix_scan_list(root, plan, mplan->param_exprs,
 												   rtoffset,
 												   NUM_EXEC_TLIST(plan));
 				break;
@@ -939,9 +942,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitOffset, rtoffset, 1);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitCount, rtoffset, 1);
 			}
 			break;
 		case T_Agg:
@@ -994,14 +997,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
-				wplan->runCondition = fix_scan_list(root,
+					fix_scan_expr(root, plan, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root, plan,
 													wplan->runCondition,
 													rtoffset,
 													NUM_EXEC_TLIST(plan));
-				wplan->runConditionOrig = fix_scan_list(root,
+				wplan->runConditionOrig = fix_scan_list(root, plan,
 														wplan->runConditionOrig,
 														rtoffset,
 														NUM_EXEC_TLIST(plan));
@@ -1043,15 +1046,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					}
 
 					splan->plan.targetlist =
-						fix_scan_list(root, splan->plan.targetlist,
+						fix_scan_list(root, plan, splan->plan.targetlist,
 									  rtoffset, NUM_EXEC_TLIST(plan));
 					splan->plan.qual =
-						fix_scan_list(root, splan->plan.qual,
+						fix_scan_list(root, plan, splan->plan.qual,
 									  rtoffset, NUM_EXEC_QUAL(plan));
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->resconstantqual, rtoffset,
+								  1);
 			}
 			break;
 		case T_ProjectSet:
@@ -1066,7 +1070,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->withCheckOptionLists =
-					fix_scan_list(root, splan->withCheckOptionLists,
+					fix_scan_list(root, plan, splan->withCheckOptionLists,
 								  rtoffset, 1);
 
 				if (splan->returningLists)
@@ -1086,7 +1090,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						List	   *rlist = (List *) lfirst(lcrl);
 						Index		resultrel = lfirst_int(lcrr);
 
-						rlist = set_returning_clause_references(root,
+						rlist = set_returning_clause_references(root, plan,
 																rlist,
 																subplan,
 																resultrel,
@@ -1121,13 +1125,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					itlist = build_tlist_index(splan->exclRelTlist);
 
 					splan->onConflictSet =
-						fix_join_expr(root, splan->onConflictSet,
+						fix_join_expr(root, plan, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
 
 					splan->onConflictWhere = (Node *)
-						fix_join_expr(root, (List *) splan->onConflictWhere,
+						fix_join_expr(root, plan, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
@@ -1135,7 +1139,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, plan, splan->exclRelTlist, rtoffset, 1);
 				}
 
 				/*
@@ -1186,7 +1190,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 							MergeAction *action = (MergeAction *) lfirst(l);
 
 							/* Fix targetList of each action. */
-							action->targetList = fix_join_expr(root,
+							action->targetList = fix_join_expr(root, plan,
 															   action->targetList,
 															   NULL, itlist,
 															   resultrel,
@@ -1195,7 +1199,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   NUM_EXEC_TLIST(plan));
 
 							/* Fix quals too. */
-							action->qual = (Node *) fix_join_expr(root,
+							action->qual = (Node *) fix_join_expr(root, plan,
 																  (List *) action->qual,
 																  NULL, itlist,
 																  resultrel,
@@ -1206,7 +1210,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 						/* Fix join condition too. */
 						mergeJoinCondition = (Node *)
-							fix_join_expr(root,
+							fix_join_expr(root, plan,
 										  (List *) mergeJoinCondition,
 										  NULL, itlist,
 										  resultrel,
@@ -1353,7 +1357,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 
 	plan->scan.scanrelid += rtoffset;
 	plan->scan.plan.targetlist = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.targetlist,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1361,7 +1365,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_TLIST((Plan *) plan));
 	plan->scan.plan.qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.qual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1369,7 +1373,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	plan->recheckqual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->recheckqual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1377,13 +1381,13 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
-	plan->indexqual = fix_scan_list(root, plan->indexqual,
+	plan->indexqual = fix_scan_list(root, (Plan *) plan, plan->indexqual,
 									rtoffset, 1);
 	/* indexorderby is already transformed to reference index columns */
-	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
+	plan->indexorderby = fix_scan_list(root, (Plan *) plan, plan->indexorderby,
 									   rtoffset, 1);
 	/* indextlist must NOT be transformed to reference index columns */
-	plan->indextlist = fix_scan_list(root, plan->indextlist,
+	plan->indextlist = fix_scan_list(root, (Plan *) plan, plan->indextlist,
 									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 
 	pfree(index_itlist);
@@ -1430,10 +1434,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		 */
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
-			fix_scan_list(root, plan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 		plan->scan.plan.qual =
-			fix_scan_list(root, plan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
 
 		result = (Plan *) plan;
@@ -1599,7 +1603,7 @@ set_foreignscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(fscan->fdw_scan_tlist);
 
 		fscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1607,7 +1611,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1615,7 +1619,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1623,7 +1627,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_recheck_quals,
 						   itlist,
 						   INDEX_VAR,
@@ -1633,7 +1637,7 @@ set_foreignscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
-			fix_scan_list(root, fscan->fdw_scan_tlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 	}
 	else
@@ -1643,16 +1647,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 * way
 		 */
 		fscan->scan.plan.targetlist =
-			fix_scan_list(root, fscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual =
-			fix_scan_list(root, fscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs =
-			fix_scan_list(root, fscan->fdw_exprs,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals =
-			fix_scan_list(root, fscan->fdw_recheck_quals,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_recheck_quals,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 	}
 
@@ -1685,7 +1689,7 @@ set_customscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(cscan->custom_scan_tlist);
 
 		cscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1693,7 +1697,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1701,7 +1705,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->custom_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1711,20 +1715,20 @@ set_customscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
-			fix_scan_list(root, cscan->custom_scan_tlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
-			fix_scan_list(root, cscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual =
-			fix_scan_list(root, cscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs =
-			fix_scan_list(root, cscan->custom_exprs,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 	}
 
@@ -1752,7 +1756,8 @@ set_customscan_references(PlannerInfo *root,
  * startup time.
  */
 static int
-register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
+register_partpruneinfo(PlannerInfo *root, Plan *plan, int part_prune_index,
+					   int rtoffset)
 {
 	PlannerGlobal *glob = root->glob;
 	PartitionPruneInfo *pinfo;
@@ -1776,10 +1781,10 @@ register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
 
 			prelinfo->rtindex += rtoffset;
 			prelinfo->initial_pruning_steps =
-				fix_scan_list(root, prelinfo->initial_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->initial_pruning_steps,
 							  rtoffset, 1);
 			prelinfo->exec_pruning_steps =
-				fix_scan_list(root, prelinfo->exec_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->exec_pruning_steps,
 							  rtoffset, 1);
 
 			for (i = 0; i < prelinfo->nparts; i++)
@@ -1863,7 +1868,8 @@ set_append_references(PlannerInfo *root,
 	 */
 	if (aplan->part_prune_index >= 0)
 		aplan->part_prune_index =
-			register_partpruneinfo(root, aplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) aplan,
+								   aplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(aplan->plan.lefttree == NULL);
@@ -1931,7 +1937,8 @@ set_mergeappend_references(PlannerInfo *root,
 	 */
 	if (mplan->part_prune_index >= 0)
 		mplan->part_prune_index =
-			register_partpruneinfo(root, mplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) mplan,
+								   mplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(mplan->plan.lefttree == NULL);
@@ -1958,7 +1965,7 @@ set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	 */
 	outer_itlist = build_tlist_index(outer_plan->targetlist);
 	hplan->hashkeys = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) hplan->hashkeys,
 					   outer_itlist,
 					   OUTER_VAR,
@@ -2194,7 +2201,8 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * replacing Aggref nodes that should be replaced by initplan output Params,
  * choosing the best implementation for AlternativeSubPlans,
  * looking up operator opcode info for OpExpr and related nodes,
- * and adding OIDs from regclass Const nodes into root->glob->relationOids.
+ * adding OIDs from regclass Const nodes into root->glob->relationOids, and
+ * recording Subplans that use hash tables.
  *
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
@@ -2204,11 +2212,13 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node, int rtoffset,
+			  double num_exec)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
 
@@ -2299,8 +2309,21 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 															 (AlternativeSubPlan *) node,
 															 context->num_exec),
 									 context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_scan_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 static bool
@@ -2312,6 +2335,17 @@ fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this SubPlan so that we can assign working memory to it (if
+		 * needed).
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
 	return expression_tree_walker(node, fix_scan_expr_walker, context);
 }
 
@@ -2341,7 +2375,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * NestLoopParams now, because those couldn't refer to nullable
 	 * subexpressions.
 	 */
-	join->joinqual = fix_join_expr(root,
+	join->joinqual = fix_join_expr(root, (Plan *) join,
 								   join->joinqual,
 								   outer_itlist,
 								   inner_itlist,
@@ -2371,7 +2405,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 			 * make things match up perfectly seems well out of proportion to
 			 * the value.
 			 */
-			nlp->paramval = (Var *) fix_upper_expr(root,
+			nlp->paramval = (Var *) fix_upper_expr(root, (Plan *) join,
 												   (Node *) nlp->paramval,
 												   outer_itlist,
 												   OUTER_VAR,
@@ -2388,7 +2422,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		MergeJoin  *mj = (MergeJoin *) join;
 
-		mj->mergeclauses = fix_join_expr(root,
+		mj->mergeclauses = fix_join_expr(root, (Plan *) join,
 										 mj->mergeclauses,
 										 outer_itlist,
 										 inner_itlist,
@@ -2401,7 +2435,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		HashJoin   *hj = (HashJoin *) join;
 
-		hj->hashclauses = fix_join_expr(root,
+		hj->hashclauses = fix_join_expr(root, (Plan *) join,
 										hj->hashclauses,
 										outer_itlist,
 										inner_itlist,
@@ -2414,7 +2448,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 		 * HashJoin's hashkeys are used to look for matching tuples from its
 		 * outer plan (not the Hash node!) in the hashtable.
 		 */
-		hj->hashkeys = (List *) fix_upper_expr(root,
+		hj->hashkeys = (List *) fix_upper_expr(root, (Plan *) join,
 											   (Node *) hj->hashkeys,
 											   outer_itlist,
 											   OUTER_VAR,
@@ -2433,7 +2467,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * be, so we just tell fix_join_expr to accept superset nullingrels
 	 * matches instead of exact ones.
 	 */
-	join->plan.targetlist = fix_join_expr(root,
+	join->plan.targetlist = fix_join_expr(root, (Plan *) join,
 										  join->plan.targetlist,
 										  outer_itlist,
 										  inner_itlist,
@@ -2441,7 +2475,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
 										  NUM_EXEC_TLIST((Plan *) join));
-	join->plan.qual = fix_join_expr(root,
+	join->plan.qual = fix_join_expr(root, (Plan *) join,
 									join->plan.qual,
 									outer_itlist,
 									inner_itlist,
@@ -2519,7 +2553,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 													  subplan_itlist,
 													  OUTER_VAR);
 			if (!newexpr)
-				newexpr = fix_upper_expr(root,
+				newexpr = fix_upper_expr(root, plan,
 										 (Node *) tle->expr,
 										 subplan_itlist,
 										 OUTER_VAR,
@@ -2528,7 +2562,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 										 NUM_EXEC_TLIST(plan));
 		}
 		else
-			newexpr = fix_upper_expr(root,
+			newexpr = fix_upper_expr(root, plan,
 									 (Node *) tle->expr,
 									 subplan_itlist,
 									 OUTER_VAR,
@@ -2542,7 +2576,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	plan->targetlist = output_targetlist;
 
 	plan->qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) plan->qual,
 					   subplan_itlist,
 					   OUTER_VAR,
@@ -3081,6 +3115,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *    the source relation elements, outer_itlist = NULL and acceptable_rel
  *    the target relation.
  *
+ * 'plan' is the Plan node to which the clauses belong
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
  *		or NULL
@@ -3097,6 +3132,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  */
 static List *
 fix_join_expr(PlannerInfo *root,
+			  Plan *plan,
 			  List *clauses,
 			  indexed_tlist *outer_itlist,
 			  indexed_tlist *inner_itlist,
@@ -3108,6 +3144,7 @@ fix_join_expr(PlannerInfo *root,
 	fix_join_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.outer_itlist = outer_itlist;
 	context.inner_itlist = inner_itlist;
 	context.acceptable_rel = acceptable_rel;
@@ -3234,7 +3271,19 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 															 context->num_exec),
 									 context);
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_join_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_join_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3258,6 +3307,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * expensive, so we don't want to try it in the common case where the
  * subplan tlist is just a flattened list of Vars.)
  *
+ * 'plan': the Plan node to which the expression belongs
  * 'node': the tree to be fixed (a target item or qual)
  * 'subplan_itlist': indexed target list for subplan (or index)
  * 'newvarno': varno to use for Vars referencing tlist elements
@@ -3271,6 +3321,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  */
 static Node *
 fix_upper_expr(PlannerInfo *root,
+			   Plan *plan,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
 			   int newvarno,
@@ -3281,6 +3332,7 @@ fix_upper_expr(PlannerInfo *root,
 	fix_upper_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.subplan_itlist = subplan_itlist;
 	context.newvarno = newvarno;
 	context.rtoffset = rtoffset;
@@ -3358,8 +3410,21 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
 															  (AlternativeSubPlan *) node,
 															  context->num_exec),
 									  context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_upper_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_upper_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3377,9 +3442,10 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * We also must perform opcode lookup and add regclass OIDs to
  * root->glob->relationOids.
  *
+ * 'plan': the ModifyTable node itself
  * 'rlist': the RETURNING targetlist to be fixed
  * 'topplan': the top subplan node that will be just below the ModifyTable
- *		node (note it's not yet passed through set_plan_refs)
+ *		node
  * 'resultRelation': RT index of the associated result relation
  * 'rtoffset': how much to increment varnos by
  *
@@ -3391,7 +3457,7 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * Note: resultRelation is not yet adjusted by rtoffset.
  */
 static List *
-set_returning_clause_references(PlannerInfo *root,
+set_returning_clause_references(PlannerInfo *root, Plan *plan,
 								List *rlist,
 								Plan *topplan,
 								Index resultRelation,
@@ -3415,7 +3481,7 @@ set_returning_clause_references(PlannerInfo *root,
 	 */
 	itlist = build_tlist_index_other_vars(topplan->targetlist, resultRelation);
 
-	rlist = fix_join_expr(root,
+	rlist = fix_join_expr(root, plan,
 						  rlist,
 						  itlist,
 						  NULL,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 67da7f091b5..d3f8fd7bd6c 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -206,6 +206,8 @@ typedef struct Plan
 	struct Plan *righttree;
 	/* Init Plan nodes (un-correlated expr subselects) */
 	List	   *initPlan;
+	/* Regular Sub Plan nodes (cf. "initPlan", above) */
+	List	   *subPlan;
 
 	/*
 	 * Information for management of parameter-change-driven rescanning
-- 
2.47.1
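For readers less familiar with the mutator pattern this patch extends, here is a minimal, self-contained sketch (simplified types and hypothetical names, not PostgreSQL's actual definitions) of the idea: the owning Plan is threaded through the mutator's context struct, so that whenever the tree walk encounters a SubPlan node, the node can be recorded on the plan's subPlan list for later working-memory assignment:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for PostgreSQL's Node/List/Plan types. */
typedef enum { T_Var, T_SubPlan } NodeTag;

typedef struct Node
{
	NodeTag		tag;
	struct Node *left;
	struct Node *right;
} Node;

typedef struct ListCell
{
	Node	   *ptr;
	struct ListCell *next;
} ListCell;

typedef struct List
{
	int			length;
	ListCell   *head;
} List;

typedef struct Plan
{
	List	   *subPlan;		/* SubPlans found in this Plan's expressions */
} Plan;

/* Mirrors fix_scan_expr_context: the context carries the owning Plan. */
typedef struct
{
	Plan	   *plan;
} fix_expr_context;

/* Append 'node' to 'list', allocating the List on first use. */
static List *
lappend(List *list, Node *node)
{
	ListCell   *cell = malloc(sizeof(ListCell));

	cell->ptr = node;
	cell->next = NULL;
	if (list == NULL)
		list = calloc(1, sizeof(List));
	if (list->head == NULL)
		list->head = cell;
	else
	{
		ListCell   *lc = list->head;

		while (lc->next)
			lc = lc->next;
		lc->next = cell;
	}
	list->length++;
	return list;
}

/* Recurse over the expression tree; record each SubPlan on the Plan. */
static Node *
fix_expr_mutator(Node *node, fix_expr_context *context)
{
	if (node == NULL)
		return NULL;
	node->left = fix_expr_mutator(node->left, context);
	node->right = fix_expr_mutator(node->right, context);
	if (node->tag == T_SubPlan && context->plan)
		context->plan->subPlan = lappend(context->plan->subPlan, node);
	return node;
}

/* Build a small qual tree containing two SubPlans; return how many the
 * mutator tracked on the owning Plan. */
int
run_demo(void)
{
	Node		sub1 = {T_SubPlan, NULL, NULL};
	Node		sub2 = {T_SubPlan, NULL, NULL};
	Node		var = {T_Var, &sub1, NULL};
	Node		root = {T_Var, &var, &sub2};
	Plan		plan = {NULL};
	fix_expr_context ctx = {&plan};

	fix_expr_mutator(&root, &ctx);
	return plan.subPlan ? plan.subPlan->length : 0;
}
```

This is only a sketch of the bookkeeping pattern; the real patch performs this recording inside fix_scan_expr_mutator, fix_join_expr_mutator, and fix_upper_expr_mutator, after expression_tree_mutator has produced the mutated node.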

Attachment: v01_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch (application/octet-stream)
From 8b694ac29aeac53df9c48f4d61983baebb9875f9 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:16:22 +0000
Subject: [PATCH 3/4] EXPLAIN (WORK_MEM ON) now shows working memory limit

This commit moves the working-memory limit that an executor node checks at
runtime from the "work_mem" and "hash_mem_multiplier" GUCs to a new field,
"work_mem", added to the Plan node. To preserve backward compatibility, it
copies the values derived from those GUCs into the new field.

The field lives on the Plan node, rather than the PlanState, because it
must be set before ExecInitNode() is called: many PlanStates consult their
working-memory limit while building their data structures during
initialization. The field is therefore set on the Plan node, between the
planning and execution phases.

Also modifies "EXPLAIN (WORK_MEM ON)" so that it displays this
working-memory limit.
---
 src/backend/commands/explain.c             |  59 ++++-
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         | 281 +++++++++++++++++++++
 src/backend/executor/nodeAgg.c             |  69 +++--
 src/backend/executor/nodeBitmapIndexscan.c |   3 +-
 src/backend/executor/nodeBitmapOr.c        |   3 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  23 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  12 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |   2 +
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  15 +-
 src/include/commands/explain.h             |   1 +
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   5 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/plannodes.h              |   8 +-
 src/include/nodes/primnodes.h              |   2 +
 src/test/regress/expected/workmem.out      | 184 ++++++++------
 30 files changed, 576 insertions(+), 151 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e09d7f868c9..07c6d34764b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,8 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
-static void compute_subplan_workmem(List *plans, double *workmem);
-static void compute_agg_workmem(Agg *agg, double *workmem);
+static void compute_subplan_workmem(List *plans, double *workmem, double *limit);
+static void compute_agg_workmem(Agg *agg, double *workmem, double *limit);
 
 
 
@@ -843,6 +843,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 	{
 		ExplainPropertyFloat("Total Working Memory", "kB",
 							 es->total_workmem, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
 	}
 
 	ExplainCloseGroup("Query", NULL, true, es);
@@ -1983,19 +1985,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->work_mem)
 	{
 		double		plan_workmem = 0.0;
+		double		plan_limit = 0.0;
 
 		/*
 		 * Include working memory used by this Plan's SubPlan objects, whether
 		 * they are included on the Plan's initPlan or subPlan lists.
 		 */
-		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
-		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem, &plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem, &plan_limit);
 
 		/* Include working memory used by this Plan, itself. */
 		switch (nodeTag(plan))
 		{
 			case T_Agg:
-				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				compute_agg_workmem((Agg *) plan, &plan_workmem, &plan_limit);
 				break;
 			case T_FunctionScan:
 				{
@@ -2003,6 +2006,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 					plan_workmem += (double) plan->workmem *
 						list_length(fscan->functions);
+					plan_limit += (double) plan->workmem_limit *
+						list_length(fscan->functions);
 					break;
 				}
 			case T_IncrementalSort:
@@ -2011,7 +2016,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				 * IncrementalSort creates two Tuplestores, each of
 				 * (estimated) size workmem.
 				 */
-				plan_workmem = (double) plan->workmem * 2;
+				plan_workmem += (double) plan->workmem * 2;
+				plan_limit += (double) plan->workmem_limit * 2;
 				break;
 			case T_RecursiveUnion:
 				{
@@ -2024,11 +2030,15 @@ ExplainNode(PlanState *planstate, List *ancestors,
 					 */
 					plan_workmem += (double) plan->workmem * 2 +
 						runion->hashWorkMem;
+					plan_limit += (double) plan->workmem_limit * 2 +
+						runion->hashWorkMemLimit;
 					break;
 				}
 			default:
 				if (plan->workmem > 0)
 					plan_workmem += plan->workmem;
+				if (plan->workmem_limit > 0)
+					plan_limit += plan->workmem_limit;
 				break;
 		}
 
@@ -2037,17 +2047,23 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		 * working memory.
 		 */
 		plan_workmem *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
 
 		es->total_workmem += plan_workmem;
+		es->total_workmem_limit += plan_limit;
 
-		if (plan_workmem > 0.0)
+		if (plan_workmem > 0.0 || plan_limit > 0.0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
-								 plan_workmem);
+				appendStringInfo(es->str, "  (work_mem=%.0f kB limit=%.0f kB)",
+								 plan_workmem, plan_limit);
 			else
+			{
 				ExplainPropertyFloat("Working Memory", "kB",
 									 plan_workmem, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
 		}
 	}
 
@@ -6062,29 +6078,39 @@ GetSerializationMetrics(DestReceiver *dest)
  * increments work_mem counters to include the SubPlan's working-memory.
  */
 static void
-compute_subplan_workmem(List *plans, double *workmem)
+compute_subplan_workmem(List *plans, double *workmem, double *limit)
 {
 	foreach_node(SubPlanState, sps, plans)
 	{
 		SubPlan    *sp = sps->subplan;
 
 		if (sp->hashtab_workmem > 0)
+		{
 			*workmem += sp->hashtab_workmem;
+			*limit += sp->hashtab_workmem_limit;
+		}
 
 		if (sp->hashnul_workmem > 0)
+		{
 			*workmem += sp->hashnul_workmem;
+			*limit += sp->hashnul_workmem_limit;
+		}
 	}
 }
 
-/* Compute an Agg's working memory estimate. */
+/* Compute an Agg's working memory estimate and limit. */
 typedef struct AggWorkMem
 {
 	double		input_sort_workmem;
+	double		input_sort_limit;
 
 	double		output_hash_workmem;
+	double		output_hash_limit;
 
 	int			num_sort_nodes;
+
 	double		max_output_sort_workmem;
+	double		output_sort_limit;
 }			AggWorkMem;
 
 static void
@@ -6092,6 +6118,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 {
 	/* Record memory used for input sort buffers. */
 	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+	mem->input_sort_limit += (double) agg->numSorts * agg->sortWorkMemLimit;
 
 	/* Record memory used for output data structures. */
 	switch (agg->aggstrategy)
@@ -6102,6 +6129,9 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			mem->max_output_sort_workmem =
 				Max(mem->max_output_sort_workmem, agg->plan.workmem);
 
+			if (mem->output_sort_limit == 0)
+				mem->output_sort_limit = agg->plan.workmem_limit;
+
 			++mem->num_sort_nodes;
 			break;
 		case AGG_HASHED:
@@ -6112,6 +6142,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			 * lifetime of the Agg.
 			 */
 			mem->output_hash_workmem += agg->plan.workmem;
+			mem->output_hash_limit += agg->plan.workmem_limit;
 			break;
 		default:
 
@@ -6135,7 +6166,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
  * value on the main Agg node.
  */
 static void
-compute_agg_workmem(Agg *agg, double *workmem)
+compute_agg_workmem(Agg *agg, double *workmem, double *limit)
 {
 	AggWorkMem	mem;
 	ListCell   *lc;
@@ -6153,9 +6184,13 @@ compute_agg_workmem(Agg *agg, double *workmem)
 	}
 
 	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+	*limit = mem.input_sort_limit + mem.output_hash_limit;
 
 	/* We'll have at most two sort buffers alive, at any time. */
 	*workmem += mem.num_sort_nodes > 2 ?
 		mem.max_output_sort_workmem * 2.0 :
 		mem.max_output_sort_workmem;
+	*limit += mem.num_sort_nodes > 2 ?
+		mem.output_sort_limit * 2.0 :
+		mem.output_sort_limit;
 }
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..c513b90fc77
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		what working-memory limit was imposed on it.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off its
+ *		corresponding Plan node's "workmem_limit" field. And we populate that
+ *		field by calling ExecAssignWorkMem(), from InitPlan(), before we
+ *		initialize the PlanState nodes.
+ *
+ * 		The "workmem_limit" field is a limit "per data structure," rather than
+ *		"per PlanState". This is needed because some SQL operators (e.g.,
+ *		RecursiveUnion and Agg) require multiple data structures, and sometimes
+ *		the data structures don't all share the same memory requirement. So we
+ *		cannot always just divide a "per PlanState" limit among individual data
+ *		structures. Instead, we maintain the limits on the data structures (and
+ *		EXPLAIN, for example, sums them up into a single, human-readable
+ *		number).
+ *
+ *		Note that the *Path's* "workmem" estimate is per SQL operator, but when
+ *		we convert that Path to a Plan, we also break its "workmem" estimate
+ *		down into per-data-structure estimates. Some operators therefore
+ *		require additional "limit" fields, which we add to the corresponding
+ *		Plan.
+ *
+ *		We store the "workmem_limit" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "limit" fields on the Plan, but
+ *		set them when we start execution, as part of InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+
+
+/* decls for local routines only used within this module */
+static void assign_workmem_subplan(SubPlan *subplan);
+static void assign_workmem_plan(Plan *plan);
+static void assign_workmem_agg(Agg *agg);
+static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+									bool *is_first_sort);
+
+/* end of local decls */
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Recursively assigns working memory to any Plans or SubPlans that need
+ *		it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	/* Assign working memory to the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			assign_workmem_plan(plan);
+	}
+
+	/* And assign working memory to the main Plan tree. */
+	assign_workmem_plan(plannedstmt->planTree);
+}
+
+static void
+assign_workmem_subplan(SubPlan *subplan)
+{
+	subplan->hashtab_workmem_limit = subplan->useHashTable ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+
+	subplan->hashnul_workmem_limit =
+		subplan->useHashTable && !subplan->unknownEqFalse ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+}
+
+static void
+assign_workmem_plan(Plan *plan)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Assign working memory to this node's (hashed) SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		assign_workmem_subplan(subplan);
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		assign_workmem_subplan(subplan);
+
+	/* Assign working memory to this node. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_FunctionScan:
+		case T_IncrementalSort:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+			if (plan->workmem > 0)
+				plan->workmem_limit = work_mem;
+			break;
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0)
+				plan->workmem_limit =
+					normalize_workmem(get_hash_memory_limit());
+			break;
+		case T_Agg:
+			assign_workmem_agg((Agg *) plan);
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+				plan->workmem_limit = work_mem;
+
+				if (runion->numCols > 0)
+				{
+					/* Also include memory for hash table. */
+					runion->hashWorkMemLimit =
+						normalize_workmem(get_hash_memory_limit());
+				}
+
+				break;
+			}
+		default:
+			Assert(plan->workmem == 0);
+			plan->workmem_limit = 0;
+			break;
+	}
+
+	/*
+	 * Assign working memory to this node's children. (Logic copied from
+	 * ExplainNode().)
+	 */
+	if (outerPlan(plan))
+		assign_workmem_plan(outerPlan(plan));
+
+	if (innerPlan(plan))
+		assign_workmem_plan(innerPlan(plan));
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				assign_workmem_plan(child);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_SubqueryScan:
+			assign_workmem_plan(((SubqueryScan *) plan)->subplan);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				assign_workmem_plan(child);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+assign_workmem_agg(Agg *agg)
+{
+	bool		is_first_sort = true;
+
+	/* Assign working memory to the main Agg node. */
+	assign_workmem_agg_node(agg,
+							true /* is_first */ ,
+							agg->chain == NULL /* is_last */ ,
+							&is_first_sort);
+
+	/* Assign working memory to any other grouping sets. */
+	foreach_node(Agg, aggnode, agg->chain)
+	{
+		assign_workmem_agg_node(aggnode,
+								false /* is_first */ ,
+								foreach_current_index(aggnode) ==
+								list_length(agg->chain) - 1 /* is_last */ ,
+								&is_first_sort);
+	}
+}
+
+static void
+assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+						bool *is_first_sort)
+{
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * Because nodeAgg.c will combine all AGG_HASHED nodes into a
+			 * single phase, it's easier to store the hash working-memory
+			 * limit on the first AGG_{HASHED,MIXED} node, and set it to zero
+			 * for all subsequent AGG_HASHED nodes.
+			 */
+			agg->plan.workmem_limit = is_first ?
+				normalize_workmem(get_hash_memory_limit()) : 0;
+			break;
+		case AGG_SORTED:
+
+			/*
+			 * Also store the sort-output working-memory limit on the first
+			 * AGG_SORTED node, and set it to zero for all subsequent
+			 * AGG_SORTED nodes.
+			 *
+			 * We'll need working memory to hold the "sort_out" only if this
+			 * isn't the last Agg node (if it is the last, no later phase
+			 * consumes our sorted output).
+			 */
+			agg->plan.workmem_limit = *is_first_sort && !is_last ?
+				work_mem : 0;
+
+			*is_first_sort = false;
+			break;
+		default:
+			break;
+	}
+
+	/* Also include memory needed to sort the input: */
+	if (agg->numSorts > 0)
+	{
+		Assert(agg->sortWorkMem > 0);
+
+		agg->sortWorkMemLimit = work_mem;
+	}
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..9e5bcf7ada4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -431,9 +434,10 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 									   int64 input_tuples, double input_card,
 									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
-static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
-							   int used_bits, double input_groups,
-							   double hashentrysize);
+static void hashagg_spill_init(HashAggSpill *spill,
+							   LogicalTapeSet *tapeset, int used_bits,
+							   double input_groups, double hashentrysize,
+							   Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +525,14 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem_limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit = aggstate->phases[1].aggnode->plan.workmem_limit;
+		Assert(workmem_limit > 0);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -577,7 +589,7 @@ fetch_input_tuple(AggState *aggstate)
  */
 static void
 initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
-					 AggStatePerGroup pergroupstate)
+					 AggStatePerGroup pergroupstate, size_t workMem)
 {
 	/*
 	 * Start a fresh sort operation for each DISTINCT/ORDER BY aggregate.
@@ -591,6 +603,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		Assert(workMem > 0);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +619,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workMem, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +629,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workMem, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -687,7 +700,8 @@ initialize_aggregates(AggState *aggstate,
 			AggStatePerTrans pertrans = &transstates[transno];
 			AggStatePerGroup pergroupstate = &pergroup[transno];
 
-			initialize_aggregate(aggstate, pertrans, pergroupstate);
+			initialize_aggregate(aggstate, pertrans, pergroupstate,
+								 aggstate->phase->aggnode->sortWorkMemLimit);
 		}
 	}
 }
@@ -1498,7 +1512,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1522,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1552,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1821,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1845,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1943,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 		}
 	}
 }
@@ -2014,9 +2031,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2095,7 +2112,8 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
 		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
 		AggStatePerGroup pergroupstate = &pergroup[transno];
 
-		initialize_aggregate(aggstate, pertrans, pergroupstate);
+		initialize_aggregate(aggstate, pertrans, pergroupstate,
+							 aggstate->phase->aggnode->sortWorkMemLimit);
 	}
 }
 
@@ -2156,7 +2174,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2649,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2739,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2938,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..5e006baa88d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		Assert(node->ss.ps.plan->workmem_limit > 0);
+		tbm = tbm_create((Size) node->ss.ps.plan->workmem_limit * 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..4ba32639f7d 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,8 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				Assert(subnode->plan->workmem_limit > 0);
+				result = tbm_create((Size) subnode->plan->workmem_limit * 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..2272185dce7 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, node->scan.plan.workmem_limit);
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..bbb93a8dd58 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d54cfe5fdbe..60afda04069 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -38,6 +38,7 @@
 #include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
@@ -449,6 +450,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -473,8 +475,12 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) node->plan.workmem_limit * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
+							worker_space_allowed,
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
@@ -601,6 +607,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,9 +665,10 @@ ExecHashTableCreate(HashState *state)
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+						size_t worker_space_allowed,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs,
@@ -690,9 +698,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -709,7 +717,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -813,8 +821,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		if (try_combined_hash_mem)
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
-									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed, false,
+									parallel_workers,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs,
@@ -1242,7 +1251,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..503d75e364b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												plannode->sort.plan.workmem_limit,
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  plannode->sort.plan.workmem_limit,
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..10f764c1bd5 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, node->ss.ps.plan->workmem_limit);
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..a3fc37745ca 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) node->plan.workmem_limit * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..96dc8d53db3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -52,6 +52,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) node->hashWorkMemLimit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +203,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
+	rustate->intermediate_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..7b71adf05dc 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) node->plan.workmem_limit * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..1da77ab1d6a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   plannode->plan.workmem_limit,
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  plannode->plan.workmem_limit,
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..73214501238 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -546,6 +546,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) subplan->hashtab_workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
@@ -575,6 +576,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) subplan->hashnul_workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..8a9e534a743 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 tstate->ss.ps.plan->workmem_limit);
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..76819d140ba 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 node->plan.workmem_limit);
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7c1fdde842b..fecea810b6e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1119,7 +1119,6 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 
-
 	/*
 	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
 	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
@@ -2875,7 +2874,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4323,6 +4323,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
+							get_hash_memory_limit(),
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
 							&space_allowed,
@@ -4651,15 +4652,19 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Estimate working memory needed for the hashtable (and hashnulls, if
-		 * needed). The logic below MUST match the logic in buildSubPlanHash()
-		 * and ExecInitSubPlan().
+		 * needed). The "nbuckets" estimate must match the logic in
+		 * buildSubPlanHash() and ExecInitSubPlan().
 		 */
 		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
 		if (nbuckets < 1)
 			nbuckets = 1;
 
+		/*
+		 * This estimate must match the logic in subpath_is_hashable() (and
+		 * see comments there).
+		 */
 		hashentrysize = MAXALIGN(plan->plan_width) +
-			MAXALIGN(SizeofMinimalTupleHeader);
+			MAXALIGN(SizeofHeapTupleHeader);
 
 		subplan->hashtab_workmem =
 			normalize_workmem((double) nbuckets * hashentrysize);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 50454952eb2..498a1a3a4b6 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -72,6 +72,7 @@ typedef struct ExplainState
 								 * entry */
 	int			num_workers;	/* # of worker processes planned to use */
 	double		total_workmem;	/* total working memory estimate (in bytes) */
+	double		total_workmem_limit;	/* total working-memory limit (in kB) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* might be shared with other workers */
+	size_t		worker_space_allowed;	/* exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..728006b3ff5 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,8 +329,9 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
-								uint64 *ngroups_limit, int *num_partitions);
+								int used_bits, Size hash_mem_limit,
+								Size *mem_limit, uint64 *ngroups_limit,
+								int *num_partitions);
 
 /* parallel instrumentation support */
 extern void ExecAggEstimate(AggState *node, ParallelContext *pcxt);
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fc5b20994dd..6a40730c065 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -57,9 +57,10 @@ extern bool ExecParallelScanHashTableForUnmatched(HashJoinState *hjstate,
 extern void ExecHashTableReset(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+									size_t worker_space_allowed,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d3f8fd7bd6c..445953c77d3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,7 @@ typedef struct Plan
 	Cost		total_cost;
 
 	int			workmem;		/* estimated work_mem (in KB) */
+	int			workmem_limit;	/* work_mem limit per parallel worker (in KB) */
 
 	/*
 	 * planner's estimate of result size of this plan step
@@ -237,7 +238,7 @@ typedef struct Plan
 
 /* ----------------
  *	 Result node -
 *		If no outer plan, evaluate a variable-free targetlist.
  *		If outer plan, return tuples from outer plan (after a level of
  *		projection as shown by targetlist).
  *
@@ -433,6 +434,8 @@ typedef struct RecursiveUnion
 
 	/* estimated work_mem for hash table (in KB) */
 	int			hashWorkMem;
+	/* work_mem limit for hash table (in KB) */
+	int			hashWorkMemLimit;
 } RecursiveUnion;
 
 /* ----------------
@@ -1158,6 +1161,9 @@ typedef struct Agg
 	/* estimated work_mem needed to sort each input (in KB) */
 	int			sortWorkMem;
 
+	/* work_mem limit to sort one input (in KB) */
+	int			sortWorkMemLimit;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index b7d6b0fe7dc..7232d07e8b8 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1111,6 +1111,8 @@ typedef struct SubPlan
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
 	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
 	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
+	int			hashtab_workmem_limit;	/* hashtable work_mem limit (in KB) */
+	int			hashnul_workmem_limit;	/* hashnulls work_mem limit (in KB) */
 } SubPlan;
 
 /*
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
index 215180808f4..c1a3bdd93d2 100644
--- a/src/test/regress/expected/workmem.out
+++ b/src/test/regress/expected/workmem.out
@@ -29,17 +29,18 @@ order by unique1;
 ');
                          workmem_filter                          
 -----------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(10 rows)
+ Total Working Memory Limit: 12288 kB
+(11 rows)
 
 select *
 from onek
@@ -64,18 +65,19 @@ order by unique1;
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
          ->  Unique
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1, "*VALUES*".column2
                      ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 8192 kB
+(12 rows)
 
 select *
 from onek
@@ -95,17 +97,18 @@ explain (costs off, work_mem on)
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
 ');
-             workmem_filter              
------------------------------------------
+                    workmem_filter                     
+-------------------------------------------------------
  Limit
-   ->  Incremental Sort  (work_mem=N kB)
+   ->  Incremental Sort  (work_mem=N kB limit=8192 kB)
          Sort Key: tenk1.four, tenk1.ten
          Presorted Key: tenk1.four
-         ->  Sort  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: tenk1.four
                ->  Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 12288 kB
+(9 rows)
 
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
@@ -131,16 +134,17 @@ where exists (select 1 from tenk1 t3
    ->  Nested Loop
          ->  Hash Join
                Hash Cond: (t3.thousand = t1.unique1)
-               ->  HashAggregate  (work_mem=N kB)
+               ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                      Group Key: t3.thousand, t3.tenthous
                      ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
-               ->  Hash  (work_mem=N kB)
+               ->  Hash  (work_mem=N kB limit=8192 kB)
                      ->  Index Only Scan using onek_unique1 on onek t1
                            Index Cond: (unique1 < 1)
          ->  Index Only Scan using tenk1_hundred on tenk1 t2
                Index Cond: (hundred = t3.tenthous)
  Total Working Memory: N kB
-(13 rows)
+ Total Working Memory Limit: 16384 kB
+(14 rows)
 
 select count(*) from (
 select t1.unique1, t2.hundred
@@ -165,23 +169,24 @@ from int4_tbl t1, int4_tbl t2
 where t4.f1 is null
 ) t;
 ');
-                       workmem_filter                        
--------------------------------------------------------------
+                              workmem_filter                              
+--------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Nested Loop Left Join
                Filter: (t4.f1 IS NULL)
                ->  Seq Scan on int4_tbl t2
-               ->  Materialize  (work_mem=N kB)
+               ->  Materialize  (work_mem=N kB limit=4096 kB)
                      ->  Nested Loop Left Join
                            Join Filter: (t3.f1 > 1)
                            ->  Seq Scan on int4_tbl t3
                                  Filter: (f1 > 0)
-                           ->  Materialize  (work_mem=N kB)
+                           ->  Materialize  (work_mem=N kB limit=4096 kB)
                                  ->  Seq Scan on int4_tbl t4
          ->  Seq Scan on int4_tbl t1
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 8192 kB
+(15 rows)
 
 select count(*) from (
 select t1.f1
@@ -204,16 +209,17 @@ group by grouping sets((a, b), (a));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Hash Key: "*VALUES*".column1, "*VALUES*".column2
                Hash Key: "*VALUES*".column1
                ->  Values Scan on "*VALUES*"
                      Filter: (column1 = column2)
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 16384 kB
+(10 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1), (2, 2)) as t (a, b) where a = b
@@ -236,10 +242,10 @@ group by grouping sets((a, b), (a), (b), (c), (d));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  GroupAggregate  (work_mem=N kB)
+         ->  GroupAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                Group Key: "*VALUES*".column1
                Sort Key: "*VALUES*".column2
@@ -248,12 +254,13 @@ group by grouping sets((a, b), (a), (b), (c), (d));
                  Group Key: "*VALUES*".column3
                Sort Key: "*VALUES*".column4
                  Group Key: "*VALUES*".column4
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1
                      ->  Values Scan on "*VALUES*"
                            Filter: (column1 = column2)
  Total Working Memory: N kB
-(17 rows)
+ Total Working Memory Limit: 20480 kB
+(18 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
@@ -282,17 +289,18 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select length(stringu1) from tenk1 group by length(stringu1);
 ');
-                   workmem_filter                   
-----------------------------------------------------
- Finalize HashAggregate  (work_mem=N kB)
+                          workmem_filter                           
+-------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB limit=8192 kB)
    Group Key: (length((stringu1)::text))
    ->  Gather
          Workers Planned: 4
-         ->  Partial HashAggregate  (work_mem=N kB)
+         ->  Partial HashAggregate  (work_mem=N kB limit=40960 kB)
                Group Key: length((stringu1)::text)
                ->  Parallel Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 49152 kB
+(9 rows)
 
 select length(stringu1) from tenk1 group by length(stringu1);
  length 
@@ -307,12 +315,13 @@ reset max_parallel_workers_per_gather;
 -- Agg (simple) [no work_mem]
 explain (costs off, work_mem on)
 select MAX(length(stringu1)) from tenk1;
-         QUERY PLAN         
-----------------------------
+            QUERY PLAN            
+----------------------------------
  Aggregate
    ->  Seq Scan on tenk1
  Total Working Memory: 0 kB
-(3 rows)
+ Total Working Memory Limit: 0 kB
+(4 rows)
 
 select MAX(length(stringu1)) from tenk1;
  max 
@@ -328,12 +337,13 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 ) t;
 ');
-                      workmem_filter                       
------------------------------------------------------------
+                             workmem_filter                              
+-------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series a  (work_mem=N kB)
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 4096 kB
+(4 rows)
 
 select count(*) from (
 select sum(n) over(partition by m)
@@ -352,12 +362,13 @@ from rows from(generate_series(1, 5),
                generate_series(2, 10),
                generate_series(4, 15));
 ');
-                     workmem_filter                      
----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series  (work_mem=N kB)
+   ->  Function Scan on generate_series  (work_mem=N kB limit=12288 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 12288 kB
+(4 rows)
 
 select count(*)
 from rows from(generate_series(1, 5),
@@ -384,13 +395,14 @@ SELECT  xmltable.*
                                   unit text PATH ''SIZE/@unit'',
                                   premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
 ');
-                      workmem_filter                      
-----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Nested Loop
    ->  Seq Scan on xmldata
-   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(4 rows)
+ Total Working Memory Limit: 4096 kB
+(5 rows)
 
 SELECT  xmltable.*
    FROM (SELECT data FROM xmldata) x,
@@ -418,7 +430,8 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
    ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
          Filter: (unique2 <> 10)
  Total Working Memory: 0 kB
-(5 rows)
+ Total Working Memory Limit: 0 kB
+(6 rows)
 
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1 
@@ -435,11 +448,12 @@ select count(*) from
                           workmem_filter                          
 ------------------------------------------------------------------
  Aggregate
-   ->  HashSetOp Intersect  (work_mem=N kB)
+   ->  HashSetOp Intersect  (work_mem=N kB limit=8192 kB)
          ->  Seq Scan on tenk1
          ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
  Total Working Memory: N kB
-(5 rows)
+ Total Working Memory Limit: 8192 kB
+(6 rows)
 
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -456,23 +470,24 @@ cross join lateral (with recursive x(a) as (
           select o.four as a union select a + 1 from x where a < 10)
     select * from x) ss where o.ten = 1;
 ');
-                       workmem_filter                       
-------------------------------------------------------------
+                              workmem_filter                               
+---------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Memoize  (work_mem=N kB)
+         ->  Memoize  (work_mem=N kB limit=8192 kB)
                Cache Key: o.four
                Cache Mode: binary
-               ->  CTE Scan on x  (work_mem=N kB)
+               ->  CTE Scan on x  (work_mem=N kB limit=4096 kB)
                      CTE x
-                       ->  Recursive Union  (work_mem=N kB)
+                       ->  Recursive Union  (work_mem=N kB limit=16384 kB)
                              ->  Result
                              ->  WorkTable Scan on x x_1
                                    Filter: (a < 10)
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 28672 kB
+(15 rows)
 
 select sum(o.four), sum(ss.a) from onek o
 cross join lateral (with recursive x(a) as (
@@ -491,20 +506,21 @@ WITH q1(x,y) AS (
   )
 SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
 ');
-                   workmem_filter                   
-----------------------------------------------------
+                          workmem_filter                          
+------------------------------------------------------------------
  Aggregate
    CTE q1
-     ->  HashAggregate  (work_mem=N kB)
+     ->  HashAggregate  (work_mem=N kB limit=8192 kB)
            Group Key: tenk1.hundred
            ->  Seq Scan on tenk1
    InitPlan 2
      ->  Aggregate
-           ->  CTE Scan on q1 qsub  (work_mem=N kB)
-   ->  CTE Scan on q1  (work_mem=N kB)
+           ->  CTE Scan on q1 qsub  (work_mem=N kB limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB limit=4096 kB)
          Filter: ((y)::numeric > (InitPlan 2).col1)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 16384 kB
+(12 rows)
 
 WITH q1(x,y) AS (
     SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
@@ -522,15 +538,16 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 limit 5;
 ');
-                            workmem_filter                             
------------------------------------------------------------------------
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
  Limit
-   ->  WindowAgg  (work_mem=N kB)
-         ->  Sort  (work_mem=N kB)
+   ->  WindowAgg  (work_mem=N kB limit=4096 kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: ((a.n < 3))
-               ->  Function Scan on generate_series a  (work_mem=N kB)
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 12288 kB
+(7 rows)
 
 select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
@@ -560,20 +577,21 @@ select * from tenk1 a join tenk1 b on
          ->  Bitmap Heap Scan on tenk1 b
                Recheck Cond: ((hundred = 4) OR (unique1 = 2))
                ->  BitmapOr
-                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB limit=4096 kB)
                            Index Cond: (hundred = 4)
-                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                            Index Cond: (unique1 = 2)
-         ->  Materialize  (work_mem=N kB)
+         ->  Materialize  (work_mem=N kB limit=4096 kB)
                ->  Bitmap Heap Scan on tenk1 a
                      Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
                      ->  BitmapOr
-                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique2 = 3)
-                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique1 = 1)
  Total Working Memory: N kB
-(19 rows)
+ Total Working Memory Limit: 20480 kB
+(20 rows)
 
 select count(*) from (
 select * from tenk1 a join tenk1 b on
@@ -589,15 +607,16 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
 ');
-       workmem_filter       
-----------------------------
- Result  (work_mem=N kB)
+             workmem_filter             
+----------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 1
      ->  Append
            ->  Result
            ->  Result
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 16384 kB
+(7 rows)
 
 select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
  ?column? 
@@ -612,16 +631,17 @@ select 1 = any (select (select 1) where 1 = any (select 1));
 ');
                          workmem_filter                         
 ----------------------------------------------------------------
- Result  (work_mem=N kB)
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 3
-     ->  Result  (work_mem=N kB)
+     ->  Result  (work_mem=N kB limit=8192 kB)
            One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
            InitPlan 1
              ->  Result
            SubPlan 2
              ->  Result
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 24576 kB
+(10 rows)
 
 select 1 = any (select (select 1) where 1 = any (select 1));
  ?column? 
-- 
2.47.1

Attachment: v01_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream)
From a93e25a8a88dfbde6cc5347b7ea318c824675339 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:41:31 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 654 ++++++++++++++++++++++++++
 src/backend/executor/execWorkmem.c   |  37 +-
 src/include/executor/executor.h      |   4 +
 8 files changed, 1716 insertions(+), 10 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..a2c6d3be4d2
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=25600 kB)
+   ->  Sort  (work_mem=N kB limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20480 kB)
+   ->  Sort  (work_mem=N kB limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=102400 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=102399 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                    
+--------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=34133 kB)
+         ->  Sort  (work_mem=N kB limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=34134 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=1024 kB)
+   ->  Sort  (work_mem=N kB limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=819 kB)
+   ->  Sort  (work_mem=N kB limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=820 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=4095 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=1365 kB)
+         ->  Sort  (work_mem=N kB limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=1366 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20 kB)
+   ->  Sort  (work_mem=N kB limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=16 kB)
+   ->  Sort  (work_mem=N kB limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=80 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=78 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                  workmem_filter                                   
+-----------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=26 kB)
+         ->  Sort  (work_mem=N kB limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=27 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+           workmem_filter            
+-------------------------------------
+ Result  (work_mem=N kB limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..e6dbc35bf10
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..c758e49c162
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,654 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Distribute workmem.query_work_mem among a query's memory-consuming
+ *	  plan nodes, via the ExecAssignWorkMem_hook.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to provide to "
+							"each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+/* Compute an Agg's working memory estimate and limit. */
+typedef struct AggWorkMem
+{
+	uint64		hash_workmem;
+	int		   *hash_limit;
+
+	int			num_sorts;
+	int			max_sort_workmem;
+	int		   *sort_limit;
+}			AggWorkMem;
+
+static void
+workmem_analyze_agg_node(Agg *agg, AggWorkMem * mem,
+						 WorkMemStats * workmem_stats)
+{
+	if (agg->sortWorkMem > 0 || agg->sortWorkMemLimit > 0)
+	{
+		/* Record memory used for input sort buffers. */
+		Target	   *target = make_target(agg->sortWorkMem,
+										 &agg->sortWorkMemLimit,
+										 agg->numSorts);
+
+		add_target(workmem_stats, target);
+	}
+
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			mem->hash_workmem += agg->plan.workmem;
+
+			/* Read hash limit from the first AGG_HASHED node. */
+			if (mem->hash_limit == NULL)
+				mem->hash_limit = &agg->plan.workmem_limit;
+
+			break;
+		case AGG_SORTED:
+
+			++mem->num_sorts;
+
+			mem->max_sort_workmem = Max(mem->max_sort_workmem, agg->plan.workmem);
+
+			/* Read sort limit from the first AGG_SORTED node. */
+			if (mem->sort_limit == NULL)
+				mem->sort_limit = &agg->plan.workmem_limit;
+
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze_agg(Agg *agg, int num_workers, WorkMemStats * workmem_stats)
+{
+	AggWorkMem	mem;
+
+	memset(&mem, 0, sizeof(mem));
+
+	/* Analyze main Agg node. */
+	workmem_analyze_agg_node(agg, &mem, workmem_stats);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		workmem_analyze_agg_node(aggnode, &mem, workmem_stats);
+
+	/*
+	 * Working memory for hash tables, if needed. All hash tables share the
+	 * same limit:
+	 */
+	if (mem.hash_workmem > 0 || mem.hash_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.hash_workmem, mem.hash_limit,
+						1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	/*
+	 * Working memory for (output) sort buffers, if needed. We'll need at most
+	 * 2 sort buffers:
+	 */
+	if (mem.max_sort_workmem > 0 || mem.sort_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.max_sort_workmem, mem.sort_limit,
+						Min(mem.num_sorts, 2) * (1 + num_workers));
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_subplan(SubPlan *subplan, int num_workers,
+						WorkMemStats * workmem_stats)
+{
+	if (subplan->hashtab_workmem > 0 || subplan->hashtab_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash table */
+		Target	   *target = make_target(subplan->hashtab_workmem,
+										 &subplan->hashtab_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	if (subplan->hashnul_workmem > 0 || subplan->hashnul_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash-NULL table */
+		Target	   *target = make_target(subplan->hashnul_workmem,
+										 &subplan->hashnul_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_plan(Plan *plan, int num_workers, WorkMemStats * workmem_stats)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Analyze this node's SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	if (IsA(plan, Gather) || IsA(plan, GatherMerge))
+	{
+		/*
+		 * Parallel query apparently does not run InitPlans in parallel. Well,
+		 * currently, Gather and GatherMerge Plan nodes don't contain any
+		 * quals, so they can't contain SubPlans at all; so maybe we should
+		 * move this below the SubPlan-analysis loop, as well? For now, to
+		 * maintain consistency with explain.c, we'll just leave this here.
+		 */
+		Assert(num_workers == 0);
+
+		if (IsA(plan, Gather))
+			num_workers = ((Gather *) plan)->num_workers;
+		else
+			num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	/* Analyze this node's working memory. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 1 + num_workers);
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_Agg:
+			workmem_analyze_agg((Agg *) plan, num_workers, workmem_stats);
+			break;
+		case T_FunctionScan:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				int			nfuncs =
+					list_length(((FunctionScan *) plan)->functions);
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 nfuncs * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_IncrementalSort:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 2 * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+				Target	   *target;
+
+				/* working memory for two tuplestores */
+				target = make_target(plan->workmem, &plan->workmem_limit,
+									 2 * (1 + num_workers));
+				add_target(workmem_stats, target);
+
+				/* working memory for a hash table, if needed */
+				if (runion->hashWorkMem > 0 || runion->hashWorkMemLimit > 0)
+				{
+					target = make_target(runion->hashWorkMem,
+										 &runion->hashWorkMemLimit,
+										 1 + num_workers);
+					add_target(workmem_stats, target);
+				}
+			}
+			break;
+		default:
+			Assert(plan->workmem == 0);
+			Assert(plan->workmem_limit == 0);
+			break;
+	}
+
+	/* Now analyze this Plan's children. */
+	if (outerPlan(plan))
+		workmem_analyze_plan(outerPlan(plan), num_workers, workmem_stats);
+
+	if (innerPlan(plan))
+		workmem_analyze_plan(innerPlan(plan), num_workers, workmem_stats);
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_SubqueryScan:
+			workmem_analyze_plan(((SubqueryScan *) plan)->subplan,
+								 num_workers, workmem_stats);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	/* Analyze the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			workmem_analyze_plan(plan, 0 /* num_workers */ , workmem_stats);
+	}
+
+	/* Analyze the main Plan tree itself. */
+	workmem_analyze_plan(plannedstmt->planTree, 0 /* num_workers */ ,
+						 workmem_stats);
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit < target->priority, so we restricted this
+			 * target to less memory than (we estimate) it needs, then the
+			 * target_ratio will stay the same, since, letting A = remaining,
+			 * B = limit, and R = ratio, we'll have:
+			 *
+			 * R=A/B <=> A=R*B <=> A-R*X = R*B - R*X <=> A-R*X = R * (B-X) <=>
+			 * R = (A-R*X) / (B-X)
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if target_limit == target->priority, so we didn't need to
+			 * restrict this target beyond its priority estimate, then the
+			 * target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	/*
+	 * Start by assigning default working memory to all of this query's Plan
+	 * nodes.
+	 */
+	standard_ExecAssignWorkMem(plannedstmt);
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) a->ptr_value)->workmem -
+			((Target *) b->ptr_value)->workmem;
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) b->ptr_value)->workmem -
+			((Target *) a->ptr_value)->workmem;
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index c513b90fc77..8a3e52c8968 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -57,6 +57,9 @@
 #include "optimizer/cost.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
 /* decls for local routines only used within this module */
 static void assign_workmem_subplan(SubPlan *subplan);
 static void assign_workmem_plan(Plan *plan);
@@ -81,16 +84,32 @@ static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We check IsParallelWorker() here, rather than inside
+		 * standard_ExecAssignWorkMem(), in case the hook wants to re-assign
+		 * memory on parallel workers (and perhaps wants to call
+		 * standard_ExecAssignWorkMem() first, as well).
+		 */
+		if (IsParallelWorker())
+			return;
 
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
 	/* Assign working memory to the Plans referred to by SubPlan objects. */
 	foreach_ptr(Plan, plan, plannedstmt->subplans)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

#16 James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#15)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, Feb 24, 2025 at 12:46 PM James Hunter <james.hunter.pg@gmail.com> wrote:

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

The only change in revision 2 is to Patch 3: adding 'execWorkmem.c' to
meson.build. As I use the gcc "Makefile" build on my dev machine, I did
not notice this omission until CFBot complained.

I bumped rev numbers on all other patches, even though they have not
changed, because I am unfamiliar with CFBot and am trying not to
confuse it (to minimize unnecessary email churn...)

Anyway, the patch set Works On My PC, and with any luck it will work
on CFBot as well now.

James

Attachments:

v02_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch
From 099366618d3f15f69bd9542d7d31f82148889a11 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 24 Jan 2025 20:48:39 +0000
Subject: [PATCH 1/4] EXPLAIN now takes "work_mem" option, to display estimated
 working memory

This commit adds option "WORK_MEM" to the existing EXPLAIN command. When
set to ON, the EXPLAIN output will include text of the form "(work_mem=
5.67 kB)" on every plan node that uses working memory.

The output is an *estimate*, typically based on the estimated number of
input rows for that plan node.

Normalize "working-memory" estimates to a minimum of 64 KB

The minimum possible value of the "work_mem" GUC is 64 KB. This commit
changes the tracking + output for "EXPLAIN (WORK_MEM ON)" so that it
reports a minimum of 64 KB for every node or subcomponent that requires
working memory.

It also rounds "nbytes" up to the nearest whole KB (= ceil()), and
changes the EXPLAIN output to report a whole integer, rather than to
two decimal places. Note that 1 KB = 1.6 percent of the 64 KB
minimum.

To allow future optimizers to make decisions at Path time, this commit
aggregates each Path's total working memory into the Path's "workmem"
field. To allow the executor to restrict memory usage per individual
data structure, it then breaks that total down into per-data-structure
working memory on the Plan.

Also adds a "Total Working Memory" line at the bottom of the
plan output.
---
 src/backend/commands/explain.c          | 207 ++++++++
 src/backend/executor/nodeHash.c         |  15 +-
 src/backend/nodes/tidbitmap.c           |  18 +
 src/backend/optimizer/path/costsize.c   | 387 ++++++++++++++-
 src/backend/optimizer/plan/createplan.c | 215 +++++++-
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 +-
 src/include/commands/explain.h          |   3 +
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/pathnodes.h           |  11 +
 src/include/nodes/plannodes.h           |  11 +
 src/include/nodes/primnodes.h           |   2 +
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  12 +-
 src/include/optimizer/planmain.h        |   2 +-
 src/test/regress/expected/workmem.out   | 631 ++++++++++++++++++++++++
 src/test/regress/parallel_schedule      |   2 +-
 src/test/regress/sql/workmem.sql        | 303 ++++++++++++
 18 files changed, 1828 insertions(+), 60 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c0d614866a9..e09d7f868c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,6 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+static void compute_subplan_workmem(List *plans, double *workmem);
+static void compute_agg_workmem(Agg *agg, double *workmem);
 
 
 
@@ -235,6 +237,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -835,6 +839,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory", "kB",
+							 es->total_workmem, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1970,6 +1980,77 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_workmem = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				break;
+			case T_FunctionScan:
+				{
+					FunctionScan *fscan = (FunctionScan *) plan;
+
+					plan_workmem += (double) plan->workmem *
+						list_length(fscan->functions);
+					break;
+				}
+			case T_IncrementalSort:
+
+				/*
+				 * IncrementalSort creates two Tuplestores, each of
+				 * (estimated) size workmem.
+				 */
+				plan_workmem += (double) plan->workmem * 2;
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					/*
+					 * RecursiveUnion creates two Tuplestores, each of
+					 * (estimated) size workmem, plus (possibly) a hash table
+					 * of size hashWorkMem.
+					 */
+					plan_workmem += (double) plan->workmem * 2 +
+						runion->hashWorkMem;
+					break;
+				}
+			default:
+				if (plan->workmem > 0)
+					plan_workmem += plan->workmem;
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_workmem *= (1 + es->num_workers);
+
+		es->total_workmem += plan_workmem;
+
+		if (plan_workmem > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
+								 plan_workmem);
+			else
+				ExplainPropertyFloat("Working Memory", "kB",
+									 plan_workmem, 0, es);
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2536,6 +2617,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlans, every node below us runs with the number of
+		 * parallel workers we planned for.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = ((Gather *) plan)->num_workers;
+		else
+			es->num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2592,6 +2687,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -5952,3 +6053,109 @@ GetSerializationMetrics(DestReceiver *dest)
 
 	return empty;
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a SubPlan object
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments the caller's workmem counter to include the SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *workmem)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem > 0)
+			*workmem += sp->hashtab_workmem;
+
+		if (sp->hashnul_workmem > 0)
+			*workmem += sp->hashnul_workmem;
+	}
+}
+
+/* Compute an Agg's working memory estimate. */
+typedef struct AggWorkMem
+{
+	double		input_sort_workmem;
+
+	double		output_hash_workmem;
+
+	int			num_sort_nodes;
+	double		max_output_sort_workmem;
+}			AggWorkMem;
+
+static void
+compute_agg_workmem_node(Agg *agg, AggWorkMem *mem)
+{
+	/* Record memory used for input sort buffers. */
+	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+
+	/* Record memory used for output data structures. */
+	switch (agg->aggstrategy)
+	{
+		case AGG_SORTED:
+
+			/* We'll have at most two sort buffers alive, at any time. */
+			mem->max_output_sort_workmem =
+				Max(mem->max_output_sort_workmem, agg->plan.workmem);
+
+			++mem->num_sort_nodes;
+			break;
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * All hash tables created by "hash" phases are kept for the
+			 * lifetime of the Agg.
+			 */
+			mem->output_hash_workmem += agg->plan.workmem;
+			break;
+		default:
+
+			/*
+			 * "Plain" phases don't use working memory (they output a single
+			 * aggregated tuple).
+			 */
+			break;
+	}
+}
+
+/*
+ * compute_agg_workmem - compute total workmem for an Agg node
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node. However, to make life
+ * easier for the executor, we store the estimated working memory ("workmem")
+ * on each individual Agg node.
+ *
+ * This function computes the combined workmem, so that we can display this
+ * value on the main Agg node.
+ */
+static void
+compute_agg_workmem(Agg *agg, double *workmem)
+{
+	AggWorkMem	mem;
+	ListCell   *lc;
+
+	memset(&mem, 0, sizeof(mem));
+
+	compute_agg_workmem_node(agg, &mem);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach(lc, agg->chain)
+	{
+		Agg		   *aggnode = (Agg *) lfirst(lc);
+
+		compute_agg_workmem_node(aggnode, &mem);
+	}
+
+	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+
+	/* We'll have at most two sort buffers alive, at any time. */
+	*workmem += mem.num_sort_nodes > 2 ?
+		mem.max_output_sort_workmem * 2.0 :
+		mem.max_output_sort_workmem;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..d54cfe5fdbe 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -452,6 +453,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -477,7 +479,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -661,7 +663,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -792,6 +795,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	*workmem = normalize_workmem(inner_rel_bytes + bucket_bytes);
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -811,7 +817,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
@@ -929,7 +936,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		*space_allowed = (*space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 66b3c387d53..43df31cdb21 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1558,6 +1558,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..7c1fdde842b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -104,6 +104,7 @@
 #include "optimizer/plancat.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +201,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1118,18 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1605,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1672,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1778,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1864,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1891,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(nrterm->rows,
+												 nrterm->pathtarget->width),
+							  relation_byte_size(rterm->rows,
+												 rterm->pathtarget->width))
+						  * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_workmem(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1961,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1981,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2039,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2069,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2144,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2185,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full") and one that sorts only suffix columns ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_workmem(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2217,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2230,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2522,6 +2591,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2592,6 +2662,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_workmem(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2866,6 +2939,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSorts > 0)
+	{
+		path->workmem += (double) aggcosts->numSorts *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3100,7 +3186,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3182,6 +3268,11 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+
+	/* We need to store a window of "startup_tuples" tuples in a Tuplestore. */
+	path->workmem =
+		normalize_workmem(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3336,6 +3427,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3799,6 +3891,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4170,6 +4270,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4227,7 +4328,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4258,6 +4360,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4266,8 +4369,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4284,6 +4387,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4510,6 +4614,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4532,6 +4637,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4541,6 +4649,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Estimate working memory needed for the hashtable (and hashnulls, if
+		 * needed). The logic below MUST match the logic in buildSubPlanHash()
+		 * and ExecInitSubPlan().
+		 */
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem =
+			normalize_workmem((double) nbuckets * hashentrysize);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem =
+				normalize_workmem((double) nbuckets * hashentrysize);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
@@ -6424,7 +6563,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6603,3 +6742,197 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for an aggregate's input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they need to sort their
+ * input, before they can process it. We need one sort buffer per such
+ * aggregate, and this function returns that sort buffer's (estimated) size (in
+ * KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	/* Account for size of one buffer needed to sort the input. */
+	return normalize_workmem(input_tuples * input_width);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_workmem(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_workmem(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total work_mem needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_workmem
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes to a minimum of 64 (KB), rounding up to the nearest whole KB.
+ */
+int
+normalize_workmem(double nbytes)
+{
+	double		workmem;
+
+	/*
+	 * We'll assign working memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nbytes / 1024.0);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * "work_mem" GUC for a long time; so, by now, some components probably
+	 * rely on this minimum implicitly, and would fail if we tried to assign
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..973b86371ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -1853,6 +1854,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 groupCollations,
 								 NIL,
 								 NIL,
+								 0, /* numSorts */
 								 best_path->path.rows,
 								 0,
 								 subplan);
@@ -1911,6 +1913,15 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 	/* Copy cost data from Path to Plan */
 	copy_generic_path_info(plan, &best_path->path);
 
+	if (IsA(plan, Unique))
+	{
+		/*
+		 * We assigned "workmem" to the Sort subplan. Clear it from the top-
+		 * level Unique node, to avoid double-counting.
+		 */
+		plan->workmem = 0;
+	}
+
 	return plan;
 }
 
@@ -2228,6 +2239,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	/*
+	 * IncrementalSort creates two sort buffers, which the Path's "workmem"
+	 * estimate combined into a single value. Split it into two now.
+	 */
+	plan->sort.plan.workmem =
+		normalize_workmem(best_path->spath.path.workmem * 1024.0 / 2);
+
 	return plan;
 }
 
@@ -2333,12 +2351,29 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 												subplan->targetlist),
 					NIL,
 					NIL,
+					best_path->numSorts,
 					best_path->numGroups,
 					best_path->transitionSpace,
 					subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the overall workmem estimate that we copied from the Path
+	 * with finer-grained estimates.
+	 */
+	plan->plan.workmem =
+		compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+								   plan->transitionSpace, subplan->plan_rows,
+								   subplan->plan_width, false /* cost_sort */ );
+
+	/* Also include estimated memory needed to sort the input: */
+	if (plan->numSorts > 0)
+	{
+		plan->sortWorkMem = compute_agg_input_workmem(subplan->plan_rows,
+													  subplan->plan_width);
+	}
+
 	return plan;
 }
 
@@ -2457,8 +2492,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2480,19 +2516,20 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								best_path->numSorts,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,7 +2540,36 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
-			chain = lappend(chain, agg_plan);
+			/*
+			 * If we're an AGG_SORTED, but not the last, we need to cost
+			 * working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/*
+			 * Although this side node doesn't need accurate cost estimates,
+			 * it does need an accurate *memory* estimate, since we'll use
+			 * that estimate to distribute working memory to this side node,
+			 * at runtime.
+			 */
+
+			/* Estimated memory needed to hold the output: */
+			agg_plan->plan.workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width, cost_sort);
+
+			/* Also include estimated memory needed to sort the input: */
+			if (agg_plan->numSorts > 0)
+			{
+				agg_plan->sortWorkMem =
+					compute_agg_input_workmem(subplan->plan_rows,
+											  subplan->plan_width);
+			}
+
+			chain = lappend(chain, (Plan *) agg_plan);
 		}
 	}
 
@@ -2514,6 +2580,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2529,12 +2596,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 						extract_grouping_collations(rollup->groupClause, subplan->targetlist),
 						rollup->gsets,
 						chain,
+						best_path->numSorts,
 						rollup->numGroups,
 						best_path->transitionSpace,
 						subplan);
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * If we're an AGG_SORTED, but not the last, we need to cost working
+		 * memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates.
+		 */
+		plan->plan.workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows, subplan->plan_width,
+									   cost_sort);
+
+		/* Also include estimated memory needed to sort the input: */
+		if (plan->numSorts > 0)
+		{
+			plan->sortWorkMem =
+				compute_agg_input_workmem(subplan->plan_rows,
+										  subplan->plan_width);
+		}
 	}
 
 	return (Plan *) plan;
@@ -2783,6 +2875,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(leftplan->plan_rows,
+												 leftplan->plan_width),
+							  relation_byte_size(rightplan->plan_rows,
+												 rightplan->plan_width)));
+
+	if (plan->numCols > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		entrysize;
+
+		entrysize = sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMem =
+			normalize_workmem(plan->numGroups * entrysize);
+	}
+
 	return plan;
 }
 
@@ -3223,6 +3347,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3309,6 +3434,12 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&scan_plan->scan.plan, &best_path->path);
 
+	/*
+	 * We assigned "workmem" to the "bitmapqualplan" subplan. Clear it from
+	 * the top-level BitmapHeapScan node, to avoid double-counting.
+	 */
+	scan_plan->scan.plan.workmem = 0;
+
 	return scan_plan;
 }
 
@@ -3334,9 +3465,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its ancestors'
+	 * results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3362,6 +3508,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3373,8 +3521,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3409,6 +3556,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3437,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3484,8 +3632,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->workmem =
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
@@ -3796,6 +3945,14 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem =
+		normalize_workmem(relation_byte_size(scan_plan->scan.plan.plan_rows,
+											 scan_plan->scan.plan.plan_width));
+
 	return scan_plan;
 }
 
@@ -4615,6 +4772,9 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
+		matplan->workmem =
+			normalize_workmem(relation_byte_size(matplan->plan_rows,
+												 matplan->plan_width));
 
 		inner_plan = matplan;
 	}
@@ -4961,6 +5121,10 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Display "workmem" on the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem = join_plan->join.plan.workmem;
+	join_plan->join.plan.workmem = 0;
+
 	return join_plan;
 }
 
@@ -5458,6 +5622,7 @@ copy_generic_path_info(Plan *dest, Path *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = (int) Min(src->workmem, (double) MAX_KILOBYTES);
 	dest->plan_rows = src->rows;
 	dest->plan_width = src->pathtarget->width;
 	dest->parallel_aware = src->parallel_aware;
@@ -5474,6 +5639,7 @@ copy_plan_costsize(Plan *dest, Plan *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = src->workmem;
 	dest->plan_rows = src->plan_rows;
 	dest->plan_width = src->plan_width;
 	/* Assume the inserted node is not parallel-aware. */
@@ -5509,6 +5675,7 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 			  limit_tuples);
 	plan->plan.startup_cost = sort_path.startup_cost;
 	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.workmem = (int) Min(sort_path.workmem, (double) MAX_KILOBYTES);
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5540,6 +5707,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 						  limit_tuples);
 	plan->sort.plan.startup_cost = sort_path.startup_cost;
 	plan->sort.plan.total_cost = sort_path.total_cost;
+	plan->sort.plan.workmem = (int) Min(sort_path.workmem,
+										(double) MAX_KILOBYTES);
 	plan->sort.plan.plan_rows = lefttree->plan_rows;
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
@@ -6673,7 +6842,7 @@ Agg *
 make_agg(List *tlist, List *qual,
 		 AggStrategy aggstrategy, AggSplit aggsplit,
 		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-		 List *groupingSets, List *chain, double dNumGroups,
+		 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 		 Size transitionSpace, Plan *lefttree)
 {
 	Agg		   *node = makeNode(Agg);
@@ -6689,6 +6858,8 @@ make_agg(List *tlist, List *qual,
 	node->grpColIdx = grpColIdx;
 	node->grpOperators = grpOperators;
 	node->grpCollations = grpCollations;
+	node->numSorts = numSorts;
+	node->sortWorkMem = 0;		/* caller will fill this */
 	node->numGroups = numGroups;
 	node->transitionSpace = transitionSpace;
 	node->aggParams = NULL;		/* SS_finalize_plan() will fill this */
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..3eba364484d 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSorts;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..c533bfb9a58 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem to the hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_workmem(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSorts = aggcosts ? aggcosts->numSorts : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSorts = agg_costs ? agg_costs->numSorts : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_workmem(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 570e7cad1fa..50454952eb2 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,8 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes planned to use */
+	double		total_workmem;	/* total working memory estimate (in bytes) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..fc5b20994dd 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -62,7 +62,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..17eb6b52579 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSorts;		/* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -1697,6 +1698,13 @@ typedef struct Path
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
 
+	/*
+	 * NOTE: The Path's workmem is a double, rather than an int, because it
+	 * sometimes combines multiple working-memory estimates (e.g., for
+	 * GroupingSetsPath).
+	 */
+	Cost		workmem;		/* estimated work_mem (in KB) */
+
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
 } Path;
@@ -2290,6 +2298,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSorts;		/* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2331,6 +2340,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSorts;		/* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3374,6 +3384,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..67da7f091b5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -168,6 +168,8 @@ typedef struct Plan
 	/* total cost (assuming all tuples fetched) */
 	Cost		total_cost;
 
+	int			workmem;		/* estimated work_mem (in KB) */
+
 	/*
 	 * planner's estimate of result size of this plan step
 	 */
@@ -426,6 +428,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* estimated work_mem for hash table (in KB) */
+	int			hashWorkMem;
 } RecursiveUnion;
 
 /* ----------------
@@ -1145,6 +1150,12 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
+
+	/* estimated work_mem needed to sort each input (in KB) */
+	int			sortWorkMem;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 839e71d52f4..b7d6b0fe7dc 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,8 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
+	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
 } SubPlan;
 
 /*
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index a6ffeac90be..df8e7de9dc2 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -85,6 +85,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..737c553a409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,17 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_workmem(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..cf3694a744f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
 extern Agg *make_agg(List *tlist, List *qual,
 					 AggStrategy aggstrategy, AggSplit aggsplit,
 					 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-					 List *groupingSets, List *chain, double dNumGroups,
+					 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 					 Size transitionSpace, Plan *lefttree);
 extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
 						 LimitOption limitOption, int uniqNumCols,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..215180808f4
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,631 @@
+----
+-- Tests that show "work_mem" output in EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(10 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+             workmem_filter              
+-----------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory: N kB
+(13 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                       workmem_filter                        
+-------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory: N kB
+(14 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+(9 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+(17 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+         QUERY PLAN         
+----------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory: 0 kB
+(3 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                      workmem_filter                       
+-----------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                     workmem_filter                      
+---------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                      workmem_filter                      
+----------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+ Total Working Memory: N kB
+(4 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory: 0 kB
+(5 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory: N kB
+(5 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                       workmem_filter                       
+------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory: N kB
+(14 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB)
+   ->  CTE Scan on q1  (work_mem=N kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory: N kB
+(11 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(6 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory: N kB
+(19 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+       workmem_filter       
+----------------------------
+ Result  (work_mem=N kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+(6 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+(9 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..5878f2aa4c4
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,303 @@
+----
+-- Tests that show "work_mem" output in EXPLAIN plans.
+----
+
+-- Note: this function is derived from the one in explain.sql. We can't use
+-- that function directly, since this test runs in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

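As an aside, the masking that the workmem_filter() helper above performs is easy to sketch outside plpgsql. This is an illustrative Python equivalent of the two regexp_replace() calls (PostgreSQL's \m/\M word-boundary escapes map to Python's \b); it is a sketch of the test helper's behavior, not part of the patch:

```python
import re

def mask_workmem(line: str) -> str:
    """Mask volatile memory figures in one EXPLAIN output line, mirroring
    the two regexp_replace() calls in the workmem_filter() test helper."""
    # "work_mem=1234" -> "work_mem=N"
    line = re.sub(r'\bwork_mem=\d+\b', 'work_mem=N', line)
    # "Memory: 1234" -> "Memory: N"
    line = re.sub(r'\bMemory: \d+\b', 'Memory: N', line)
    return line

print(mask_workmem(' Sort  (work_mem=1024 kB)'))
```

Running the EXPLAIN output through this filter is what makes the expected output stable across machines, since the estimated kB figures vary with build options and data alignment.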
Attachment: v02_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch (application/octet-stream)
From ea57eb88096287fe55251903081adced4d1f3bc4 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Thu, 20 Feb 2025 17:33:48 +0000
Subject: [PATCH 2/4] Store non-init-plan SubPlan objects in Plan list

Currently, a Plan tracks its SubPlan objects in two ways: init plans appear
in the plan->initPlan list, while regular sub plans are reachable only
through whatever expression contains the SubPlan.

A SubPlan object can itself use working memory, if it uses a hash table.
This hash table is associated with the SubPlan itself, and not with the
Plan to which the SubPlan points.

To let us assign working memory to an individual SubPlan, this commit
stores a link to each regular SubPlan in a new plan->subPlan list,
populated when we finalize the (parent) Plan whose expression contains
that SubPlan.

Unlike the existing plan->initPlan list, the new plan->subPlan list will
not be used to initialize SubPlan nodes -- that still happens when we
initialize the expression that contains the SubPlan. Instead, we will use
it during InitPlan(), but before ExecInitNode(), to assign a working-memory
limit to each SubPlan.
---
 src/backend/optimizer/plan/setrefs.c | 284 +++++++++++++++++----------
 src/include/nodes/plannodes.h        |   2 +
 2 files changed, 177 insertions(+), 109 deletions(-)
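
The bookkeeping the commit message describes can be modeled in a toy sketch. These are hypothetical Python classes, not PostgreSQL's real Plan/SubPlan structures; the sketch only shows the shape of the idea: regular SubPlans live inside expression trees, so a per-Plan "subPlan" list is filled in while expressions are finalized, letting a later pass assign working memory without re-walking the expressions:

```python
# Toy model of the new plan->subPlan bookkeeping. All names here are
# illustrative; they do not match PostgreSQL's actual structs.

class SubPlan:
    def __init__(self, name, uses_hash_table):
        self.name = name
        self.uses_hash_table = uses_hash_table
        self.work_mem_kb = None   # assigned later, before node init

class Plan:
    def __init__(self, qual_exprs):
        self.qual_exprs = qual_exprs  # expression nodes, possibly SubPlans
        self.subPlan = []             # filled during "finalization"

def finalize_plan(plan):
    # Analogous to setrefs.c recording each regular SubPlan it
    # encounters while fixing up the parent Plan's expressions.
    for node in plan.qual_exprs:
        if isinstance(node, SubPlan):
            plan.subPlan.append(node)

def assign_working_memory(plan, limit_kb):
    # Analogous to the pass run during InitPlan(), before ExecInitNode():
    # only hash-table SubPlans actually consume working memory.
    for sp in plan.subPlan:
        sp.work_mem_kb = limit_kb if sp.uses_hash_table else 0

plan = Plan([SubPlan("hashed SubPlan 2", True), "some_other_expr"])
finalize_plan(plan)
assign_working_memory(plan, 4096)
```

The point of keeping the list on the Plan, rather than re-walking expressions at assignment time, is that the working-memory pass can run once over plan->subPlan without knowing anything about expression tree shapes.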

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 999a5a8ab5a..8a4e77baa90 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -58,6 +58,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	int			rtoffset;
 	double		num_exec;
 } fix_scan_expr_context;
@@ -65,6 +66,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *outer_itlist;
 	indexed_tlist *inner_itlist;
 	Index		acceptable_rel;
@@ -76,6 +78,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *subplan_itlist;
 	int			newvarno;
 	int			rtoffset;
@@ -127,8 +130,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, plan, lst, rtoffset, num_exec) \
+	((List *) fix_scan_expr(root, plan, (Node *) (lst), rtoffset, num_exec))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -157,7 +160,7 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 										int rtoffset);
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
-static Node *fix_scan_expr(PlannerInfo *root, Node *node,
+static Node *fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node,
 						   int rtoffset, double num_exec);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
@@ -183,7 +186,7 @@ static Var *search_indexed_tlist_for_sortgroupref(Expr *node,
 												  Index sortgroupref,
 												  indexed_tlist *itlist,
 												  int newvarno);
-static List *fix_join_expr(PlannerInfo *root,
+static List *fix_join_expr(PlannerInfo *root, Plan *plan,
 						   List *clauses,
 						   indexed_tlist *outer_itlist,
 						   indexed_tlist *inner_itlist,
@@ -193,7 +196,7 @@ static List *fix_join_expr(PlannerInfo *root,
 						   double num_exec);
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
-static Node *fix_upper_expr(PlannerInfo *root,
+static Node *fix_upper_expr(PlannerInfo *root, Plan *plan,
 							Node *node,
 							indexed_tlist *subplan_itlist,
 							int newvarno,
@@ -202,7 +205,7 @@ static Node *fix_upper_expr(PlannerInfo *root,
 							double num_exec);
 static Node *fix_upper_expr_mutator(Node *node,
 									fix_upper_expr_context *context);
-static List *set_returning_clause_references(PlannerInfo *root,
+static List *set_returning_clause_references(PlannerInfo *root, Plan *plan,
 											 List *rlist,
 											 Plan *topplan,
 											 Index resultRelation,
@@ -633,10 +636,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -646,13 +649,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablesample = (TableSampleClause *)
-					fix_scan_expr(root, (Node *) splan->tablesample,
+					fix_scan_expr(root, plan, (Node *) splan->tablesample,
 								  rtoffset, 1);
 			}
 			break;
@@ -662,22 +665,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual,
+					fix_scan_list(root, plan, splan->indexqual,
 								  rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexorderby =
-					fix_scan_list(root, splan->indexorderby,
+					fix_scan_list(root, plan, splan->indexorderby,
 								  rtoffset, 1);
 				splan->indexorderbyorig =
-					fix_scan_list(root, splan->indexorderbyorig,
+					fix_scan_list(root, plan, splan->indexorderbyorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -697,9 +700,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, plan, splan->indexqual, rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -709,13 +712,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->bitmapqualorig =
-					fix_scan_list(root, splan->bitmapqualorig,
+					fix_scan_list(root, plan, splan->bitmapqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -725,13 +728,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidquals =
-					fix_scan_list(root, splan->tidquals,
+					fix_scan_list(root, plan, splan->tidquals,
 								  rtoffset, 1);
 			}
 			break;
@@ -741,13 +744,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidrangequals =
-					fix_scan_list(root, splan->tidrangequals,
+					fix_scan_list(root, plan, splan->tidrangequals,
 								  rtoffset, 1);
 			}
 			break;
@@ -762,13 +765,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, plan, splan->functions, rtoffset, 1);
 			}
 			break;
 		case T_TableFuncScan:
@@ -777,13 +780,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablefunc = (TableFunc *)
-					fix_scan_expr(root, (Node *) splan->tablefunc,
+					fix_scan_expr(root, plan, (Node *) splan->tablefunc,
 								  rtoffset, 1);
 			}
 			break;
@@ -793,13 +796,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->values_lists =
-					fix_scan_list(root, splan->values_lists,
+					fix_scan_list(root, plan, splan->values_lists,
 								  rtoffset, 1);
 			}
 			break;
@@ -809,10 +812,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -822,10 +825,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -835,10 +838,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -877,7 +880,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 */
 				set_dummy_tlist_references(plan, rtoffset);
 
-				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
+				mplan->param_exprs = fix_scan_list(root, plan, mplan->param_exprs,
 												   rtoffset,
 												   NUM_EXEC_TLIST(plan));
 				break;
@@ -939,9 +942,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitOffset, rtoffset, 1);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitCount, rtoffset, 1);
 			}
 			break;
 		case T_Agg:
@@ -994,14 +997,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
-				wplan->runCondition = fix_scan_list(root,
+					fix_scan_expr(root, plan, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root, plan,
 													wplan->runCondition,
 													rtoffset,
 													NUM_EXEC_TLIST(plan));
-				wplan->runConditionOrig = fix_scan_list(root,
+				wplan->runConditionOrig = fix_scan_list(root, plan,
 														wplan->runConditionOrig,
 														rtoffset,
 														NUM_EXEC_TLIST(plan));
@@ -1043,15 +1046,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					}
 
 					splan->plan.targetlist =
-						fix_scan_list(root, splan->plan.targetlist,
+						fix_scan_list(root, plan, splan->plan.targetlist,
 									  rtoffset, NUM_EXEC_TLIST(plan));
 					splan->plan.qual =
-						fix_scan_list(root, splan->plan.qual,
+						fix_scan_list(root, plan, splan->plan.qual,
 									  rtoffset, NUM_EXEC_QUAL(plan));
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->resconstantqual, rtoffset,
+								  1);
 			}
 			break;
 		case T_ProjectSet:
@@ -1066,7 +1070,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->withCheckOptionLists =
-					fix_scan_list(root, splan->withCheckOptionLists,
+					fix_scan_list(root, plan, splan->withCheckOptionLists,
 								  rtoffset, 1);
 
 				if (splan->returningLists)
@@ -1086,7 +1090,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						List	   *rlist = (List *) lfirst(lcrl);
 						Index		resultrel = lfirst_int(lcrr);
 
-						rlist = set_returning_clause_references(root,
+						rlist = set_returning_clause_references(root, plan,
 																rlist,
 																subplan,
 																resultrel,
@@ -1121,13 +1125,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					itlist = build_tlist_index(splan->exclRelTlist);
 
 					splan->onConflictSet =
-						fix_join_expr(root, splan->onConflictSet,
+						fix_join_expr(root, plan, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
 
 					splan->onConflictWhere = (Node *)
-						fix_join_expr(root, (List *) splan->onConflictWhere,
+						fix_join_expr(root, plan, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
@@ -1135,7 +1139,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, plan, splan->exclRelTlist, rtoffset, 1);
 				}
 
 				/*
@@ -1186,7 +1190,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 							MergeAction *action = (MergeAction *) lfirst(l);
 
 							/* Fix targetList of each action. */
-							action->targetList = fix_join_expr(root,
+							action->targetList = fix_join_expr(root, plan,
 															   action->targetList,
 															   NULL, itlist,
 															   resultrel,
@@ -1195,7 +1199,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   NUM_EXEC_TLIST(plan));
 
 							/* Fix quals too. */
-							action->qual = (Node *) fix_join_expr(root,
+							action->qual = (Node *) fix_join_expr(root, plan,
 																  (List *) action->qual,
 																  NULL, itlist,
 																  resultrel,
@@ -1206,7 +1210,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 						/* Fix join condition too. */
 						mergeJoinCondition = (Node *)
-							fix_join_expr(root,
+							fix_join_expr(root, plan,
 										  (List *) mergeJoinCondition,
 										  NULL, itlist,
 										  resultrel,
@@ -1353,7 +1357,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 
 	plan->scan.scanrelid += rtoffset;
 	plan->scan.plan.targetlist = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.targetlist,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1361,7 +1365,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_TLIST((Plan *) plan));
 	plan->scan.plan.qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.qual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1369,7 +1373,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	plan->recheckqual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->recheckqual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1377,13 +1381,13 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
-	plan->indexqual = fix_scan_list(root, plan->indexqual,
+	plan->indexqual = fix_scan_list(root, (Plan *) plan, plan->indexqual,
 									rtoffset, 1);
 	/* indexorderby is already transformed to reference index columns */
-	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
+	plan->indexorderby = fix_scan_list(root, (Plan *) plan, plan->indexorderby,
 									   rtoffset, 1);
 	/* indextlist must NOT be transformed to reference index columns */
-	plan->indextlist = fix_scan_list(root, plan->indextlist,
+	plan->indextlist = fix_scan_list(root, (Plan *) plan, plan->indextlist,
 									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 
 	pfree(index_itlist);
@@ -1430,10 +1434,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		 */
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
-			fix_scan_list(root, plan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 		plan->scan.plan.qual =
-			fix_scan_list(root, plan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
 
 		result = (Plan *) plan;
@@ -1599,7 +1603,7 @@ set_foreignscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(fscan->fdw_scan_tlist);
 
 		fscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1607,7 +1611,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1615,7 +1619,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1623,7 +1627,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_recheck_quals,
 						   itlist,
 						   INDEX_VAR,
@@ -1633,7 +1637,7 @@ set_foreignscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
-			fix_scan_list(root, fscan->fdw_scan_tlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 	}
 	else
@@ -1643,16 +1647,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 * way
 		 */
 		fscan->scan.plan.targetlist =
-			fix_scan_list(root, fscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual =
-			fix_scan_list(root, fscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs =
-			fix_scan_list(root, fscan->fdw_exprs,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals =
-			fix_scan_list(root, fscan->fdw_recheck_quals,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_recheck_quals,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 	}
 
@@ -1685,7 +1689,7 @@ set_customscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(cscan->custom_scan_tlist);
 
 		cscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1693,7 +1697,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1701,7 +1705,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->custom_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1711,20 +1715,20 @@ set_customscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
-			fix_scan_list(root, cscan->custom_scan_tlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
-			fix_scan_list(root, cscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual =
-			fix_scan_list(root, cscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs =
-			fix_scan_list(root, cscan->custom_exprs,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 	}
 
@@ -1752,7 +1756,8 @@ set_customscan_references(PlannerInfo *root,
  * startup time.
  */
 static int
-register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
+register_partpruneinfo(PlannerInfo *root, Plan *plan, int part_prune_index,
+					   int rtoffset)
 {
 	PlannerGlobal *glob = root->glob;
 	PartitionPruneInfo *pinfo;
@@ -1776,10 +1781,10 @@ register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
 
 			prelinfo->rtindex += rtoffset;
 			prelinfo->initial_pruning_steps =
-				fix_scan_list(root, prelinfo->initial_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->initial_pruning_steps,
 							  rtoffset, 1);
 			prelinfo->exec_pruning_steps =
-				fix_scan_list(root, prelinfo->exec_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->exec_pruning_steps,
 							  rtoffset, 1);
 
 			for (i = 0; i < prelinfo->nparts; i++)
@@ -1863,7 +1868,8 @@ set_append_references(PlannerInfo *root,
 	 */
 	if (aplan->part_prune_index >= 0)
 		aplan->part_prune_index =
-			register_partpruneinfo(root, aplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) aplan,
+								   aplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(aplan->plan.lefttree == NULL);
@@ -1931,7 +1937,8 @@ set_mergeappend_references(PlannerInfo *root,
 	 */
 	if (mplan->part_prune_index >= 0)
 		mplan->part_prune_index =
-			register_partpruneinfo(root, mplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) mplan,
+								   mplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(mplan->plan.lefttree == NULL);
@@ -1958,7 +1965,7 @@ set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	 */
 	outer_itlist = build_tlist_index(outer_plan->targetlist);
 	hplan->hashkeys = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) hplan->hashkeys,
 					   outer_itlist,
 					   OUTER_VAR,
@@ -2194,7 +2201,8 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * replacing Aggref nodes that should be replaced by initplan output Params,
  * choosing the best implementation for AlternativeSubPlans,
  * looking up operator opcode info for OpExpr and related nodes,
- * and adding OIDs from regclass Const nodes into root->glob->relationOids.
+ * adding OIDs from regclass Const nodes into root->glob->relationOids, and
+ * recording Subplans that use hash tables.
  *
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
@@ -2204,11 +2212,13 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node, int rtoffset,
+			  double num_exec)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
 
@@ -2299,8 +2309,21 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 															 (AlternativeSubPlan *) node,
 															 context->num_exec),
 									 context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_scan_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 static bool
@@ -2312,6 +2335,17 @@ fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this SubPlan so that we can assign working memory to it (if
+		 * needed).
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
 	return expression_tree_walker(node, fix_scan_expr_walker, context);
 }
 
@@ -2341,7 +2375,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * NestLoopParams now, because those couldn't refer to nullable
 	 * subexpressions.
 	 */
-	join->joinqual = fix_join_expr(root,
+	join->joinqual = fix_join_expr(root, (Plan *) join,
 								   join->joinqual,
 								   outer_itlist,
 								   inner_itlist,
@@ -2371,7 +2405,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 			 * make things match up perfectly seems well out of proportion to
 			 * the value.
 			 */
-			nlp->paramval = (Var *) fix_upper_expr(root,
+			nlp->paramval = (Var *) fix_upper_expr(root, (Plan *) join,
 												   (Node *) nlp->paramval,
 												   outer_itlist,
 												   OUTER_VAR,
@@ -2388,7 +2422,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		MergeJoin  *mj = (MergeJoin *) join;
 
-		mj->mergeclauses = fix_join_expr(root,
+		mj->mergeclauses = fix_join_expr(root, (Plan *) join,
 										 mj->mergeclauses,
 										 outer_itlist,
 										 inner_itlist,
@@ -2401,7 +2435,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		HashJoin   *hj = (HashJoin *) join;
 
-		hj->hashclauses = fix_join_expr(root,
+		hj->hashclauses = fix_join_expr(root, (Plan *) join,
 										hj->hashclauses,
 										outer_itlist,
 										inner_itlist,
@@ -2414,7 +2448,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 		 * HashJoin's hashkeys are used to look for matching tuples from its
 		 * outer plan (not the Hash node!) in the hashtable.
 		 */
-		hj->hashkeys = (List *) fix_upper_expr(root,
+		hj->hashkeys = (List *) fix_upper_expr(root, (Plan *) join,
 											   (Node *) hj->hashkeys,
 											   outer_itlist,
 											   OUTER_VAR,
@@ -2433,7 +2467,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * be, so we just tell fix_join_expr to accept superset nullingrels
 	 * matches instead of exact ones.
 	 */
-	join->plan.targetlist = fix_join_expr(root,
+	join->plan.targetlist = fix_join_expr(root, (Plan *) join,
 										  join->plan.targetlist,
 										  outer_itlist,
 										  inner_itlist,
@@ -2441,7 +2475,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
 										  NUM_EXEC_TLIST((Plan *) join));
-	join->plan.qual = fix_join_expr(root,
+	join->plan.qual = fix_join_expr(root, (Plan *) join,
 									join->plan.qual,
 									outer_itlist,
 									inner_itlist,
@@ -2519,7 +2553,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 													  subplan_itlist,
 													  OUTER_VAR);
 			if (!newexpr)
-				newexpr = fix_upper_expr(root,
+				newexpr = fix_upper_expr(root, plan,
 										 (Node *) tle->expr,
 										 subplan_itlist,
 										 OUTER_VAR,
@@ -2528,7 +2562,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 										 NUM_EXEC_TLIST(plan));
 		}
 		else
-			newexpr = fix_upper_expr(root,
+			newexpr = fix_upper_expr(root, plan,
 									 (Node *) tle->expr,
 									 subplan_itlist,
 									 OUTER_VAR,
@@ -2542,7 +2576,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	plan->targetlist = output_targetlist;
 
 	plan->qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) plan->qual,
 					   subplan_itlist,
 					   OUTER_VAR,
@@ -3081,6 +3115,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *    the source relation elements, outer_itlist = NULL and acceptable_rel
  *    the target relation.
  *
+ * 'plan' is the Plan node to which the clauses belong
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
  *		or NULL
@@ -3097,6 +3132,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  */
 static List *
 fix_join_expr(PlannerInfo *root,
+			  Plan *plan,
 			  List *clauses,
 			  indexed_tlist *outer_itlist,
 			  indexed_tlist *inner_itlist,
@@ -3108,6 +3144,7 @@ fix_join_expr(PlannerInfo *root,
 	fix_join_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.outer_itlist = outer_itlist;
 	context.inner_itlist = inner_itlist;
 	context.acceptable_rel = acceptable_rel;
@@ -3234,7 +3271,19 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 															 context->num_exec),
 									 context);
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_join_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_join_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3258,6 +3307,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * expensive, so we don't want to try it in the common case where the
  * subplan tlist is just a flattened list of Vars.)
  *
+ * 'plan': the Plan node to which the expression belongs
  * 'node': the tree to be fixed (a target item or qual)
  * 'subplan_itlist': indexed target list for subplan (or index)
  * 'newvarno': varno to use for Vars referencing tlist elements
@@ -3271,6 +3321,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  */
 static Node *
 fix_upper_expr(PlannerInfo *root,
+			   Plan *plan,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
 			   int newvarno,
@@ -3281,6 +3332,7 @@ fix_upper_expr(PlannerInfo *root,
 	fix_upper_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.subplan_itlist = subplan_itlist;
 	context.newvarno = newvarno;
 	context.rtoffset = rtoffset;
@@ -3358,8 +3410,21 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
 															  (AlternativeSubPlan *) node,
 															  context->num_exec),
 									  context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_upper_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_upper_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3377,9 +3442,10 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * We also must perform opcode lookup and add regclass OIDs to
  * root->glob->relationOids.
  *
+ * 'plan': the ModifyTable node itself
  * 'rlist': the RETURNING targetlist to be fixed
  * 'topplan': the top subplan node that will be just below the ModifyTable
- *		node (note it's not yet passed through set_plan_refs)
+ *		node
  * 'resultRelation': RT index of the associated result relation
  * 'rtoffset': how much to increment varnos by
  *
@@ -3391,7 +3457,7 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * Note: resultRelation is not yet adjusted by rtoffset.
  */
 static List *
-set_returning_clause_references(PlannerInfo *root,
+set_returning_clause_references(PlannerInfo *root, Plan *plan,
 								List *rlist,
 								Plan *topplan,
 								Index resultRelation,
@@ -3415,7 +3481,7 @@ set_returning_clause_references(PlannerInfo *root,
 	 */
 	itlist = build_tlist_index_other_vars(topplan->targetlist, resultRelation);
 
-	rlist = fix_join_expr(root,
+	rlist = fix_join_expr(root, plan,
 						  rlist,
 						  itlist,
 						  NULL,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 67da7f091b5..d3f8fd7bd6c 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -206,6 +206,8 @@ typedef struct Plan
 	struct Plan *righttree;
 	/* Init Plan nodes (un-correlated expr subselects) */
 	List	   *initPlan;
+	/* Regular Sub Plan nodes (cf. "initPlan", above) */
+	List	   *subPlan;
 
 	/*
 	 * Information for management of parameter-change-driven rescanning
-- 
2.47.1

Attachment: v02_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch (application/octet-stream)
From a7a8eeeb2ccebd765b704ff2e86f7769cd359531 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:16:22 +0000
Subject: [PATCH 3/4] EXPLAIN (WORK_MEM ON) now shows working memory limit

This commit moves the working-memory limit that an executor node checks, at
runtime, from the "work_mem" and "hash_mem_multiplier" GUCs, to a new
field, "workmem_limit", added to the Plan node. To preserve backward
compatibility, it also copies the values from these GUCs into the new
field.

This field is on the Plan node, instead of the PlanState, because it needs
to be set before we can call ExecInitNode(): many PlanStates consult their
working-memory limit while creating their data structures, during
initialization. So the field lives on the Plan node, but is set between the
planning and execution phases.

Also modifies "EXPLAIN (WORK_MEM ON)" so that it displays this
working-memory limit.
---
 src/backend/commands/explain.c             |  59 ++++-
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         | 281 +++++++++++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  69 +++--
 src/backend/executor/nodeBitmapIndexscan.c |   3 +-
 src/backend/executor/nodeBitmapOr.c        |   3 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  23 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  12 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |   2 +
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  15 +-
 src/include/commands/explain.h             |   1 +
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   5 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/plannodes.h              |   8 +-
 src/include/nodes/primnodes.h              |   2 +
 src/test/regress/expected/workmem.out      | 184 ++++++++------
 31 files changed, 577 insertions(+), 151 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e09d7f868c9..07c6d34764b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,8 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
-static void compute_subplan_workmem(List *plans, double *workmem);
-static void compute_agg_workmem(Agg *agg, double *workmem);
+static void compute_subplan_workmem(List *plans, double *workmem, double *limit);
+static void compute_agg_workmem(Agg *agg, double *workmem, double *limit);
 
 
 
@@ -843,6 +843,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 	{
 		ExplainPropertyFloat("Total Working Memory", "kB",
 							 es->total_workmem, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
 	}
 
 	ExplainCloseGroup("Query", NULL, true, es);
@@ -1983,19 +1985,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->work_mem)
 	{
 		double		plan_workmem = 0.0;
+		double		plan_limit = 0.0;
 
 		/*
 		 * Include working memory used by this Plan's SubPlan objects, whether
 		 * they are included on the Plan's initPlan or subPlan lists.
 		 */
-		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
-		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem, &plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem, &plan_limit);
 
 		/* Include working memory used by this Plan, itself. */
 		switch (nodeTag(plan))
 		{
 			case T_Agg:
-				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				compute_agg_workmem((Agg *) plan, &plan_workmem, &plan_limit);
 				break;
 			case T_FunctionScan:
 				{
@@ -2003,6 +2006,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 					plan_workmem += (double) plan->workmem *
 						list_length(fscan->functions);
+					plan_limit += (double) plan->workmem_limit *
+						list_length(fscan->functions);
 					break;
 				}
 			case T_IncrementalSort:
@@ -2011,7 +2016,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				 * IncrementalSort creates two Tuplestores, each of
 				 * (estimated) size workmem.
 				 */
-				plan_workmem = (double) plan->workmem * 2;
+				plan_workmem += (double) plan->workmem * 2;
+				plan_limit += (double) plan->workmem_limit * 2;
 				break;
 			case T_RecursiveUnion:
 				{
@@ -2024,11 +2030,15 @@ ExplainNode(PlanState *planstate, List *ancestors,
 					 */
 					plan_workmem += (double) plan->workmem * 2 +
 						runion->hashWorkMem;
+					plan_limit += (double) plan->workmem_limit * 2 +
+						runion->hashWorkMemLimit;
 					break;
 				}
 			default:
 				if (plan->workmem > 0)
 					plan_workmem += plan->workmem;
+				if (plan->workmem_limit > 0)
+					plan_limit += plan->workmem_limit;
 				break;
 		}
 
@@ -2037,17 +2047,23 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		 * working memory.
 		 */
 		plan_workmem *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
 
 		es->total_workmem += plan_workmem;
+		es->total_workmem_limit += plan_limit;
 
-		if (plan_workmem > 0.0)
+		if (plan_workmem > 0.0 || plan_limit > 0.0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
-								 plan_workmem);
+				appendStringInfo(es->str, "  (work_mem=%.0f kB limit=%.0f kB)",
+								 plan_workmem, plan_limit);
 			else
+			{
 				ExplainPropertyFloat("Working Memory", "kB",
 									 plan_workmem, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
 		}
 	}
 
@@ -6062,29 +6078,39 @@ GetSerializationMetrics(DestReceiver *dest)
  * increments work_mem counters to include the SubPlan's working-memory.
  */
 static void
-compute_subplan_workmem(List *plans, double *workmem)
+compute_subplan_workmem(List *plans, double *workmem, double *limit)
 {
 	foreach_node(SubPlanState, sps, plans)
 	{
 		SubPlan    *sp = sps->subplan;
 
 		if (sp->hashtab_workmem > 0)
+		{
 			*workmem += sp->hashtab_workmem;
+			*limit += sp->hashtab_workmem_limit;
+		}
 
 		if (sp->hashnul_workmem > 0)
+		{
 			*workmem += sp->hashnul_workmem;
+			*limit += sp->hashnul_workmem_limit;
+		}
 	}
 }
 
-/* Compute an Agg's working memory estimate. */
+/* Compute an Agg's working memory estimate and limit. */
 typedef struct AggWorkMem
 {
 	double		input_sort_workmem;
+	double		input_sort_limit;
 
 	double		output_hash_workmem;
+	double		output_hash_limit;
 
 	int			num_sort_nodes;
+
 	double		max_output_sort_workmem;
+	double		output_sort_limit;
 }			AggWorkMem;
 
 static void
@@ -6092,6 +6118,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 {
 	/* Record memory used for input sort buffers. */
 	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+	mem->input_sort_limit += (double) agg->numSorts * agg->sortWorkMemLimit;
 
 	/* Record memory used for output data structures. */
 	switch (agg->aggstrategy)
@@ -6102,6 +6129,9 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			mem->max_output_sort_workmem =
 				Max(mem->max_output_sort_workmem, agg->plan.workmem);
 
+			if (mem->output_sort_limit == 0)
+				mem->output_sort_limit = agg->plan.workmem_limit;
+
 			++mem->num_sort_nodes;
 			break;
 		case AGG_HASHED:
@@ -6112,6 +6142,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			 * lifetime of the Agg.
 			 */
 			mem->output_hash_workmem += agg->plan.workmem;
+			mem->output_hash_limit += agg->plan.workmem_limit;
 			break;
 		default:
 
@@ -6135,7 +6166,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
  * value on the main Agg node.
  */
 static void
-compute_agg_workmem(Agg *agg, double *workmem)
+compute_agg_workmem(Agg *agg, double *workmem, double *limit)
 {
 	AggWorkMem	mem;
 	ListCell   *lc;
@@ -6153,9 +6184,13 @@ compute_agg_workmem(Agg *agg, double *workmem)
 	}
 
 	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+	*limit = mem.input_sort_limit + mem.output_hash_limit;
 
 	/* We'll have at most two sort buffers alive, at any time. */
 	*workmem += mem.num_sort_nodes > 2 ?
 		mem.max_output_sort_workmem * 2.0 :
 		mem.max_output_sort_workmem;
+	*limit += mem.num_sort_nodes > 2 ?
+		mem.output_sort_limit * 2.0 :
+		mem.output_sort_limit;
 }
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..c513b90fc77
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		what working-memory limit was imposed on it.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off its
+ *		corresponding Plan node's "workmem_limit" field. And we populate that
+ *		field by calling ExecAssignWorkMem(), from InitPlan(), before we
+ *		initialize the PlanState nodes.
+ *
+ * 		The "workmem_limit" field is a limit "per data structure," rather than
+ *		"per PlanState". This is needed because some SQL operators (e.g.,
+ *		RecursiveUnion and Agg) require multiple data structures, and sometimes
+ *		the data structures don't all share the same memory requirement. So we
+ *		cannot always just divide a "per PlanState" limit among individual data
+ *		structures. Instead, we maintain the limits on the data structures (and
+ *		EXPLAIN, for example, sums them up into a single, human-readable
+ *		number).
+ *
+ *		Note that the *Path's* "workmem" estimate is per SQL operator, but when
+ *		we convert that Path to a Plan we also break its "workmem" estimate
+ *		down into per-data structure estimates. Some operators therefore
+ *		require additional "limit" fields, which we add to the corresponding
+ *		Plan.
+ *
+ *		We store the "workmem_limit" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "limit" fields on the Plan, but
+ *		set them when we start execution, as part of InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+
+
+/* decls for local routines only used within this module */
+static void assign_workmem_subplan(SubPlan *subplan);
+static void assign_workmem_plan(Plan *plan);
+static void assign_workmem_agg(Agg *agg);
+static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+									bool *is_first_sort);
+
+/* end of local decls */
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Recursively assigns working memory to any Plans or SubPlans that need
+ *		it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	/* Assign working memory to the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			assign_workmem_plan(plan);
+	}
+
+	/* And assign working memory to the main Plan tree. */
+	assign_workmem_plan(plannedstmt->planTree);
+}
+
+static void
+assign_workmem_subplan(SubPlan *subplan)
+{
+	subplan->hashtab_workmem_limit = subplan->useHashTable ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+
+	subplan->hashnul_workmem_limit =
+		subplan->useHashTable && !subplan->unknownEqFalse ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+}
+
+static void
+assign_workmem_plan(Plan *plan)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Assign working memory to this node's (hashed) SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		assign_workmem_subplan(subplan);
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		assign_workmem_subplan(subplan);
+
+	/* Assign working memory to this node. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_FunctionScan:
+		case T_IncrementalSort:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+			if (plan->workmem > 0)
+				plan->workmem_limit = work_mem;
+			break;
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0)
+				plan->workmem_limit =
+					normalize_workmem(get_hash_memory_limit());
+			break;
+		case T_Agg:
+			assign_workmem_agg((Agg *) plan);
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+				plan->workmem_limit = work_mem;
+
+				if (runion->numCols > 0)
+				{
+					/* Also include memory for hash table. */
+					runion->hashWorkMemLimit =
+						normalize_workmem(get_hash_memory_limit());
+				}
+
+				break;
+			}
+		default:
+			Assert(plan->workmem == 0);
+			plan->workmem_limit = 0;
+			break;
+	}
+
+	/*
+	 * Assign working memory to this node's children. (Logic copied from
+	 * ExplainNode().)
+	 */
+	if (outerPlan(plan))
+		assign_workmem_plan(outerPlan(plan));
+
+	if (innerPlan(plan))
+		assign_workmem_plan(innerPlan(plan));
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				assign_workmem_plan(child);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_SubqueryScan:
+			assign_workmem_plan(((SubqueryScan *) plan)->subplan);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				assign_workmem_plan(child);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+assign_workmem_agg(Agg *agg)
+{
+	bool		is_first_sort = true;
+
+	/* Assign working memory to the main Agg node. */
+	assign_workmem_agg_node(agg,
+							true /* is_first */ ,
+							agg->chain == NULL /* is_last */ ,
+							&is_first_sort);
+
+	/* Assign working memory to any other grouping sets. */
+	foreach_node(Agg, aggnode, agg->chain)
+	{
+		assign_workmem_agg_node(aggnode,
+								false /* is_first */ ,
+								foreach_current_index(aggnode) ==
+								list_length(agg->chain) - 1 /* is_last */ ,
+								&is_first_sort);
+	}
+}
+
+static void
+assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+						bool *is_first_sort)
+{
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * Because nodeAgg.c will combine all AGG_HASHED nodes into a
+			 * single phase, it's easier to store the hash working-memory
+			 * limit on the first AGG_{HASHED,MIXED} node, and set it to zero
+			 * for all subsequent AGG_HASHED nodes.
+			 */
+			agg->plan.workmem_limit = is_first ?
+				normalize_workmem(get_hash_memory_limit()) : 0;
+			break;
+		case AGG_SORTED:
+
+			/*
+			 * Also store the sort-output working-memory limit on the first
+			 * AGG_SORTED node, and set it to zero for all subsequent
+			 * AGG_SORTED nodes.
+			 *
+			 * We'll need working-memory to hold the "sort_out" only if this
+			 * isn't the last Agg node (in which case there's no one to sort
+			 * our output).
+			 */
+			agg->plan.workmem_limit = *is_first_sort && !is_last ?
+				work_mem : 0;
+
+			*is_first_sort = false;
+			break;
+		default:
+			break;
+	}
+
+	/* Also include memory needed to sort the input: */
+	if (agg->numSorts > 0)
+	{
+		Assert(agg->sortWorkMem > 0);
+
+		agg->sortWorkMemLimit = work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..9e5bcf7ada4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -431,9 +434,10 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 									   int64 input_tuples, double input_card,
 									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
-static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
-							   int used_bits, double input_groups,
-							   double hashentrysize);
+static void hashagg_spill_init(HashAggSpill *spill,
+							   LogicalTapeSet *tapeset, int used_bits,
+							   double input_groups, double hashentrysize,
+							   Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +525,14 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem_limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit = aggstate->phases[1].aggnode->plan.workmem_limit;
+		Assert(workmem_limit > 0);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -577,7 +589,7 @@ fetch_input_tuple(AggState *aggstate)
  */
 static void
 initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
-					 AggStatePerGroup pergroupstate)
+					 AggStatePerGroup pergroupstate, size_t workMem)
 {
 	/*
 	 * Start a fresh sort operation for each DISTINCT/ORDER BY aggregate.
@@ -591,6 +603,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		Assert(workMem > 0);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +619,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workMem, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +629,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workMem, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -687,7 +700,8 @@ initialize_aggregates(AggState *aggstate,
 			AggStatePerTrans pertrans = &transstates[transno];
 			AggStatePerGroup pergroupstate = &pergroup[transno];
 
-			initialize_aggregate(aggstate, pertrans, pergroupstate);
+			initialize_aggregate(aggstate, pertrans, pergroupstate,
+								 aggstate->phase->aggnode->sortWorkMemLimit);
 		}
 	}
 }
@@ -1498,7 +1512,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1522,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1552,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1821,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1845,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1943,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 		}
 	}
 }
@@ -2014,9 +2031,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2095,7 +2112,8 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
 		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
 		AggStatePerGroup pergroupstate = &pergroup[transno];
 
-		initialize_aggregate(aggstate, pertrans, pergroupstate);
+		initialize_aggregate(aggstate, pertrans, pergroupstate,
+							 aggstate->phase->aggnode->sortWorkMemLimit);
 	}
 }
 
@@ -2156,7 +2174,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2649,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2739,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2938,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..5e006baa88d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		Assert(node->ss.ps.plan->workmem_limit > 0);
+		tbm = tbm_create((Size) node->ss.ps.plan->workmem_limit * 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..4ba32639f7d 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,8 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				Assert(subnode->plan->workmem_limit > 0);
+				result = tbm_create((Size) subnode->plan->workmem_limit * 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..2272185dce7 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, node->scan.plan.workmem_limit);
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..bbb93a8dd58 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d54cfe5fdbe..60afda04069 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -38,6 +38,7 @@
 #include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
@@ -449,6 +450,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -473,8 +475,12 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) node->plan.workmem_limit * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
+							worker_space_allowed,
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
@@ -601,6 +607,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,9 +665,10 @@ ExecHashTableCreate(HashState *state)
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+						size_t worker_space_allowed,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs,
@@ -690,9 +698,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -709,7 +717,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -813,8 +821,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		if (try_combined_hash_mem)
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
-									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed, false,
+									parallel_workers,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs,
@@ -1242,7 +1251,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..503d75e364b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												plannode->sort.plan.workmem_limit,
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  plannode->sort.plan.workmem_limit,
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..10f764c1bd5 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, node->ss.ps.plan->workmem_limit);
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..a3fc37745ca 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) node->plan.workmem_limit * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..96dc8d53db3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -52,6 +52,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) node->hashWorkMemLimit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +203,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
+	rustate->intermediate_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..7b71adf05dc 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) node->plan.workmem_limit * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..1da77ab1d6a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   plannode->plan.workmem_limit,
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  plannode->plan.workmem_limit,
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..73214501238 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -546,6 +546,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) subplan->hashtab_workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
@@ -575,6 +576,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) subplan->hashnul_workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..8a9e534a743 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 tstate->ss.ps.plan->workmem_limit);
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..76819d140ba 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 node->plan.workmem_limit);
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7c1fdde842b..fecea810b6e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1119,7 +1119,6 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 
-
 	/*
 	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
 	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
@@ -2875,7 +2874,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4323,6 +4323,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
+							get_hash_memory_limit(),
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
 							&space_allowed,
@@ -4651,15 +4652,19 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Estimate working memory needed for the hashtable (and hashnulls, if
-		 * needed). The logic below MUST match the logic in buildSubPlanHash()
-		 * and ExecInitSubPlan().
+		 * needed). The "nbuckets" estimate must match the logic in
+		 * buildSubPlanHash() and ExecInitSubPlan().
 		 */
 		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
 		if (nbuckets < 1)
 			nbuckets = 1;
 
+		/*
+		 * This estimate must match the logic in subpath_is_hashable() (and
+		 * see comments there).
+		 */
 		hashentrysize = MAXALIGN(plan->plan_width) +
-			MAXALIGN(SizeofMinimalTupleHeader);
+			MAXALIGN(SizeofHeapTupleHeader);
 
 		subplan->hashtab_workmem =
 			normalize_workmem((double) nbuckets * hashentrysize);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 50454952eb2..498a1a3a4b6 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -72,6 +72,7 @@ typedef struct ExplainState
 								 * entry */
 	int			num_workers;	/* # of worker processes planned to use */
 	double		total_workmem;	/* total working memory estimate (in bytes) */
+	double		total_workmem_limit;	/* total working-memory limit (in kB) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* limit; might be shared with other workers */
+	size_t		worker_space_allowed;	/* limit exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..728006b3ff5 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,8 +329,9 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
-								uint64 *ngroups_limit, int *num_partitions);
+								int used_bits, Size hash_mem_limit,
+								Size *mem_limit, uint64 *ngroups_limit,
+								int *num_partitions);
 
 /* parallel instrumentation support */
 extern void ExecAggEstimate(AggState *node, ParallelContext *pcxt);
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fc5b20994dd..6a40730c065 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -57,9 +57,10 @@ extern bool ExecParallelScanHashTableForUnmatched(HashJoinState *hjstate,
 extern void ExecHashTableReset(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+									size_t worker_space_allowed,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d3f8fd7bd6c..445953c77d3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,7 @@ typedef struct Plan
 	Cost		total_cost;
 
 	int			workmem;		/* estimated work_mem (in KB) */
+	int			workmem_limit;	/* work_mem limit per parallel worker (in KB) */
 
 	/*
 	 * planner's estimate of result size of this plan step
@@ -237,7 +238,7 @@ typedef struct Plan
 
 /* ----------------
  *	 Result node -
 *		If no outer plan, evaluate a variable-free targetlist.
  *		If outer plan, return tuples from outer plan (after a level of
  *		projection as shown by targetlist).
  *
@@ -433,6 +434,8 @@ typedef struct RecursiveUnion
 
 	/* estimated work_mem for hash table (in KB) */
 	int			hashWorkMem;
+	/* work_mem limit for hash table (in KB) */
+	int			hashWorkMemLimit;
 } RecursiveUnion;
 
 /* ----------------
@@ -1158,6 +1161,9 @@ typedef struct Agg
 	/* estimated work_mem needed to sort each input (in KB) */
 	int			sortWorkMem;
 
+	/* work_mem limit to sort each input (in KB) */
+	int			sortWorkMemLimit;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index b7d6b0fe7dc..7232d07e8b8 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1111,6 +1111,8 @@ typedef struct SubPlan
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
 	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
 	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
+	int			hashtab_workmem_limit;	/* hashtable work_mem limit (in kB) */
+	int			hashnul_workmem_limit;	/* hashnulls work_mem limit (in kB) */
 } SubPlan;
 
 /*
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
index 215180808f4..c1a3bdd93d2 100644
--- a/src/test/regress/expected/workmem.out
+++ b/src/test/regress/expected/workmem.out
@@ -29,17 +29,18 @@ order by unique1;
 ');
                          workmem_filter                          
 -----------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(10 rows)
+ Total Working Memory Limit: 12288 kB
+(11 rows)
 
 select *
 from onek
@@ -64,18 +65,19 @@ order by unique1;
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
          ->  Unique
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1, "*VALUES*".column2
                      ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 8192 kB
+(12 rows)
 
 select *
 from onek
@@ -95,17 +97,18 @@ explain (costs off, work_mem on)
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
 ');
-             workmem_filter              
------------------------------------------
+                    workmem_filter                     
+-------------------------------------------------------
  Limit
-   ->  Incremental Sort  (work_mem=N kB)
+   ->  Incremental Sort  (work_mem=N kB limit=8192 kB)
          Sort Key: tenk1.four, tenk1.ten
          Presorted Key: tenk1.four
-         ->  Sort  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: tenk1.four
                ->  Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 12288 kB
+(9 rows)
 
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
@@ -131,16 +134,17 @@ where exists (select 1 from tenk1 t3
    ->  Nested Loop
          ->  Hash Join
                Hash Cond: (t3.thousand = t1.unique1)
-               ->  HashAggregate  (work_mem=N kB)
+               ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                      Group Key: t3.thousand, t3.tenthous
                      ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
-               ->  Hash  (work_mem=N kB)
+               ->  Hash  (work_mem=N kB limit=8192 kB)
                      ->  Index Only Scan using onek_unique1 on onek t1
                            Index Cond: (unique1 < 1)
          ->  Index Only Scan using tenk1_hundred on tenk1 t2
                Index Cond: (hundred = t3.tenthous)
  Total Working Memory: N kB
-(13 rows)
+ Total Working Memory Limit: 16384 kB
+(14 rows)
 
 select count(*) from (
 select t1.unique1, t2.hundred
@@ -165,23 +169,24 @@ from int4_tbl t1, int4_tbl t2
 where t4.f1 is null
 ) t;
 ');
-                       workmem_filter                        
--------------------------------------------------------------
+                              workmem_filter                              
+--------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Nested Loop Left Join
                Filter: (t4.f1 IS NULL)
                ->  Seq Scan on int4_tbl t2
-               ->  Materialize  (work_mem=N kB)
+               ->  Materialize  (work_mem=N kB limit=4096 kB)
                      ->  Nested Loop Left Join
                            Join Filter: (t3.f1 > 1)
                            ->  Seq Scan on int4_tbl t3
                                  Filter: (f1 > 0)
-                           ->  Materialize  (work_mem=N kB)
+                           ->  Materialize  (work_mem=N kB limit=4096 kB)
                                  ->  Seq Scan on int4_tbl t4
          ->  Seq Scan on int4_tbl t1
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 8192 kB
+(15 rows)
 
 select count(*) from (
 select t1.f1
@@ -204,16 +209,17 @@ group by grouping sets((a, b), (a));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Hash Key: "*VALUES*".column1, "*VALUES*".column2
                Hash Key: "*VALUES*".column1
                ->  Values Scan on "*VALUES*"
                      Filter: (column1 = column2)
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 16384 kB
+(10 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1), (2, 2)) as t (a, b) where a = b
@@ -236,10 +242,10 @@ group by grouping sets((a, b), (a), (b), (c), (d));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  GroupAggregate  (work_mem=N kB)
+         ->  GroupAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                Group Key: "*VALUES*".column1
                Sort Key: "*VALUES*".column2
@@ -248,12 +254,13 @@ group by grouping sets((a, b), (a), (b), (c), (d));
                  Group Key: "*VALUES*".column3
                Sort Key: "*VALUES*".column4
                  Group Key: "*VALUES*".column4
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1
                      ->  Values Scan on "*VALUES*"
                            Filter: (column1 = column2)
  Total Working Memory: N kB
-(17 rows)
+ Total Working Memory Limit: 20480 kB
+(18 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
@@ -282,17 +289,18 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select length(stringu1) from tenk1 group by length(stringu1);
 ');
-                   workmem_filter                   
-----------------------------------------------------
- Finalize HashAggregate  (work_mem=N kB)
+                          workmem_filter                           
+-------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB limit=8192 kB)
    Group Key: (length((stringu1)::text))
    ->  Gather
          Workers Planned: 4
-         ->  Partial HashAggregate  (work_mem=N kB)
+         ->  Partial HashAggregate  (work_mem=N kB limit=40960 kB)
                Group Key: length((stringu1)::text)
                ->  Parallel Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 49152 kB
+(9 rows)
 
 select length(stringu1) from tenk1 group by length(stringu1);
  length 
@@ -307,12 +315,13 @@ reset max_parallel_workers_per_gather;
 -- Agg (simple) [no work_mem]
 explain (costs off, work_mem on)
 select MAX(length(stringu1)) from tenk1;
-         QUERY PLAN         
-----------------------------
+            QUERY PLAN            
+----------------------------------
  Aggregate
    ->  Seq Scan on tenk1
  Total Working Memory: 0 kB
-(3 rows)
+ Total Working Memory Limit: 0 kB
+(4 rows)
 
 select MAX(length(stringu1)) from tenk1;
  max 
@@ -328,12 +337,13 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 ) t;
 ');
-                      workmem_filter                       
------------------------------------------------------------
+                             workmem_filter                              
+-------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series a  (work_mem=N kB)
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 4096 kB
+(4 rows)
 
 select count(*) from (
 select sum(n) over(partition by m)
@@ -352,12 +362,13 @@ from rows from(generate_series(1, 5),
                generate_series(2, 10),
                generate_series(4, 15));
 ');
-                     workmem_filter                      
----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series  (work_mem=N kB)
+   ->  Function Scan on generate_series  (work_mem=N kB limit=12288 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 12288 kB
+(4 rows)
 
 select count(*)
 from rows from(generate_series(1, 5),
@@ -384,13 +395,14 @@ SELECT  xmltable.*
                                   unit text PATH ''SIZE/@unit'',
                                   premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
 ');
-                      workmem_filter                      
-----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Nested Loop
    ->  Seq Scan on xmldata
-   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(4 rows)
+ Total Working Memory Limit: 4096 kB
+(5 rows)
 
 SELECT  xmltable.*
    FROM (SELECT data FROM xmldata) x,
@@ -418,7 +430,8 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
    ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
          Filter: (unique2 <> 10)
  Total Working Memory: 0 kB
-(5 rows)
+ Total Working Memory Limit: 0 kB
+(6 rows)
 
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1 
@@ -435,11 +448,12 @@ select count(*) from
                           workmem_filter                          
 ------------------------------------------------------------------
  Aggregate
-   ->  HashSetOp Intersect  (work_mem=N kB)
+   ->  HashSetOp Intersect  (work_mem=N kB limit=8192 kB)
          ->  Seq Scan on tenk1
          ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
  Total Working Memory: N kB
-(5 rows)
+ Total Working Memory Limit: 8192 kB
+(6 rows)
 
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -456,23 +470,24 @@ cross join lateral (with recursive x(a) as (
           select o.four as a union select a + 1 from x where a < 10)
     select * from x) ss where o.ten = 1;
 ');
-                       workmem_filter                       
-------------------------------------------------------------
+                              workmem_filter                               
+---------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Memoize  (work_mem=N kB)
+         ->  Memoize  (work_mem=N kB limit=8192 kB)
                Cache Key: o.four
                Cache Mode: binary
-               ->  CTE Scan on x  (work_mem=N kB)
+               ->  CTE Scan on x  (work_mem=N kB limit=4096 kB)
                      CTE x
-                       ->  Recursive Union  (work_mem=N kB)
+                       ->  Recursive Union  (work_mem=N kB limit=16384 kB)
                              ->  Result
                              ->  WorkTable Scan on x x_1
                                    Filter: (a < 10)
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 28672 kB
+(15 rows)
 
 select sum(o.four), sum(ss.a) from onek o
 cross join lateral (with recursive x(a) as (
@@ -491,20 +506,21 @@ WITH q1(x,y) AS (
   )
 SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
 ');
-                   workmem_filter                   
-----------------------------------------------------
+                          workmem_filter                          
+------------------------------------------------------------------
  Aggregate
    CTE q1
-     ->  HashAggregate  (work_mem=N kB)
+     ->  HashAggregate  (work_mem=N kB limit=8192 kB)
            Group Key: tenk1.hundred
            ->  Seq Scan on tenk1
    InitPlan 2
      ->  Aggregate
-           ->  CTE Scan on q1 qsub  (work_mem=N kB)
-   ->  CTE Scan on q1  (work_mem=N kB)
+           ->  CTE Scan on q1 qsub  (work_mem=N kB limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB limit=4096 kB)
          Filter: ((y)::numeric > (InitPlan 2).col1)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 16384 kB
+(12 rows)
 
 WITH q1(x,y) AS (
     SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
@@ -522,15 +538,16 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 limit 5;
 ');
-                            workmem_filter                             
------------------------------------------------------------------------
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
  Limit
-   ->  WindowAgg  (work_mem=N kB)
-         ->  Sort  (work_mem=N kB)
+   ->  WindowAgg  (work_mem=N kB limit=4096 kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: ((a.n < 3))
-               ->  Function Scan on generate_series a  (work_mem=N kB)
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 12288 kB
+(7 rows)
 
 select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
@@ -560,20 +577,21 @@ select * from tenk1 a join tenk1 b on
          ->  Bitmap Heap Scan on tenk1 b
                Recheck Cond: ((hundred = 4) OR (unique1 = 2))
                ->  BitmapOr
-                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB limit=4096 kB)
                            Index Cond: (hundred = 4)
-                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                            Index Cond: (unique1 = 2)
-         ->  Materialize  (work_mem=N kB)
+         ->  Materialize  (work_mem=N kB limit=4096 kB)
                ->  Bitmap Heap Scan on tenk1 a
                      Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
                      ->  BitmapOr
-                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique2 = 3)
-                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique1 = 1)
  Total Working Memory: N kB
-(19 rows)
+ Total Working Memory Limit: 20480 kB
+(20 rows)
 
 select count(*) from (
 select * from tenk1 a join tenk1 b on
@@ -589,15 +607,16 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
 ');
-       workmem_filter       
-----------------------------
- Result  (work_mem=N kB)
+             workmem_filter             
+----------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 1
      ->  Append
            ->  Result
            ->  Result
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 16384 kB
+(7 rows)
 
 select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
  ?column? 
@@ -612,16 +631,17 @@ select 1 = any (select (select 1) where 1 = any (select 1));
 ');
                          workmem_filter                         
 ----------------------------------------------------------------
- Result  (work_mem=N kB)
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 3
-     ->  Result  (work_mem=N kB)
+     ->  Result  (work_mem=N kB limit=8192 kB)
            One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
            InitPlan 1
              ->  Result
            SubPlan 2
              ->  Result
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 24576 kB
+(10 rows)
 
 select 1 = any (select (select 1) where 1 = any (select 1));
  ?column? 
-- 
2.47.1

Attachment: v02_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch
From 13df98b6939852b8dd18be7adc5702c6ba38e1fb Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:41:31 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 654 ++++++++++++++++++++++++++
 src/backend/executor/execWorkmem.c   |  37 +-
 src/include/executor/executor.h      |   4 +
 8 files changed, 1716 insertions(+), 10 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..a2c6d3be4d2
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=25600 kB)
+   ->  Sort  (work_mem=N kB limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20480 kB)
+   ->  Sort  (work_mem=N kB limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=102400 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=102399 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                    
+--------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=34133 kB)
+         ->  Sort  (work_mem=N kB limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=34134 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=1024 kB)
+   ->  Sort  (work_mem=N kB limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=819 kB)
+   ->  Sort  (work_mem=N kB limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=820 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=4095 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=1365 kB)
+         ->  Sort  (work_mem=N kB limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=1366 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20 kB)
+   ->  Sort  (work_mem=N kB limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=16 kB)
+   ->  Sort  (work_mem=N kB limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=80 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=78 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                  workmem_filter                                   
+-----------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=26 kB)
+         ->  Sort  (work_mem=N kB limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=27 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+           workmem_filter            
+-------------------------------------
+ Result  (work_mem=N kB limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..e6dbc35bf10
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..c758e49c162
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,654 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Extension that adjusts each plan node's work_mem limit, so that the
+ *	  query stays within workmem.query_work_mem.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
							"Amount of working memory (in kB) to provide to "
							"each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+/* Compute an Agg's working memory estimate and limit. */
+typedef struct AggWorkMem
+{
+	uint64		hash_workmem;
+	int		   *hash_limit;
+
+	int			num_sorts;
+	int			max_sort_workmem;
+	int		   *sort_limit;
+}			AggWorkMem;
+
+static void
+workmem_analyze_agg_node(Agg *agg, AggWorkMem * mem,
+						 WorkMemStats * workmem_stats)
+{
+	if (agg->sortWorkMem > 0 || agg->sortWorkMemLimit > 0)
+	{
+		/* Record memory used for input sort buffers. */
+		Target	   *target = make_target(agg->sortWorkMem,
+										 &agg->sortWorkMemLimit,
+										 agg->numSorts);
+
+		add_target(workmem_stats, target);
+	}
+
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			mem->hash_workmem += agg->plan.workmem;
+
+			/* Read hash limit from the first AGG_HASHED node. */
+			if (mem->hash_limit == NULL)
+				mem->hash_limit = &agg->plan.workmem_limit;
+
+			break;
+		case AGG_SORTED:
+
+			++mem->num_sorts;
+
+			mem->max_sort_workmem = Max(mem->max_sort_workmem, agg->plan.workmem);
+
+			/* Read sort limit from the first AGG_SORTED node. */
+			if (mem->sort_limit == NULL)
+				mem->sort_limit = &agg->plan.workmem_limit;
+
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze_agg(Agg *agg, int num_workers, WorkMemStats * workmem_stats)
+{
+	AggWorkMem	mem;
+
+	memset(&mem, 0, sizeof(mem));
+
+	/* Analyze main Agg node. */
+	workmem_analyze_agg_node(agg, &mem, workmem_stats);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		workmem_analyze_agg_node(aggnode, &mem, workmem_stats);
+
+	/*
+	 * Working memory for hash tables, if needed. All hash tables share the
+	 * same limit:
+	 */
+	if (mem.hash_workmem > 0 || mem.hash_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.hash_workmem, mem.hash_limit,
+						1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	/*
+	 * Working memory for (output) sort buffers, if needed. We'll need at most
+	 * 2 sort buffers:
+	 */
+	if (mem.max_sort_workmem > 0 || mem.sort_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.max_sort_workmem, mem.sort_limit,
+						Min(mem.num_sorts, 2) * (1 + num_workers));
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_subplan(SubPlan *subplan, int num_workers,
+						WorkMemStats * workmem_stats)
+{
+	if (subplan->hashtab_workmem > 0 || subplan->hashtab_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash table */
+		Target	   *target = make_target(subplan->hashtab_workmem,
+										 &subplan->hashtab_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	if (subplan->hashnul_workmem > 0 || subplan->hashnul_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash-NULL table */
+		Target	   *target = make_target(subplan->hashnul_workmem,
+										 &subplan->hashnul_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_plan(Plan *plan, int num_workers, WorkMemStats * workmem_stats)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Analyze this node's SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	if (IsA(plan, Gather) || IsA(plan, GatherMerge))
+	{
+		/*
+		 * Parallel query apparently does not run InitPlans in parallel. Well,
+		 * currently, Gather and GatherMerge Plan nodes don't contain any
+		 * quals, so they can't contain SubPlans at all; so maybe we should
+		 * move this below the SubPlan-analysis loop, as well? For now, to
+		 * maintain consistency with explain.c, we'll just leave this here.
+		 */
+		Assert(num_workers == 0);
+
+		if (IsA(plan, Gather))
+			num_workers = ((Gather *) plan)->num_workers;
+		else
+			num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	/* Analyze this node's working memory. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 1 + num_workers);
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_Agg:
+			workmem_analyze_agg((Agg *) plan, num_workers, workmem_stats);
+			break;
+		case T_FunctionScan:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				int			nfuncs =
+					list_length(((FunctionScan *) plan)->functions);
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 nfuncs * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_IncrementalSort:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 2 * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+				Target	   *target;
+
+				/* working memory for two tuplestores */
+				target = make_target(plan->workmem, &plan->workmem_limit,
+									 2 * (1 + num_workers));
+				add_target(workmem_stats, target);
+
+				/* working memory for a hash table, if needed */
+				if (runion->hashWorkMem > 0 || runion->hashWorkMemLimit > 0)
+				{
+					target = make_target(runion->hashWorkMem,
+										 &runion->hashWorkMemLimit,
+										 1 + num_workers);
+					add_target(workmem_stats, target);
+				}
+			}
+			break;
+		default:
+			Assert(plan->workmem == 0);
+			Assert(plan->workmem_limit == 0);
+			break;
+	}
+
+	/* Now analyze this Plan's children. */
+	if (outerPlan(plan))
+		workmem_analyze_plan(outerPlan(plan), num_workers, workmem_stats);
+
+	if (innerPlan(plan))
+		workmem_analyze_plan(innerPlan(plan), num_workers, workmem_stats);
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_SubqueryScan:
+			workmem_analyze_plan(((SubqueryScan *) plan)->subplan,
+								 num_workers, workmem_stats);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	/* Analyze the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			workmem_analyze_plan(plan, 0 /* num_workers */ , workmem_stats);
+	}
+
+	/* Analyze the main Plan tree itself. */
+	workmem_analyze_plan(plannedstmt->planTree, 0 /* num_workers */ ,
+						 workmem_stats);
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If the ratio was the binding constraint (target_ratio *
+			 * target->limit <= target->priority), so we restricted this
+			 * target to less memory than (we estimate) it needs, then the
+			 * target_ratio will stay the same: letting A = remaining, B =
+			 * limit, R = A/B, and X = this target's limit, the next ratio is
+			 *
+			 * (A - R*X) / (B - X) = (R*B - R*X) / (B - X) = R
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if the priority was the binding constraint
+			 * (target->priority < target_ratio * target->limit), so we didn't
+			 * need to restrict this target beyond its priority estimate, then
+			 * the target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	/*
+	 * Start by assigning default working memory to all of this query's Plan
+	 * nodes.
+	 */
+	standard_ExecAssignWorkMem(plannedstmt);
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) a->ptr_value)->workmem -
+			((Target *) b->ptr_value)->workmem;
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) b->ptr_value)->workmem -
+			((Target *) a->ptr_value)->workmem;
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index c513b90fc77..8a3e52c8968 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -57,6 +57,9 @@
 #include "optimizer/cost.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
 /* decls for local routines only used within this module */
 static void assign_workmem_subplan(SubPlan *subplan);
 static void assign_workmem_plan(Plan *plan);
@@ -81,16 +84,32 @@ static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We bail out here, rather than in standard_ExecAssignWorkMem(), in
+		 * case the hook wants to re-assign memory on parallel workers,
+		 * possibly after calling standard_ExecAssignWorkMem() first.
+		 */
+		if (IsParallelWorker())
+			return;
 
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
 	/* Assign working memory to the Plans referred to by SubPlan objects. */
 	foreach_ptr(Plan, plan, plannedstmt->subplans)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

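For readers following along, the "low memory" branch of workmem_set() in the patch above (sort targets by ascending greediness ratio, then scale each target's limit down, capped at its priority) can be sketched standalone. This is a minimal, illustrative Python rendering; the dict fields and simplifications (no parallel-worker counts folded in, no empty-list guard) are mine, not the patch's:

```python
# Illustrative sketch (not code from the patch) of workmem_set()'s
# "low memory" case: scale every target's limit down, roughly uniformly,
# until total granted memory fits within query_work_mem.
def distribute_low_memory(targets, query_work_mem):
    """Each target dict carries 'workmem' (the estimate), 'limit' (the
    original work_mem limit), and 'count' (# of data structures)."""
    for t in targets:
        # Priority: the estimate, capped at the original limit.
        t["priority"] = min(t["workmem"], t["limit"])
        t["ratio"] = t["priority"] / t["limit"]

    # Least-greedy targets first: their surplus flows to greedier ones.
    targets.sort(key=lambda t: (t["ratio"], t["workmem"]))

    remaining = query_work_mem
    total_limit = sum(t["limit"] * t["count"] for t in targets)
    for t in targets:
        target_ratio = remaining / total_limit
        # Grant at most the priority request, scaled by the current ratio.
        t["granted"] = min(t["priority"], int(target_ratio * t["limit"]))
        remaining -= t["granted"] * t["count"]
        total_limit -= t["limit"] * t["count"]
    return targets
```

As in the patch, a target whose ratio is below the current target_ratio keeps its full priority and returns its surplus to the pool, which raises target_ratio for the greedier targets that follow; the total granted never exceeds query_work_mem.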
#17James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#16)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, Feb 24, 2025 at 1:46 PM James Hunter <james.hunter.pg@gmail.com> wrote:

On Mon, Feb 24, 2025 at 12:46 PM James Hunter <james.hunter.pg@gmail.com> wrote:

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

Only change in revision 2 is to Patch 3: adding 'execWorkmem.c' to
meson.build. As I use the gcc "Makefile" build on my dev machine, I did not
notice this omission until CFBot complained.

I bumped rev numbers on all other patches, even though they have not
changed, because I am unfamiliar with CFBot and am trying not to
confuse it (to minimize unnecessary email churn...)

Anyway, the patch set Works On My PC, and with any luck it will work
on CFBot as well now.

Apologies for email churn. The attached patch set, "v03," Works On My
PC. Only change from "v02" is correcting a missing #include in my new
extension, in Patch 4. (Patches 1-3 remain unchanged from v02.)

James

Attachments:

v03_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch
From 099366618d3f15f69bd9542d7d31f82148889a11 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 24 Jan 2025 20:48:39 +0000
Subject: [PATCH 1/4] EXPLAIN now takes "work_mem" option, to display estimated
 working memory

This commit adds option "WORK_MEM" to the existing EXPLAIN command. When
set to ON, the EXPLAIN output will include text of the form "(work_mem=
5.67 kB)" on every plan node that uses working memory.

The output is an *estimate*, typically based on the estimated number of
input rows for that plan node.

Normalize "working-memory" estimates to a minimum of 64 KB

The minimum possible value of the "work_mem" GUC is 64 KB. This commit
changes the tracking + output for "EXPLAIN (WORK_MEM ON)" so that it
reports a minimum of 64 KB for every node or subcomponent that requires
working memory.

It also rounds "nbytes" up to the nearest whole KB (= ceil()), and
changes the EXPLAIN output to report a whole integer, rather than to
two decimal places. Note that 1 KB is only about 1.6 percent of the
64 KB minimum.

To allow for future optimizers to make decisions at Path time, this commit
aggregates the Path's total working memory onto the Path's "workmem" field.
To allow the executor to restrict memory usage by individual data
structure, it then breaks that total working memory into per-data structure
working memory, on the Plan.

Also adds a "Total Working Memory" line at the bottom of the
plan output.
---
 src/backend/commands/explain.c          | 207 ++++++++
 src/backend/executor/nodeHash.c         |  15 +-
 src/backend/nodes/tidbitmap.c           |  18 +
 src/backend/optimizer/path/costsize.c   | 387 ++++++++++++++-
 src/backend/optimizer/plan/createplan.c | 215 +++++++-
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 +-
 src/include/commands/explain.h          |   3 +
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/pathnodes.h           |  11 +
 src/include/nodes/plannodes.h           |  11 +
 src/include/nodes/primnodes.h           |   2 +
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  12 +-
 src/include/optimizer/planmain.h        |   2 +-
 src/test/regress/expected/workmem.out   | 631 ++++++++++++++++++++++++
 src/test/regress/parallel_schedule      |   2 +-
 src/test/regress/sql/workmem.sql        | 303 ++++++++++++
 18 files changed, 1828 insertions(+), 60 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c0d614866a9..e09d7f868c9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,6 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+static void compute_subplan_workmem(List *plans, double *workmem);
+static void compute_agg_workmem(Agg *agg, double *workmem);
 
 
 
@@ -235,6 +237,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -835,6 +839,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory", "kB",
+							 es->total_workmem, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1970,6 +1980,77 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_workmem = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				break;
+			case T_FunctionScan:
+				{
+					FunctionScan *fscan = (FunctionScan *) plan;
+
+					plan_workmem += (double) plan->workmem *
+						list_length(fscan->functions);
+					break;
+				}
+			case T_IncrementalSort:
+
+				/*
+				 * IncrementalSort creates two Tuplestores, each of
+				 * (estimated) size workmem.
+				 */
+				plan_workmem += (double) plan->workmem * 2;
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					/*
+					 * RecursiveUnion creates two Tuplestores, each of
+					 * (estimated) size workmem, plus (possibly) a hash table
+					 * of size hashWorkMem.
+					 */
+					plan_workmem += (double) plan->workmem * 2 +
+						runion->hashWorkMem;
+					break;
+				}
+			default:
+				if (plan->workmem > 0)
+					plan_workmem += plan->workmem;
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_workmem *= (1 + es->num_workers);
+
+		es->total_workmem += plan_workmem;
+
+		if (plan_workmem > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
+								 plan_workmem);
+			else
+				ExplainPropertyFloat("Working Memory", "kB",
+									 plan_workmem, 0, es);
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2536,6 +2617,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlan-s, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = ((Gather *) plan)->num_workers;
+		else
+			es->num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2592,6 +2687,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -5952,3 +6053,109 @@ GetSerializationMetrics(DestReceiver *dest)
 
 	return empty;
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a SubPlan object
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments the caller's counter to include the SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *workmem)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem > 0)
+			*workmem += sp->hashtab_workmem;
+
+		if (sp->hashnul_workmem > 0)
+			*workmem += sp->hashnul_workmem;
+	}
+}
+
+/* Compute an Agg's working memory estimate. */
+typedef struct AggWorkMem
+{
+	double		input_sort_workmem;
+
+	double		output_hash_workmem;
+
+	int			num_sort_nodes;
+	double		max_output_sort_workmem;
+}			AggWorkMem;
+
+static void
+compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
+{
+	/* Record memory used for input sort buffers. */
+	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+
+	/* Record memory used for output data structures. */
+	switch (agg->aggstrategy)
+	{
+		case AGG_SORTED:
+
+			/* We'll have at most two sort buffers alive, at any time. */
+			mem->max_output_sort_workmem =
+				Max(mem->max_output_sort_workmem, agg->plan.workmem);
+
+			++mem->num_sort_nodes;
+			break;
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * All hash tables created by "hash" phases are kept for the
+			 * lifetime of the Agg.
+			 */
+			mem->output_hash_workmem += agg->plan.workmem;
+			break;
+		default:
+
+			/*
+			 * "Plain" phases don't use working memory (they output a single
+			 * aggregated tuple).
+			 */
+			break;
+	}
+}
+
+/*
+ * compute_agg_workmem - compute total workmem for an Agg node
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node. However, to make life
+ * easier for the executor, we store the estimated working memory ("workmem")
+ * on each individual Agg node.
+ *
+ * This function computes the combined workmem, so that we can display this
+ * value on the main Agg node.
+ */
+static void
+compute_agg_workmem(Agg *agg, double *workmem)
+{
+	AggWorkMem	mem;
+	ListCell   *lc;
+
+	memset(&mem, 0, sizeof(mem));
+
+	compute_agg_workmem_node(agg, &mem);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach(lc, agg->chain)
+	{
+		Agg		   *aggnode = (Agg *) lfirst(lc);
+
+		compute_agg_workmem_node(aggnode, &mem);
+	}
+
+	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+
+	/* We'll have at most two sort buffers alive, at any time. */
+	*workmem += mem.num_sort_nodes > 2 ?
+		mem.max_output_sort_workmem * 2.0 :
+		mem.max_output_sort_workmem;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..d54cfe5fdbe 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -452,6 +453,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -477,7 +479,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -661,7 +663,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -792,6 +795,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	*workmem = normalize_workmem(inner_rel_bytes + bucket_bytes);
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -811,7 +817,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
@@ -929,7 +936,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
 		*space_allowed = (*space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 66b3c387d53..43df31cdb21 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1558,6 +1558,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..7c1fdde842b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -104,6 +104,7 @@
 #include "optimizer/plancat.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +201,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1118,17 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1605,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1672,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1778,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1864,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1891,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(nrterm->rows,
+												 nrterm->pathtarget->width),
+							  relation_byte_size(rterm->rows,
+												 rterm->pathtarget->width))
+						  * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_workmem(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1961,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1981,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2039,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2069,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2144,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2185,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full"), and that sorts only suffix columns ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_workmem(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2217,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2230,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2522,6 +2591,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2592,6 +2662,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_workmem(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2866,6 +2939,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSorts > 0)
+	{
+		path->workmem += (double) aggcosts->numSorts *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3100,7 +3186,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3182,6 +3268,10 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_workmem(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3336,6 +3427,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3799,6 +3891,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4170,6 +4270,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4227,7 +4328,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4258,6 +4360,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4266,8 +4369,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4284,6 +4387,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4510,6 +4614,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4532,6 +4637,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4541,6 +4649,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Estimate working memory needed for the hashtable (and hashnulls, if
+		 * needed). The logic below MUST match the logic in buildSubPlanHash()
+		 * and ExecInitSubPlan().
+		 */
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem =
+			normalize_workmem((double) nbuckets * hashentrysize);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem =
+				normalize_workmem((double) nbuckets * hashentrysize);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
@@ -6424,7 +6563,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6603,3 +6742,197 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for an aggregate's input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they need to sort their
+ * input, before they can process it. We need one sort buffer per such
+ * aggregate, and this function returns that sort buffer's (estimated) size (in
+ * KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	/* Account for size of one buffer needed to sort the input. */
+	return normalize_workmem(input_tuples * input_width);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_workmem(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_workmem(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total work_mem needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its ancestors'
+	 * results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_workmem
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes to a minimum of 64 (KB), rounding up to the nearest whole KB.
+ */
+int
+normalize_workmem(double nbytes)
+{
+	double		workmem;
+
+	/*
+	 * We'll assign working memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nbytes / 1024.0);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..973b86371ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -1853,6 +1854,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 groupCollations,
 								 NIL,
 								 NIL,
+								 0, /* numSorts */
 								 best_path->path.rows,
 								 0,
 								 subplan);
@@ -1911,6 +1913,15 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 	/* Copy cost data from Path to Plan */
 	copy_generic_path_info(plan, &best_path->path);
 
+	if (IsA(plan, Unique))
+	{
+		/*
+		 * We assigned "workmem" to the Sort subplan. Clear it from the
+		 * top-level Unique node, to avoid double-counting.
+		 */
+		plan->workmem = 0;
+	}
+
 	return plan;
 }
 
@@ -2228,6 +2239,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	/*
+	 * IncrementalSort creates two sort buffers, which the Path's "workmem"
+	 * estimate combined into a single value. Split it into two now.
+	 */
+	plan->sort.plan.workmem =
+		normalize_workmem(best_path->spath.path.workmem / 2);
+
 	return plan;
 }
 
@@ -2333,12 +2351,29 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 												subplan->targetlist),
 					NIL,
 					NIL,
+					best_path->numSorts,
 					best_path->numGroups,
 					best_path->transitionSpace,
 					subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the overall workmem estimate that we copied from the Path with
+	 * finer-grained estimates.
+	 */
+	plan->plan.workmem =
+		compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+								   plan->transitionSpace, subplan->plan_rows,
+								   subplan->plan_width, false /* cost_sort */ );
+
+	/* Also include estimated memory needed to sort the input: */
+	if (plan->numSorts > 0)
+	{
+		plan->sortWorkMem = compute_agg_input_workmem(subplan->plan_rows,
+													  subplan->plan_width);
+	}
+
 	return plan;
 }
 
@@ -2457,8 +2492,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2480,19 +2516,20 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								best_path->numSorts,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,7 +2540,36 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
-			chain = lappend(chain, agg_plan);
+			/*
+			 * If we're an AGG_SORTED, but not the last rollup, we must cost
+			 * the working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/*
+			 * Although this side node doesn't need accurate cost estimates,
+			 * it does need an accurate *memory* estimate, since we'll use
+			 * that estimate to distribute working memory to this side node,
+			 * at runtime.
+			 */
+
+			/* Estimated memory needed to hold the output: */
+			agg_plan->plan.workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width, cost_sort);
+
+			/* Also include estimated memory needed to sort the input: */
+			if (agg_plan->numSorts > 0)
+			{
+				agg_plan->sortWorkMem =
+					compute_agg_input_workmem(subplan->plan_rows,
+											  subplan->plan_width);
+			}
+
+			chain = lappend(chain, (Plan *) agg_plan);
 		}
 	}
 
@@ -2514,6 +2580,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2529,12 +2596,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 						extract_grouping_collations(rollup->groupClause, subplan->targetlist),
 						rollup->gsets,
 						chain,
+						best_path->numSorts,
 						rollup->numGroups,
 						best_path->transitionSpace,
 						subplan);
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * If we're an AGG_SORTED, but not the last rollup, we must cost the
+		 * working memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates.
+		 */
+		plan->plan.workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows, subplan->plan_width,
+									   cost_sort);
+
+		/* Also include estimated memory needed to sort the input: */
+		if (plan->numSorts > 0)
+		{
+			plan->sortWorkMem =
+				compute_agg_input_workmem(subplan->plan_rows,
+										  subplan->plan_width);
+		}
 	}
 
 	return (Plan *) plan;
@@ -2783,6 +2875,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(leftplan->plan_rows,
+												 leftplan->plan_width),
+							  relation_byte_size(rightplan->plan_rows,
+												 rightplan->plan_width)));
+
+	if (plan->numCols > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		entrysize;
+
+		entrysize = sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMem =
+			normalize_workmem(plan->numGroups * entrysize);
+	}
+
 	return plan;
 }
 
@@ -3223,6 +3347,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3309,6 +3434,12 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&scan_plan->scan.plan, &best_path->path);
 
+	/*
+	 * We assigned "workmem" to the "bitmapqualplan" subplan. Clear it from
+	 * the top-level BitmapHeapScan node, to avoid double-counting.
+	 */
+	scan_plan->scan.plan.workmem = 0;
+
 	return scan_plan;
 }
 
@@ -3334,9 +3465,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its ancestors'
+	 * results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3362,6 +3508,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3373,8 +3521,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3409,6 +3556,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3437,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3484,8 +3632,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->workmem =
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
@@ -3796,6 +3945,14 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem =
+		normalize_workmem(relation_byte_size(scan_plan->scan.plan.plan_rows,
+											 scan_plan->scan.plan.plan_width));
+
 	return scan_plan;
 }
 
@@ -4615,6 +4772,9 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
+		matplan->workmem =
+			normalize_workmem(relation_byte_size(matplan->plan_rows,
+												 matplan->plan_width));
 
 		inner_plan = matplan;
 	}
@@ -4961,6 +5121,10 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Display "workmem" on the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem = join_plan->join.plan.workmem;
+	join_plan->join.plan.workmem = 0;
+
 	return join_plan;
 }
 
@@ -5458,6 +5622,7 @@ copy_generic_path_info(Plan *dest, Path *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = (int) Min(src->workmem, (double) MAX_KILOBYTES);
 	dest->plan_rows = src->rows;
 	dest->plan_width = src->pathtarget->width;
 	dest->parallel_aware = src->parallel_aware;
@@ -5474,6 +5639,7 @@ copy_plan_costsize(Plan *dest, Plan *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = src->workmem;
 	dest->plan_rows = src->plan_rows;
 	dest->plan_width = src->plan_width;
 	/* Assume the inserted node is not parallel-aware. */
@@ -5509,6 +5675,7 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 			  limit_tuples);
 	plan->plan.startup_cost = sort_path.startup_cost;
 	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.workmem = (int) Min(sort_path.workmem, (double) MAX_KILOBYTES);
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5540,6 +5707,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 						  limit_tuples);
 	plan->sort.plan.startup_cost = sort_path.startup_cost;
 	plan->sort.plan.total_cost = sort_path.total_cost;
+	plan->sort.plan.workmem = (int) Min(sort_path.workmem,
+										(double) MAX_KILOBYTES);
 	plan->sort.plan.plan_rows = lefttree->plan_rows;
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
@@ -6673,7 +6842,7 @@ Agg *
 make_agg(List *tlist, List *qual,
 		 AggStrategy aggstrategy, AggSplit aggsplit,
 		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-		 List *groupingSets, List *chain, double dNumGroups,
+		 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 		 Size transitionSpace, Plan *lefttree)
 {
 	Agg		   *node = makeNode(Agg);
@@ -6689,6 +6858,8 @@ make_agg(List *tlist, List *qual,
 	node->grpColIdx = grpColIdx;
 	node->grpOperators = grpOperators;
 	node->grpCollations = grpCollations;
+	node->numSorts = numSorts;
+	node->sortWorkMem = 0;		/* caller will fill this */
 	node->numGroups = numGroups;
 	node->transitionSpace = transitionSpace;
 	node->aggParams = NULL;		/* SS_finalize_plan() will fill this */
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..3eba364484d 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSorts;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..c533bfb9a58 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem to the hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_workmem(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSorts = aggcosts ? aggcosts->numSorts : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSorts = agg_costs ? agg_costs->numSorts : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_workmem(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 570e7cad1fa..50454952eb2 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,8 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes planned to use */
+	double		total_workmem;	/* total working memory estimate (in bytes) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..fc5b20994dd 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -62,7 +62,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..17eb6b52579 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSorts;		/* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -1697,6 +1698,13 @@ typedef struct Path
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
 
+	/*
+	 * NOTE: The Path's workmem is a double, rather than an int, because it
+	 * sometimes combines multiple working-memory estimates (e.g., for
+	 * GroupingSetsPath).
+	 */
+	Cost		workmem;		/* estimated work_mem (in KB) */
+
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
 } Path;
@@ -2290,6 +2298,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSorts;		/* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2331,6 +2340,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSorts;		/* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3374,6 +3384,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..67da7f091b5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -168,6 +168,8 @@ typedef struct Plan
 	/* total cost (assuming all tuples fetched) */
 	Cost		total_cost;
 
+	int			workmem;		/* estimated work_mem (in KB) */
+
 	/*
 	 * planner's estimate of result size of this plan step
 	 */
@@ -426,6 +428,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* estimated work_mem for hash table (in KB) */
+	int			hashWorkMem;
 } RecursiveUnion;
 
 /* ----------------
@@ -1145,6 +1150,12 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
+
+	/* estimated work_mem needed to sort each input (in KB) */
+	int			sortWorkMem;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 839e71d52f4..b7d6b0fe7dc 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,8 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
+	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
 } SubPlan;
 
 /*
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index a6ffeac90be..df8e7de9dc2 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -85,6 +85,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..737c553a409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,17 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_workmem(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..cf3694a744f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
 extern Agg *make_agg(List *tlist, List *qual,
 					 AggStrategy aggstrategy, AggSplit aggsplit,
 					 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-					 List *groupingSets, List *chain, double dNumGroups,
+					 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 					 Size transitionSpace, Plan *lefttree);
 extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
 						 LimitOption limitOption, int uniqNumCols,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..215180808f4
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,631 @@
+----
+-- Tests that show "work_mem" output in EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(10 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+             workmem_filter              
+-----------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory: N kB
+(13 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                       workmem_filter                        
+-------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory: N kB
+(14 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+(9 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB)
+   ->  Sort  (work_mem=N kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+(17 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory: N kB
+(8 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+         QUERY PLAN         
+----------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory: 0 kB
+(3 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                      workmem_filter                       
+-----------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                     workmem_filter                      
+---------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB)
+ Total Working Memory: N kB
+(3 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                      workmem_filter                      
+----------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+ Total Working Memory: N kB
+(4 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory: 0 kB
+(5 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory: N kB
+(5 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                       workmem_filter                       
+------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory: N kB
+(14 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                   workmem_filter                   
+----------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB)
+   ->  CTE Scan on q1  (work_mem=N kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory: N kB
+(11 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB)
+ Total Working Memory: N kB
+(6 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory: N kB
+(19 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+       workmem_filter       
+----------------------------
+ Result  (work_mem=N kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+(6 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+(9 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..5878f2aa4c4
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,303 @@
+----
+-- Tests that show "work_mem" output in EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

v03_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch (attachment)
From ea57eb88096287fe55251903081adced4d1f3bc4 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Thu, 20 Feb 2025 17:33:48 +0000
Subject: [PATCH 2/4] Store non-init-plan SubPlan objects in Plan list

We currently track SubPlan objects on Plans either via the plan->initPlan
list, for init plans, or via whatever expression contains the SubPlan, for
regular subplans.

A SubPlan object can itself use working memory, if it uses a hash table.
This hash table is associated with the SubPlan itself, and not with the
Plan to which the SubPlan points.

To allow us to assign working memory to an individual SubPlan, this commit
stores a link to the regular SubPlan, inside a new plan->subPlan list,
when we finalize the (parent) Plan whose expression contains the regular
SubPlan.

Unlike the existing plan->initPlan list, we will not use the new plan->
subPlan list to initialize SubPlan nodes -- that must be done when we
initialize the expression that contains the SubPlan. Instead, we will use
it, during InitPlan() but before ExecInitNode(), to assign a working-
memory limit to the SubPlan.
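
To illustrate the intent (this is a standalone model, not PostgreSQL source:
all struct and function names below are invented for the sketch), a walk over
the new per-node subPlan lists could hand a working-memory limit to each
hash-using SubPlan before node initialization, roughly like:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Standalone sketch of the plan->subPlan idea. A MockSubPlan stands in
 * for a SubPlan found inside a parent Plan's expressions; only the ones
 * that build a hash table consume working memory.
 */
typedef struct MockSubPlan
{
	int			useHashTable;	/* does this SubPlan build a hash table? */
	int			workMemKB;		/* limit assigned by the walker */
	struct MockSubPlan *next;
} MockSubPlan;

typedef struct MockPlan
{
	MockSubPlan *subPlan;		/* regular SubPlans found in expressions */
	struct MockPlan *lefttree;
	struct MockPlan *righttree;
} MockPlan;

/*
 * Recursively assign a per-subplan limit to every hash-table SubPlan in
 * the tree; returns how many SubPlans received a limit. In the real
 * patch this walk would happen during InitPlan(), before ExecInitNode().
 */
static int
assign_subplan_workmem(MockPlan *plan, int per_subplan_kb)
{
	int			count = 0;

	if (plan == NULL)
		return 0;
	for (MockSubPlan *sp = plan->subPlan; sp; sp = sp->next)
	{
		if (sp->useHashTable)
		{
			sp->workMemKB = per_subplan_kb;
			count++;
		}
	}
	count += assign_subplan_workmem(plan->lefttree, per_subplan_kb);
	count += assign_subplan_workmem(plan->righttree, per_subplan_kb);
	return count;
}
```

The point of keeping the list on the parent Plan (rather than only inside
its expressions) is exactly that this kind of pre-ExecInitNode walk becomes
possible without re-traversing every expression tree.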
---
 src/backend/optimizer/plan/setrefs.c | 284 +++++++++++++++++----------
 src/include/nodes/plannodes.h        |   2 +
 2 files changed, 177 insertions(+), 109 deletions(-)

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 999a5a8ab5a..8a4e77baa90 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -58,6 +58,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	int			rtoffset;
 	double		num_exec;
 } fix_scan_expr_context;
@@ -65,6 +66,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *outer_itlist;
 	indexed_tlist *inner_itlist;
 	Index		acceptable_rel;
@@ -76,6 +78,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *subplan_itlist;
 	int			newvarno;
 	int			rtoffset;
@@ -127,8 +130,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, plan, lst, rtoffset, num_exec) \
+	((List *) fix_scan_expr(root, plan, (Node *) (lst), rtoffset, num_exec))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -157,7 +160,7 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 										int rtoffset);
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
-static Node *fix_scan_expr(PlannerInfo *root, Node *node,
+static Node *fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node,
 						   int rtoffset, double num_exec);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
@@ -183,7 +186,7 @@ static Var *search_indexed_tlist_for_sortgroupref(Expr *node,
 												  Index sortgroupref,
 												  indexed_tlist *itlist,
 												  int newvarno);
-static List *fix_join_expr(PlannerInfo *root,
+static List *fix_join_expr(PlannerInfo *root, Plan *plan,
 						   List *clauses,
 						   indexed_tlist *outer_itlist,
 						   indexed_tlist *inner_itlist,
@@ -193,7 +196,7 @@ static List *fix_join_expr(PlannerInfo *root,
 						   double num_exec);
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
-static Node *fix_upper_expr(PlannerInfo *root,
+static Node *fix_upper_expr(PlannerInfo *root, Plan *plan,
 							Node *node,
 							indexed_tlist *subplan_itlist,
 							int newvarno,
@@ -202,7 +205,7 @@ static Node *fix_upper_expr(PlannerInfo *root,
 							double num_exec);
 static Node *fix_upper_expr_mutator(Node *node,
 									fix_upper_expr_context *context);
-static List *set_returning_clause_references(PlannerInfo *root,
+static List *set_returning_clause_references(PlannerInfo *root, Plan *plan,
 											 List *rlist,
 											 Plan *topplan,
 											 Index resultRelation,
@@ -633,10 +636,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -646,13 +649,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablesample = (TableSampleClause *)
-					fix_scan_expr(root, (Node *) splan->tablesample,
+					fix_scan_expr(root, plan, (Node *) splan->tablesample,
 								  rtoffset, 1);
 			}
 			break;
@@ -662,22 +665,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual,
+					fix_scan_list(root, plan, splan->indexqual,
 								  rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexorderby =
-					fix_scan_list(root, splan->indexorderby,
+					fix_scan_list(root, plan, splan->indexorderby,
 								  rtoffset, 1);
 				splan->indexorderbyorig =
-					fix_scan_list(root, splan->indexorderbyorig,
+					fix_scan_list(root, plan, splan->indexorderbyorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -697,9 +700,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, plan, splan->indexqual, rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -709,13 +712,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->bitmapqualorig =
-					fix_scan_list(root, splan->bitmapqualorig,
+					fix_scan_list(root, plan, splan->bitmapqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -725,13 +728,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidquals =
-					fix_scan_list(root, splan->tidquals,
+					fix_scan_list(root, plan, splan->tidquals,
 								  rtoffset, 1);
 			}
 			break;
@@ -741,13 +744,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidrangequals =
-					fix_scan_list(root, splan->tidrangequals,
+					fix_scan_list(root, plan, splan->tidrangequals,
 								  rtoffset, 1);
 			}
 			break;
@@ -762,13 +765,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, plan, splan->functions, rtoffset, 1);
 			}
 			break;
 		case T_TableFuncScan:
@@ -777,13 +780,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablefunc = (TableFunc *)
-					fix_scan_expr(root, (Node *) splan->tablefunc,
+					fix_scan_expr(root, plan, (Node *) splan->tablefunc,
 								  rtoffset, 1);
 			}
 			break;
@@ -793,13 +796,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->values_lists =
-					fix_scan_list(root, splan->values_lists,
+					fix_scan_list(root, plan, splan->values_lists,
 								  rtoffset, 1);
 			}
 			break;
@@ -809,10 +812,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -822,10 +825,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -835,10 +838,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -877,7 +880,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 */
 				set_dummy_tlist_references(plan, rtoffset);
 
-				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
+				mplan->param_exprs = fix_scan_list(root, plan, mplan->param_exprs,
 												   rtoffset,
 												   NUM_EXEC_TLIST(plan));
 				break;
@@ -939,9 +942,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitOffset, rtoffset, 1);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitCount, rtoffset, 1);
 			}
 			break;
 		case T_Agg:
@@ -994,14 +997,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
-				wplan->runCondition = fix_scan_list(root,
+					fix_scan_expr(root, plan, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root, plan,
 													wplan->runCondition,
 													rtoffset,
 													NUM_EXEC_TLIST(plan));
-				wplan->runConditionOrig = fix_scan_list(root,
+				wplan->runConditionOrig = fix_scan_list(root, plan,
 														wplan->runConditionOrig,
 														rtoffset,
 														NUM_EXEC_TLIST(plan));
@@ -1043,15 +1046,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					}
 
 					splan->plan.targetlist =
-						fix_scan_list(root, splan->plan.targetlist,
+						fix_scan_list(root, plan, splan->plan.targetlist,
 									  rtoffset, NUM_EXEC_TLIST(plan));
 					splan->plan.qual =
-						fix_scan_list(root, splan->plan.qual,
+						fix_scan_list(root, plan, splan->plan.qual,
 									  rtoffset, NUM_EXEC_QUAL(plan));
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->resconstantqual, rtoffset,
+								  1);
 			}
 			break;
 		case T_ProjectSet:
@@ -1066,7 +1070,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->withCheckOptionLists =
-					fix_scan_list(root, splan->withCheckOptionLists,
+					fix_scan_list(root, plan, splan->withCheckOptionLists,
 								  rtoffset, 1);
 
 				if (splan->returningLists)
@@ -1086,7 +1090,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						List	   *rlist = (List *) lfirst(lcrl);
 						Index		resultrel = lfirst_int(lcrr);
 
-						rlist = set_returning_clause_references(root,
+						rlist = set_returning_clause_references(root, plan,
 																rlist,
 																subplan,
 																resultrel,
@@ -1121,13 +1125,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					itlist = build_tlist_index(splan->exclRelTlist);
 
 					splan->onConflictSet =
-						fix_join_expr(root, splan->onConflictSet,
+						fix_join_expr(root, plan, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
 
 					splan->onConflictWhere = (Node *)
-						fix_join_expr(root, (List *) splan->onConflictWhere,
+						fix_join_expr(root, plan, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
@@ -1135,7 +1139,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, plan, splan->exclRelTlist, rtoffset, 1);
 				}
 
 				/*
@@ -1186,7 +1190,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 							MergeAction *action = (MergeAction *) lfirst(l);
 
 							/* Fix targetList of each action. */
-							action->targetList = fix_join_expr(root,
+							action->targetList = fix_join_expr(root, plan,
 															   action->targetList,
 															   NULL, itlist,
 															   resultrel,
@@ -1195,7 +1199,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   NUM_EXEC_TLIST(plan));
 
 							/* Fix quals too. */
-							action->qual = (Node *) fix_join_expr(root,
+							action->qual = (Node *) fix_join_expr(root, plan,
 																  (List *) action->qual,
 																  NULL, itlist,
 																  resultrel,
@@ -1206,7 +1210,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 						/* Fix join condition too. */
 						mergeJoinCondition = (Node *)
-							fix_join_expr(root,
+							fix_join_expr(root, plan,
 										  (List *) mergeJoinCondition,
 										  NULL, itlist,
 										  resultrel,
@@ -1353,7 +1357,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 
 	plan->scan.scanrelid += rtoffset;
 	plan->scan.plan.targetlist = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.targetlist,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1361,7 +1365,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_TLIST((Plan *) plan));
 	plan->scan.plan.qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.qual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1369,7 +1373,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	plan->recheckqual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->recheckqual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1377,13 +1381,13 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
-	plan->indexqual = fix_scan_list(root, plan->indexqual,
+	plan->indexqual = fix_scan_list(root, (Plan *) plan, plan->indexqual,
 									rtoffset, 1);
 	/* indexorderby is already transformed to reference index columns */
-	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
+	plan->indexorderby = fix_scan_list(root, (Plan *) plan, plan->indexorderby,
 									   rtoffset, 1);
 	/* indextlist must NOT be transformed to reference index columns */
-	plan->indextlist = fix_scan_list(root, plan->indextlist,
+	plan->indextlist = fix_scan_list(root, (Plan *) plan, plan->indextlist,
 									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 
 	pfree(index_itlist);
@@ -1430,10 +1434,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		 */
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
-			fix_scan_list(root, plan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 		plan->scan.plan.qual =
-			fix_scan_list(root, plan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
 
 		result = (Plan *) plan;
@@ -1599,7 +1603,7 @@ set_foreignscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(fscan->fdw_scan_tlist);
 
 		fscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1607,7 +1611,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1615,7 +1619,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1623,7 +1627,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_recheck_quals,
 						   itlist,
 						   INDEX_VAR,
@@ -1633,7 +1637,7 @@ set_foreignscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
-			fix_scan_list(root, fscan->fdw_scan_tlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 	}
 	else
@@ -1643,16 +1647,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 * way
 		 */
 		fscan->scan.plan.targetlist =
-			fix_scan_list(root, fscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual =
-			fix_scan_list(root, fscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs =
-			fix_scan_list(root, fscan->fdw_exprs,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals =
-			fix_scan_list(root, fscan->fdw_recheck_quals,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_recheck_quals,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 	}
 
@@ -1685,7 +1689,7 @@ set_customscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(cscan->custom_scan_tlist);
 
 		cscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1693,7 +1697,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1701,7 +1705,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->custom_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1711,20 +1715,20 @@ set_customscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
-			fix_scan_list(root, cscan->custom_scan_tlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
-			fix_scan_list(root, cscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual =
-			fix_scan_list(root, cscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs =
-			fix_scan_list(root, cscan->custom_exprs,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 	}
 
@@ -1752,7 +1756,8 @@ set_customscan_references(PlannerInfo *root,
  * startup time.
  */
 static int
-register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
+register_partpruneinfo(PlannerInfo *root, Plan *plan, int part_prune_index,
+					   int rtoffset)
 {
 	PlannerGlobal *glob = root->glob;
 	PartitionPruneInfo *pinfo;
@@ -1776,10 +1781,10 @@ register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
 
 			prelinfo->rtindex += rtoffset;
 			prelinfo->initial_pruning_steps =
-				fix_scan_list(root, prelinfo->initial_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->initial_pruning_steps,
 							  rtoffset, 1);
 			prelinfo->exec_pruning_steps =
-				fix_scan_list(root, prelinfo->exec_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->exec_pruning_steps,
 							  rtoffset, 1);
 
 			for (i = 0; i < prelinfo->nparts; i++)
@@ -1863,7 +1868,8 @@ set_append_references(PlannerInfo *root,
 	 */
 	if (aplan->part_prune_index >= 0)
 		aplan->part_prune_index =
-			register_partpruneinfo(root, aplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) aplan,
+								   aplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(aplan->plan.lefttree == NULL);
@@ -1931,7 +1937,8 @@ set_mergeappend_references(PlannerInfo *root,
 	 */
 	if (mplan->part_prune_index >= 0)
 		mplan->part_prune_index =
-			register_partpruneinfo(root, mplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) mplan,
+								   mplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(mplan->plan.lefttree == NULL);
@@ -1958,7 +1965,7 @@ set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	 */
 	outer_itlist = build_tlist_index(outer_plan->targetlist);
 	hplan->hashkeys = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) hplan->hashkeys,
 					   outer_itlist,
 					   OUTER_VAR,
@@ -2194,7 +2201,8 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * replacing Aggref nodes that should be replaced by initplan output Params,
  * choosing the best implementation for AlternativeSubPlans,
  * looking up operator opcode info for OpExpr and related nodes,
- * and adding OIDs from regclass Const nodes into root->glob->relationOids.
+ * adding OIDs from regclass Const nodes into root->glob->relationOids, and
+ * recording Subplans that use hash tables.
  *
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
@@ -2204,11 +2212,13 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node, int rtoffset,
+			  double num_exec)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
 
@@ -2299,8 +2309,21 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 															 (AlternativeSubPlan *) node,
 															 context->num_exec),
 									 context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_scan_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 static bool
@@ -2312,6 +2335,17 @@ fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this SubPlan so that we can assign working memory to it (if
+		 * needed).
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
 	return expression_tree_walker(node, fix_scan_expr_walker, context);
 }
 
@@ -2341,7 +2375,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * NestLoopParams now, because those couldn't refer to nullable
 	 * subexpressions.
 	 */
-	join->joinqual = fix_join_expr(root,
+	join->joinqual = fix_join_expr(root, (Plan *) join,
 								   join->joinqual,
 								   outer_itlist,
 								   inner_itlist,
@@ -2371,7 +2405,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 			 * make things match up perfectly seems well out of proportion to
 			 * the value.
 			 */
-			nlp->paramval = (Var *) fix_upper_expr(root,
+			nlp->paramval = (Var *) fix_upper_expr(root, (Plan *) join,
 												   (Node *) nlp->paramval,
 												   outer_itlist,
 												   OUTER_VAR,
@@ -2388,7 +2422,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		MergeJoin  *mj = (MergeJoin *) join;
 
-		mj->mergeclauses = fix_join_expr(root,
+		mj->mergeclauses = fix_join_expr(root, (Plan *) join,
 										 mj->mergeclauses,
 										 outer_itlist,
 										 inner_itlist,
@@ -2401,7 +2435,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		HashJoin   *hj = (HashJoin *) join;
 
-		hj->hashclauses = fix_join_expr(root,
+		hj->hashclauses = fix_join_expr(root, (Plan *) join,
 										hj->hashclauses,
 										outer_itlist,
 										inner_itlist,
@@ -2414,7 +2448,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 		 * HashJoin's hashkeys are used to look for matching tuples from its
 		 * outer plan (not the Hash node!) in the hashtable.
 		 */
-		hj->hashkeys = (List *) fix_upper_expr(root,
+		hj->hashkeys = (List *) fix_upper_expr(root, (Plan *) join,
 											   (Node *) hj->hashkeys,
 											   outer_itlist,
 											   OUTER_VAR,
@@ -2433,7 +2467,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * be, so we just tell fix_join_expr to accept superset nullingrels
 	 * matches instead of exact ones.
 	 */
-	join->plan.targetlist = fix_join_expr(root,
+	join->plan.targetlist = fix_join_expr(root, (Plan *) join,
 										  join->plan.targetlist,
 										  outer_itlist,
 										  inner_itlist,
@@ -2441,7 +2475,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
 										  NUM_EXEC_TLIST((Plan *) join));
-	join->plan.qual = fix_join_expr(root,
+	join->plan.qual = fix_join_expr(root, (Plan *) join,
 									join->plan.qual,
 									outer_itlist,
 									inner_itlist,
@@ -2519,7 +2553,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 													  subplan_itlist,
 													  OUTER_VAR);
 			if (!newexpr)
-				newexpr = fix_upper_expr(root,
+				newexpr = fix_upper_expr(root, plan,
 										 (Node *) tle->expr,
 										 subplan_itlist,
 										 OUTER_VAR,
@@ -2528,7 +2562,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 										 NUM_EXEC_TLIST(plan));
 		}
 		else
-			newexpr = fix_upper_expr(root,
+			newexpr = fix_upper_expr(root, plan,
 									 (Node *) tle->expr,
 									 subplan_itlist,
 									 OUTER_VAR,
@@ -2542,7 +2576,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	plan->targetlist = output_targetlist;
 
 	plan->qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) plan->qual,
 					   subplan_itlist,
 					   OUTER_VAR,
@@ -3081,6 +3115,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *    the source relation elements, outer_itlist = NULL and acceptable_rel
  *    the target relation.
  *
+ * 'plan' is the Plan node to which the clauses belong
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
  *		or NULL
@@ -3097,6 +3132,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  */
 static List *
 fix_join_expr(PlannerInfo *root,
+			  Plan *plan,
 			  List *clauses,
 			  indexed_tlist *outer_itlist,
 			  indexed_tlist *inner_itlist,
@@ -3108,6 +3144,7 @@ fix_join_expr(PlannerInfo *root,
 	fix_join_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.outer_itlist = outer_itlist;
 	context.inner_itlist = inner_itlist;
 	context.acceptable_rel = acceptable_rel;
@@ -3234,7 +3271,19 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 															 context->num_exec),
 									 context);
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_join_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_join_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3258,6 +3307,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * expensive, so we don't want to try it in the common case where the
  * subplan tlist is just a flattened list of Vars.)
  *
+ * 'plan': the Plan node to which the expression belongs
  * 'node': the tree to be fixed (a target item or qual)
  * 'subplan_itlist': indexed target list for subplan (or index)
  * 'newvarno': varno to use for Vars referencing tlist elements
@@ -3271,6 +3321,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  */
 static Node *
 fix_upper_expr(PlannerInfo *root,
+			   Plan *plan,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
 			   int newvarno,
@@ -3281,6 +3332,7 @@ fix_upper_expr(PlannerInfo *root,
 	fix_upper_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.subplan_itlist = subplan_itlist;
 	context.newvarno = newvarno;
 	context.rtoffset = rtoffset;
@@ -3358,8 +3410,21 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
 															  (AlternativeSubPlan *) node,
 															  context->num_exec),
 									  context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_upper_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_upper_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3377,9 +3442,10 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * We also must perform opcode lookup and add regclass OIDs to
  * root->glob->relationOids.
  *
+ * 'plan': the ModifyTable node itself
  * 'rlist': the RETURNING targetlist to be fixed
  * 'topplan': the top subplan node that will be just below the ModifyTable
- *		node (note it's not yet passed through set_plan_refs)
+ *		node
  * 'resultRelation': RT index of the associated result relation
  * 'rtoffset': how much to increment varnos by
  *
@@ -3391,7 +3457,7 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * Note: resultRelation is not yet adjusted by rtoffset.
  */
 static List *
-set_returning_clause_references(PlannerInfo *root,
+set_returning_clause_references(PlannerInfo *root, Plan *plan,
 								List *rlist,
 								Plan *topplan,
 								Index resultRelation,
@@ -3415,7 +3481,7 @@ set_returning_clause_references(PlannerInfo *root,
 	 */
 	itlist = build_tlist_index_other_vars(topplan->targetlist, resultRelation);
 
-	rlist = fix_join_expr(root,
+	rlist = fix_join_expr(root, plan,
 						  rlist,
 						  itlist,
 						  NULL,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 67da7f091b5..d3f8fd7bd6c 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -206,6 +206,8 @@ typedef struct Plan
 	struct Plan *righttree;
 	/* Init Plan nodes (un-correlated expr subselects) */
 	List	   *initPlan;
+	/* Regular SubPlan nodes (cf. "initPlan", above) */
+	List	   *subPlan;
 
 	/*
 	 * Information for management of parameter-change-driven rescanning
-- 
2.47.1

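The setrefs.c change above hooks into the expression mutators: after a node has been mutated (and possibly replaced), if it turned out to be a SubPlan, it is remembered on the owning Plan's new `subPlan` list so working memory can later be assigned to it. The shape of that change can be sketched in miniature — the structures and names below are illustrative stand-ins, not the real PostgreSQL node definitions:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the fix_*_expr_mutator change: walk an expression tree
 * bottom-up and, after mutating each node, record any SubPlan on the
 * owning plan's subPlan list.  All types here are simplified stand-ins. */
typedef enum { NODE_VAR, NODE_SUBPLAN } NodeTag;

typedef struct Node
{
    NodeTag     tag;
    struct Node *left;          /* child expressions, may be NULL */
    struct Node *right;
} Node;

typedef struct List
{
    Node       *item;
    struct List *next;
} List;

typedef struct Plan
{
    List       *subPlan;        /* SubPlans found in this plan's exprs */
} Plan;

static List *
lappend(List *list, Node *item)
{
    List       *cell = malloc(sizeof(List));

    cell->item = item;
    cell->next = list;          /* prepend; order doesn't matter here */
    return cell;
}

/* Analogue of fix_join_expr_mutator()/fix_upper_expr_mutator(): mutate
 * children first, then check the (possibly replaced) node itself. */
static Node *
mutate(Node *node, Plan *plan)
{
    if (node == NULL)
        return NULL;
    node->left = mutate(node->left, plan);
    node->right = mutate(node->right, plan);
    if (node->tag == NODE_SUBPLAN && plan != NULL)
        plan->subPlan = lappend(plan->subPlan, node);
    return node;
}
```

The key point the patch relies on is checking the node *after* `expression_tree_mutator()` returns, so that a SubPlan substituted in by the mutation itself is still captured.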
Attachment: v03_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch
From a7a8eeeb2ccebd765b704ff2e86f7769cd359531 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:16:22 +0000
Subject: [PATCH 3/4] EXPLAIN (WORK_MEM ON) now shows working memory limit

This commit moves the working-memory limit that an executor node checks at
runtime from the "work_mem" and "hash_mem_multiplier" GUCs to a new field,
"workmem_limit", added to the Plan node. To preserve backward
compatibility, it also copies the values derived from these GUCs into the
new field.

The field lives on the Plan node, instead of the PlanState, because it must
be set before we can call ExecInitNode(): many PlanStates consult their
working-memory limit while creating their data structures, during
initialization. So the field belongs to the Plan node, but is set between
the planning and execution phases.

Also modifies "EXPLAIN (WORK_MEM ON)" so that it displays this
working-memory limit.
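In miniature, the runtime side of this scheme looks like the following. (Illustrative sketch only, using hypothetical stand-in types and GUC variables, not the actual PostgreSQL code; the real ExecAssignWorkMem() dispatches per node type and handles multiple data structures per node.)

```c
#include <assert.h>

/* Stand-ins for the GUCs, in kB. */
static int    work_mem = 4096;
static double hash_mem_multiplier = 2.0;

/* Simplified Plan: just the fields this sketch needs. */
typedef struct Plan
{
    int         workmem;        /* planner's usage estimate, in kB */
    int         workmem_limit;  /* limit to enforce at runtime, in kB */
    int         is_hash_based;  /* does this node use a hashtable? */
} Plan;

/* Analogue of ExecAssignWorkMem(): called between planning and
 * execution, before ExecInitNode(), so that node initialization can
 * read plan->workmem_limit instead of consulting the GUCs directly. */
static void
assign_workmem(Plan *plan)
{
    if (plan->workmem <= 0)
        plan->workmem_limit = 0;    /* node needs no working memory */
    else if (plan->is_hash_based)
        plan->workmem_limit = (int) (work_mem * hash_mem_multiplier);
    else
        plan->workmem_limit = work_mem;
}
```

With the GUC-derived values copied in this way, existing behavior is preserved, while a future policy (such as the proposed query_work_mem distribution) only has to change the assignment step, not every executor node.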
---
 src/backend/commands/explain.c             |  59 ++++-
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         | 281 +++++++++++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  69 +++--
 src/backend/executor/nodeBitmapIndexscan.c |   3 +-
 src/backend/executor/nodeBitmapOr.c        |   3 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  23 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  12 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |   2 +
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  15 +-
 src/include/commands/explain.h             |   1 +
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   5 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/plannodes.h              |   8 +-
 src/include/nodes/primnodes.h              |   2 +
 src/test/regress/expected/workmem.out      | 184 ++++++++------
 31 files changed, 577 insertions(+), 151 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e09d7f868c9..07c6d34764b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -180,8 +180,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
-static void compute_subplan_workmem(List *plans, double *workmem);
-static void compute_agg_workmem(Agg *agg, double *workmem);
+static void compute_subplan_workmem(List *plans, double *workmem, double *limit);
+static void compute_agg_workmem(Agg *agg, double *workmem, double *limit);
 
 
 
@@ -843,6 +843,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 	{
 		ExplainPropertyFloat("Total Working Memory", "kB",
 							 es->total_workmem, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
 	}
 
 	ExplainCloseGroup("Query", NULL, true, es);
@@ -1983,19 +1985,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->work_mem)
 	{
 		double		plan_workmem = 0.0;
+		double		plan_limit = 0.0;
 
 		/*
 		 * Include working memory used by this Plan's SubPlan objects, whether
 		 * they are included on the Plan's initPlan or subPlan lists.
 		 */
-		compute_subplan_workmem(planstate->initPlan, &plan_workmem);
-		compute_subplan_workmem(planstate->subPlan, &plan_workmem);
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem, &plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem, &plan_limit);
 
 		/* Include working memory used by this Plan, itself. */
 		switch (nodeTag(plan))
 		{
 			case T_Agg:
-				compute_agg_workmem((Agg *) plan, &plan_workmem);
+				compute_agg_workmem((Agg *) plan, &plan_workmem, &plan_limit);
 				break;
 			case T_FunctionScan:
 				{
@@ -2003,6 +2006,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 					plan_workmem += (double) plan->workmem *
 						list_length(fscan->functions);
+					plan_limit += (double) plan->workmem_limit *
+						list_length(fscan->functions);
 					break;
 				}
 			case T_IncrementalSort:
@@ -2011,7 +2016,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				 * IncrementalSort creates two Tuplestores, each of
 				 * (estimated) size workmem.
 				 */
-				plan_workmem = (double) plan->workmem * 2;
+				plan_workmem += (double) plan->workmem * 2;
+				plan_limit += (double) plan->workmem_limit * 2;
 				break;
 			case T_RecursiveUnion:
 				{
@@ -2024,11 +2030,15 @@ ExplainNode(PlanState *planstate, List *ancestors,
 					 */
 					plan_workmem += (double) plan->workmem * 2 +
 						runion->hashWorkMem;
+					plan_limit += (double) plan->workmem_limit * 2 +
+						runion->hashWorkMemLimit;
 					break;
 				}
 			default:
 				if (plan->workmem > 0)
 					plan_workmem += plan->workmem;
+				if (plan->workmem_limit > 0)
+					plan_limit += plan->workmem_limit;
 				break;
 		}
 
@@ -2037,17 +2047,23 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		 * working memory.
 		 */
 		plan_workmem *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
 
 		es->total_workmem += plan_workmem;
+		es->total_workmem_limit += plan_limit;
 
-		if (plan_workmem > 0.0)
+		if (plan_workmem > 0.0 || plan_limit > 0.0)
 		{
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-				appendStringInfo(es->str, "  (work_mem=%.0f kB)",
-								 plan_workmem);
+				appendStringInfo(es->str, "  (work_mem=%.0f kB limit=%.0f kB)",
+								 plan_workmem, plan_limit);
 			else
+			{
 				ExplainPropertyFloat("Working Memory", "kB",
 									 plan_workmem, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
 		}
 	}
 
@@ -6062,29 +6078,39 @@ GetSerializationMetrics(DestReceiver *dest)
  * increments work_mem counters to include the SubPlan's working-memory.
  */
 static void
-compute_subplan_workmem(List *plans, double *workmem)
+compute_subplan_workmem(List *plans, double *workmem, double *limit)
 {
 	foreach_node(SubPlanState, sps, plans)
 	{
 		SubPlan    *sp = sps->subplan;
 
 		if (sp->hashtab_workmem > 0)
+		{
 			*workmem += sp->hashtab_workmem;
+			*limit += sp->hashtab_workmem_limit;
+		}
 
 		if (sp->hashnul_workmem > 0)
+		{
 			*workmem += sp->hashnul_workmem;
+			*limit += sp->hashnul_workmem_limit;
+		}
 	}
 }
 
-/* Compute an Agg's working memory estimate. */
+/* Compute an Agg's working memory estimate and limit. */
 typedef struct AggWorkMem
 {
 	double		input_sort_workmem;
+	double		input_sort_limit;
 
 	double		output_hash_workmem;
+	double		output_hash_limit;
 
 	int			num_sort_nodes;
+
 	double		max_output_sort_workmem;
+	double		output_sort_limit;
 }			AggWorkMem;
 
 static void
@@ -6092,6 +6118,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 {
 	/* Record memory used for input sort buffers. */
 	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+	mem->input_sort_limit += (double) agg->numSorts * agg->sortWorkMemLimit;
 
 	/* Record memory used for output data structures. */
 	switch (agg->aggstrategy)
@@ -6102,6 +6129,9 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			mem->max_output_sort_workmem =
 				Max(mem->max_output_sort_workmem, agg->plan.workmem);
 
+			if (mem->output_sort_limit == 0)
+				mem->output_sort_limit = agg->plan.workmem_limit;
+
 			++mem->num_sort_nodes;
 			break;
 		case AGG_HASHED:
@@ -6112,6 +6142,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
 			 * lifetime of the Agg.
 			 */
 			mem->output_hash_workmem += agg->plan.workmem;
+			mem->output_hash_limit += agg->plan.workmem_limit;
 			break;
 		default:
 
@@ -6135,7 +6166,7 @@ compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
  * value on the main Agg node.
  */
 static void
-compute_agg_workmem(Agg *agg, double *workmem)
+compute_agg_workmem(Agg *agg, double *workmem, double *limit)
 {
 	AggWorkMem	mem;
 	ListCell   *lc;
@@ -6153,9 +6184,13 @@ compute_agg_workmem(Agg *agg, double *workmem)
 	}
 
 	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+	*limit = mem.input_sort_limit + mem.output_hash_limit;
 
 	/* We'll have at most two sort buffers alive, at any time. */
 	*workmem += mem.num_sort_nodes > 2 ?
 		mem.max_output_sort_workmem * 2.0 :
 		mem.max_output_sort_workmem;
+	*limit += mem.num_sort_nodes > 2 ?
+		mem.output_sort_limit * 2.0 :
+		mem.output_sort_limit;
 }
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..c513b90fc77
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		what working-memory limit was imposed on it.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off its
+ *		corresponding Plan node's "workmem_limit" field. And we populate that
+ *		field by calling ExecAssignWorkMem(), from InitPlan(), before we
+ *		initialize the PlanState nodes.
+ *
+ * 		The "workmem_limit" field is a limit "per data structure," rather than
+ *		"per PlanState". This is needed because some SQL operators (e.g.,
+ *		RecursiveUnion and Agg) require multiple data structures, and sometimes
+ *		the data structures don't all share the same memory requirement. So we
+ *		cannot always just divide a "per PlanState" limit among individual data
+ *		structures. Instead, we maintain the limits on the data structures (and
+ *		EXPLAIN, for example, sums them up into a single, human-readable
+ *		number).
+ *
+ *		Note that the *Path's* "workmem" estimate is per SQL operator, but when
+ *		we convert that Path to a Plan we also break its "workmem" estimate
+ *		down into per-data structure estimates. Some operators therefore
+ *		require additional "limit" fields, which we add to the corresponding
+ *		Plan.
+ *
+ *		We store the "workmem_limit" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "limit" fields on the Plan, but
+ *		set them when we start execution, as part of InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+
+
+/* decls for local routines only used within this module */
+static void assign_workmem_subplan(SubPlan *subplan);
+static void assign_workmem_plan(Plan *plan);
+static void assign_workmem_agg(Agg *agg);
+static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+									bool *is_first_sort);
+
+/* end of local decls */
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Recursively assigns working memory to any Plans or SubPlans that need
+ *		it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	/* Assign working memory to the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			assign_workmem_plan(plan);
+	}
+
+	/* And assign working memory to the main Plan tree. */
+	assign_workmem_plan(plannedstmt->planTree);
+}
+
+static void
+assign_workmem_subplan(SubPlan *subplan)
+{
+	subplan->hashtab_workmem_limit = subplan->useHashTable ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+
+	subplan->hashnul_workmem_limit =
+		subplan->useHashTable && !subplan->unknownEqFalse ?
+		normalize_workmem(get_hash_memory_limit()) : 0;
+}
+
+static void
+assign_workmem_plan(Plan *plan)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Assign working memory to this node's (hashed) SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		assign_workmem_subplan(subplan);
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		assign_workmem_subplan(subplan);
+
+	/* Assign working memory to this node. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_FunctionScan:
+		case T_IncrementalSort:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+			if (plan->workmem > 0)
+				plan->workmem_limit = work_mem;
+			break;
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0)
+				plan->workmem_limit =
+					normalize_workmem(get_hash_memory_limit());
+			break;
+		case T_Agg:
+			assign_workmem_agg((Agg *) plan);
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+				plan->workmem_limit = work_mem;
+
+				if (runion->numCols > 0)
+				{
+					/* Also include memory for hash table. */
+					runion->hashWorkMemLimit =
+						normalize_workmem(get_hash_memory_limit());
+				}
+
+				break;
+			}
+		default:
+			Assert(plan->workmem == 0);
+			plan->workmem_limit = 0;
+			break;
+	}
+
+	/*
+	 * Assign working memory to this node's children. (Logic copied from
+	 * ExplainNode().)
+	 */
+	if (outerPlan(plan))
+		assign_workmem_plan(outerPlan(plan));
+
+	if (innerPlan(plan))
+		assign_workmem_plan(innerPlan(plan));
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				assign_workmem_plan(child);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_SubqueryScan:
+			assign_workmem_plan(((SubqueryScan *) plan)->subplan);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				assign_workmem_plan(child);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+assign_workmem_agg(Agg *agg)
+{
+	bool		is_first_sort = true;
+
+	/* Assign working memory to the main Agg node. */
+	assign_workmem_agg_node(agg,
+							true /* is_first */ ,
+							agg->chain == NULL /* is_last */ ,
+							&is_first_sort);
+
+	/* Assign working memory to any other grouping sets. */
+	foreach_node(Agg, aggnode, agg->chain)
+	{
+		assign_workmem_agg_node(aggnode,
+								false /* is_first */ ,
+								foreach_current_index(aggnode) ==
+								list_length(agg->chain) - 1 /* is_last */ ,
+								&is_first_sort);
+	}
+}
+
+static void
+assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+						bool *is_first_sort)
+{
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * Because nodeAgg.c will combine all AGG_HASHED nodes into a
+			 * single phase, it's easier to store the hash working-memory
+			 * limit on the first AGG_{HASHED,MIXED} node, and set it to zero
+			 * for all subsequent AGG_HASHED nodes.
+			 */
+			agg->plan.workmem_limit = is_first ?
+				normalize_workmem(get_hash_memory_limit()) : 0;
+			break;
+		case AGG_SORTED:
+
+			/*
+			 * Also store the sort-output working-memory limit on the first
+			 * AGG_SORTED node, and set it to zero for all subsequent
+			 * AGG_SORTED nodes.
+			 *
+			 * We'll need working memory to hold the "sort_out" only if this
+			 * isn't the last Agg node (if it is the last, there is no later
+			 * phase that needs our output re-sorted).
+			 */
+			agg->plan.workmem_limit = *is_first_sort && !is_last ?
+				work_mem : 0;
+
+			*is_first_sort = false;
+			break;
+		default:
+			break;
+	}
+
+	/* Also include memory needed to sort the input: */
+	if (agg->numSorts > 0)
+	{
+		Assert(agg->sortWorkMem > 0);
+
+		agg->sortWorkMemLimit = work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..9e5bcf7ada4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -431,9 +434,10 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 									   int64 input_tuples, double input_card,
 									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
-static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
-							   int used_bits, double input_groups,
-							   double hashentrysize);
+static void hashagg_spill_init(HashAggSpill *spill,
+							   LogicalTapeSet *tapeset, int used_bits,
+							   double input_groups, double hashentrysize,
+							   Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +525,14 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem_limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit = aggstate->phases[1].aggnode->plan.workmem_limit;
+		Assert(workmem_limit > 0);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -577,7 +589,7 @@ fetch_input_tuple(AggState *aggstate)
  */
 static void
 initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
-					 AggStatePerGroup pergroupstate)
+					 AggStatePerGroup pergroupstate, size_t workMem)
 {
 	/*
 	 * Start a fresh sort operation for each DISTINCT/ORDER BY aggregate.
@@ -591,6 +603,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		Assert(workMem > 0);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +619,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workMem, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +629,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workMem, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -687,7 +700,8 @@ initialize_aggregates(AggState *aggstate,
 			AggStatePerTrans pertrans = &transstates[transno];
 			AggStatePerGroup pergroupstate = &pergroup[transno];
 
-			initialize_aggregate(aggstate, pertrans, pergroupstate);
+			initialize_aggregate(aggstate, pertrans, pergroupstate,
+								 aggstate->phase->aggnode->sortWorkMemLimit);
 		}
 	}
 }
@@ -1498,7 +1512,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1522,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1552,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1821,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1845,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1943,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 		}
 	}
 }
@@ -2014,9 +2031,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2095,7 +2112,8 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
 		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
 		AggStatePerGroup pergroupstate = &pergroup[transno];
 
-		initialize_aggregate(aggstate, pertrans, pergroupstate);
+		initialize_aggregate(aggstate, pertrans, pergroupstate,
+							 aggstate->phase->aggnode->sortWorkMemLimit);
 	}
 }
 
@@ -2156,7 +2174,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2649,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2739,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2938,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..5e006baa88d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		Assert(node->ss.ps.plan->workmem_limit > 0);
+		tbm = tbm_create((Size) node->ss.ps.plan->workmem_limit * 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..4ba32639f7d 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,8 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				Assert(subnode->plan->workmem_limit > 0);
+				result = tbm_create((Size) subnode->plan->workmem_limit * 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..2272185dce7 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, node->scan.plan.workmem_limit);
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..bbb93a8dd58 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d54cfe5fdbe..60afda04069 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -38,6 +38,7 @@
 #include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
@@ -449,6 +450,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -473,8 +475,12 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) node->plan.workmem_limit * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
+							worker_space_allowed,
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
@@ -601,6 +607,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,9 +665,10 @@ ExecHashTableCreate(HashState *state)
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+						size_t worker_space_allowed,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs,
@@ -690,9 +698,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -709,7 +717,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -813,8 +821,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		if (try_combined_hash_mem)
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
-									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed, false,
+									parallel_workers,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs,
@@ -1242,7 +1251,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..503d75e364b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												plannode->sort.plan.workmem_limit,
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  plannode->sort.plan.workmem_limit,
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..10f764c1bd5 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, node->ss.ps.plan->workmem_limit);
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..a3fc37745ca 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) node->plan.workmem_limit * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..96dc8d53db3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -52,6 +52,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) node->hashWorkMemLimit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +203,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
+	rustate->intermediate_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..7b71adf05dc 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) node->plan.workmem_limit * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..1da77ab1d6a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   plannode->plan.workmem_limit,
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  plannode->plan.workmem_limit,
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..73214501238 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -546,6 +546,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) subplan->hashtab_workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
@@ -575,6 +576,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) subplan->hashnul_workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..8a9e534a743 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 tstate->ss.ps.plan->workmem_limit);
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..76819d140ba 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 node->plan.workmem_limit);
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7c1fdde842b..fecea810b6e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1119,7 +1119,6 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 
-
 	/*
 	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
 	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
@@ -2875,7 +2874,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4323,6 +4323,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
+							get_hash_memory_limit(),
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
 							&space_allowed,
@@ -4651,15 +4652,19 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Estimate working memory needed for the hashtable (and hashnulls, if
-		 * needed). The logic below MUST match the logic in buildSubPlanHash()
-		 * and ExecInitSubPlan().
+		 * needed). The "nbuckets" estimate must match the logic in
+		 * buildSubPlanHash() and ExecInitSubPlan().
 		 */
 		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
 		if (nbuckets < 1)
 			nbuckets = 1;
 
+		/*
+		 * This estimate must match the logic in subpath_is_hashable() (and
+		 * see comments there).
+		 */
 		hashentrysize = MAXALIGN(plan->plan_width) +
-			MAXALIGN(SizeofMinimalTupleHeader);
+			MAXALIGN(SizeofHeapTupleHeader);
 
 		subplan->hashtab_workmem =
 			normalize_workmem((double) nbuckets * hashentrysize);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 50454952eb2..498a1a3a4b6 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -72,6 +72,7 @@ typedef struct ExplainState
 								 * entry */
 	int			num_workers;	/* # of worker processes planned to use */
 	double		total_workmem;	/* total working memory estimate (in bytes) */
+	double		total_workmem_limit;	/* total working-memory limit (in kB) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* might be shared with other workers */
+	size_t		worker_space_allowed;	/* exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..728006b3ff5 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,8 +329,9 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
-								uint64 *ngroups_limit, int *num_partitions);
+								int used_bits, Size hash_mem_limit,
+								Size *mem_limit, uint64 *ngroups_limit,
+								int *num_partitions);
 
 /* parallel instrumentation support */
 extern void ExecAggEstimate(AggState *node, ParallelContext *pcxt);
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fc5b20994dd..6a40730c065 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -57,9 +57,10 @@ extern bool ExecParallelScanHashTableForUnmatched(HashJoinState *hjstate,
 extern void ExecHashTableReset(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
+									size_t worker_space_allowed,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d3f8fd7bd6c..445953c77d3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,7 @@ typedef struct Plan
 	Cost		total_cost;
 
 	int			workmem;		/* estimated work_mem (in KB) */
+	int			workmem_limit;	/* work_mem limit per parallel worker (in KB) */
 
 	/*
 	 * planner's estimate of result size of this plan step
@@ -237,7 +238,7 @@ typedef struct Plan
 
 /* ----------------
  *	 Result node -
- *		If no outer plan, evaluate a variable-free targetlist.
 *		If no outer plan, evaluate a variable-free targetlist.
  *		If outer plan, return tuples from outer plan (after a level of
  *		projection as shown by targetlist).
  *
@@ -433,6 +434,8 @@ typedef struct RecursiveUnion
 
 	/* estimated work_mem for hash table (in KB) */
 	int			hashWorkMem;
+	/* work_mem limit for hash table (in KB) */
+	int			hashWorkMemLimit;
 } RecursiveUnion;
 
 /* ----------------
@@ -1158,6 +1161,9 @@ typedef struct Agg
 	/* estimated work_mem needed to sort each input (in KB) */
 	int			sortWorkMem;
 
+	/* work_mem limit to sort one input (in KB) */
+	int			sortWorkMemLimit;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index b7d6b0fe7dc..7232d07e8b8 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1111,6 +1111,8 @@ typedef struct SubPlan
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
 	int			hashtab_workmem;	/* estimated hashtable work_mem (in KB) */
 	int			hashnul_workmem;	/* estimated hashnulls work_mem (in KB) */
+	int			hashtab_workmem_limit;	/* hashtable work_mem limit (in KB) */
+	int			hashnul_workmem_limit;	/* hashnulls work_mem limit (in KB) */
 } SubPlan;
 
 /*
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
index 215180808f4..c1a3bdd93d2 100644
--- a/src/test/regress/expected/workmem.out
+++ b/src/test/regress/expected/workmem.out
@@ -29,17 +29,18 @@ order by unique1;
 ');
                          workmem_filter                          
 -----------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(10 rows)
+ Total Working Memory Limit: 12288 kB
+(11 rows)
 
 select *
 from onek
@@ -64,18 +65,19 @@ order by unique1;
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- Sort  (work_mem=N kB)
+ Sort  (work_mem=N kB limit=4096 kB)
    Sort Key: onek.unique1
    ->  Nested Loop
          ->  Unique
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1, "*VALUES*".column2
                      ->  Values Scan on "*VALUES*"
          ->  Index Scan using onek_unique1 on onek
                Index Cond: (unique1 = "*VALUES*".column1)
                Filter: ("*VALUES*".column2 = ten)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 8192 kB
+(12 rows)
 
 select *
 from onek
@@ -95,17 +97,18 @@ explain (costs off, work_mem on)
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
 ');
-             workmem_filter              
------------------------------------------
+                    workmem_filter                     
+-------------------------------------------------------
  Limit
-   ->  Incremental Sort  (work_mem=N kB)
+   ->  Incremental Sort  (work_mem=N kB limit=8192 kB)
          Sort Key: tenk1.four, tenk1.ten
          Presorted Key: tenk1.four
-         ->  Sort  (work_mem=N kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: tenk1.four
                ->  Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 12288 kB
+(9 rows)
 
 select * from (select * from tenk1 order by four) t order by four, ten
 limit 1;
@@ -131,16 +134,17 @@ where exists (select 1 from tenk1 t3
    ->  Nested Loop
          ->  Hash Join
                Hash Cond: (t3.thousand = t1.unique1)
-               ->  HashAggregate  (work_mem=N kB)
+               ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                      Group Key: t3.thousand, t3.tenthous
                      ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
-               ->  Hash  (work_mem=N kB)
+               ->  Hash  (work_mem=N kB limit=8192 kB)
                      ->  Index Only Scan using onek_unique1 on onek t1
                            Index Cond: (unique1 < 1)
          ->  Index Only Scan using tenk1_hundred on tenk1 t2
                Index Cond: (hundred = t3.tenthous)
  Total Working Memory: N kB
-(13 rows)
+ Total Working Memory Limit: 16384 kB
+(14 rows)
 
 select count(*) from (
 select t1.unique1, t2.hundred
@@ -165,23 +169,24 @@ from int4_tbl t1, int4_tbl t2
 where t4.f1 is null
 ) t;
 ');
-                       workmem_filter                        
--------------------------------------------------------------
+                              workmem_filter                              
+--------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Nested Loop Left Join
                Filter: (t4.f1 IS NULL)
                ->  Seq Scan on int4_tbl t2
-               ->  Materialize  (work_mem=N kB)
+               ->  Materialize  (work_mem=N kB limit=4096 kB)
                      ->  Nested Loop Left Join
                            Join Filter: (t3.f1 > 1)
                            ->  Seq Scan on int4_tbl t3
                                  Filter: (f1 > 0)
-                           ->  Materialize  (work_mem=N kB)
+                           ->  Materialize  (work_mem=N kB limit=4096 kB)
                                  ->  Seq Scan on int4_tbl t4
          ->  Seq Scan on int4_tbl t1
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 8192 kB
+(15 rows)
 
 select count(*) from (
 select t1.f1
@@ -204,16 +209,17 @@ group by grouping sets((a, b), (a));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  HashAggregate  (work_mem=N kB)
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
                Hash Key: "*VALUES*".column1, "*VALUES*".column2
                Hash Key: "*VALUES*".column1
                ->  Values Scan on "*VALUES*"
                      Filter: (column1 = column2)
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 16384 kB
+(10 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1), (2, 2)) as t (a, b) where a = b
@@ -236,10 +242,10 @@ group by grouping sets((a, b), (a), (b), (c), (d));
 ');
                             workmem_filter                            
 ----------------------------------------------------------------------
- WindowAgg  (work_mem=N kB)
-   ->  Sort  (work_mem=N kB)
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
          Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
-         ->  GroupAggregate  (work_mem=N kB)
+         ->  GroupAggregate  (work_mem=N kB limit=8192 kB)
                Group Key: "*VALUES*".column1, "*VALUES*".column2
                Group Key: "*VALUES*".column1
                Sort Key: "*VALUES*".column2
@@ -248,12 +254,13 @@ group by grouping sets((a, b), (a), (b), (c), (d));
                  Group Key: "*VALUES*".column3
                Sort Key: "*VALUES*".column4
                  Group Key: "*VALUES*".column4
-               ->  Sort  (work_mem=N kB)
+               ->  Sort  (work_mem=N kB limit=4096 kB)
                      Sort Key: "*VALUES*".column1
                      ->  Values Scan on "*VALUES*"
                            Filter: (column1 = column2)
  Total Working Memory: N kB
-(17 rows)
+ Total Working Memory Limit: 20480 kB
+(18 rows)
 
 select a, b, row_number() over (order by a, b nulls first)
 from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
@@ -282,17 +289,18 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select length(stringu1) from tenk1 group by length(stringu1);
 ');
-                   workmem_filter                   
-----------------------------------------------------
- Finalize HashAggregate  (work_mem=N kB)
+                          workmem_filter                           
+-------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB limit=8192 kB)
    Group Key: (length((stringu1)::text))
    ->  Gather
          Workers Planned: 4
-         ->  Partial HashAggregate  (work_mem=N kB)
+         ->  Partial HashAggregate  (work_mem=N kB limit=40960 kB)
                Group Key: length((stringu1)::text)
                ->  Parallel Seq Scan on tenk1
  Total Working Memory: N kB
-(8 rows)
+ Total Working Memory Limit: 49152 kB
+(9 rows)
 
 select length(stringu1) from tenk1 group by length(stringu1);
  length 
@@ -307,12 +315,13 @@ reset max_parallel_workers_per_gather;
 -- Agg (simple) [no work_mem]
 explain (costs off, work_mem on)
 select MAX(length(stringu1)) from tenk1;
-         QUERY PLAN         
-----------------------------
+            QUERY PLAN            
+----------------------------------
  Aggregate
    ->  Seq Scan on tenk1
  Total Working Memory: 0 kB
-(3 rows)
+ Total Working Memory Limit: 0 kB
+(4 rows)
 
 select MAX(length(stringu1)) from tenk1;
  max 
@@ -328,12 +337,13 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 ) t;
 ');
-                      workmem_filter                       
------------------------------------------------------------
+                             workmem_filter                              
+-------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series a  (work_mem=N kB)
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 4096 kB
+(4 rows)
 
 select count(*) from (
 select sum(n) over(partition by m)
@@ -352,12 +362,13 @@ from rows from(generate_series(1, 5),
                generate_series(2, 10),
                generate_series(4, 15));
 ');
-                     workmem_filter                      
----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Aggregate
-   ->  Function Scan on generate_series  (work_mem=N kB)
+   ->  Function Scan on generate_series  (work_mem=N kB limit=12288 kB)
  Total Working Memory: N kB
-(3 rows)
+ Total Working Memory Limit: 12288 kB
+(4 rows)
 
 select count(*)
 from rows from(generate_series(1, 5),
@@ -384,13 +395,14 @@ SELECT  xmltable.*
                                   unit text PATH ''SIZE/@unit'',
                                   premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
 ');
-                      workmem_filter                      
-----------------------------------------------------------
+                             workmem_filter                             
+------------------------------------------------------------------------
  Nested Loop
    ->  Seq Scan on xmldata
-   ->  Table Function Scan on "xmltable"  (work_mem=N kB)
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(4 rows)
+ Total Working Memory Limit: 4096 kB
+(5 rows)
 
 SELECT  xmltable.*
    FROM (SELECT data FROM xmldata) x,
@@ -418,7 +430,8 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
    ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
          Filter: (unique2 <> 10)
  Total Working Memory: 0 kB
-(5 rows)
+ Total Working Memory Limit: 0 kB
+(6 rows)
 
 select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
  unique1 
@@ -435,11 +448,12 @@ select count(*) from
                           workmem_filter                          
 ------------------------------------------------------------------
  Aggregate
-   ->  HashSetOp Intersect  (work_mem=N kB)
+   ->  HashSetOp Intersect  (work_mem=N kB limit=8192 kB)
          ->  Seq Scan on tenk1
          ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
  Total Working Memory: N kB
-(5 rows)
+ Total Working Memory Limit: 8192 kB
+(6 rows)
 
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -456,23 +470,24 @@ cross join lateral (with recursive x(a) as (
           select o.four as a union select a + 1 from x where a < 10)
     select * from x) ss where o.ten = 1;
 ');
-                       workmem_filter                       
-------------------------------------------------------------
+                              workmem_filter                               
+---------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
                Filter: (ten = 1)
-         ->  Memoize  (work_mem=N kB)
+         ->  Memoize  (work_mem=N kB limit=8192 kB)
                Cache Key: o.four
                Cache Mode: binary
-               ->  CTE Scan on x  (work_mem=N kB)
+               ->  CTE Scan on x  (work_mem=N kB limit=4096 kB)
                      CTE x
-                       ->  Recursive Union  (work_mem=N kB)
+                       ->  Recursive Union  (work_mem=N kB limit=16384 kB)
                              ->  Result
                              ->  WorkTable Scan on x x_1
                                    Filter: (a < 10)
  Total Working Memory: N kB
-(14 rows)
+ Total Working Memory Limit: 28672 kB
+(15 rows)
 
 select sum(o.four), sum(ss.a) from onek o
 cross join lateral (with recursive x(a) as (
@@ -491,20 +506,21 @@ WITH q1(x,y) AS (
   )
 SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
 ');
-                   workmem_filter                   
-----------------------------------------------------
+                          workmem_filter                          
+------------------------------------------------------------------
  Aggregate
    CTE q1
-     ->  HashAggregate  (work_mem=N kB)
+     ->  HashAggregate  (work_mem=N kB limit=8192 kB)
            Group Key: tenk1.hundred
            ->  Seq Scan on tenk1
    InitPlan 2
      ->  Aggregate
-           ->  CTE Scan on q1 qsub  (work_mem=N kB)
-   ->  CTE Scan on q1  (work_mem=N kB)
+           ->  CTE Scan on q1 qsub  (work_mem=N kB limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB limit=4096 kB)
          Filter: ((y)::numeric > (InitPlan 2).col1)
  Total Working Memory: N kB
-(11 rows)
+ Total Working Memory Limit: 16384 kB
+(12 rows)
 
 WITH q1(x,y) AS (
     SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
@@ -522,15 +538,16 @@ select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
 limit 5;
 ');
-                            workmem_filter                             
------------------------------------------------------------------------
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
  Limit
-   ->  WindowAgg  (work_mem=N kB)
-         ->  Sort  (work_mem=N kB)
+   ->  WindowAgg  (work_mem=N kB limit=4096 kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
                Sort Key: ((a.n < 3))
-               ->  Function Scan on generate_series a  (work_mem=N kB)
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 12288 kB
+(7 rows)
 
 select sum(n) over(partition by m)
 from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
@@ -560,20 +577,21 @@ select * from tenk1 a join tenk1 b on
          ->  Bitmap Heap Scan on tenk1 b
                Recheck Cond: ((hundred = 4) OR (unique1 = 2))
                ->  BitmapOr
-                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB limit=4096 kB)
                            Index Cond: (hundred = 4)
-                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                            Index Cond: (unique1 = 2)
-         ->  Materialize  (work_mem=N kB)
+         ->  Materialize  (work_mem=N kB limit=4096 kB)
                ->  Bitmap Heap Scan on tenk1 a
                      Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
                      ->  BitmapOr
-                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique2 = 3)
-                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
                                  Index Cond: (unique1 = 1)
  Total Working Memory: N kB
-(19 rows)
+ Total Working Memory Limit: 20480 kB
+(20 rows)
 
 select count(*) from (
 select * from tenk1 a join tenk1 b on
@@ -589,15 +607,16 @@ select workmem_filter('
 explain (costs off, work_mem on)
 select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
 ');
-       workmem_filter       
-----------------------------
- Result  (work_mem=N kB)
+             workmem_filter             
+----------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 1
      ->  Append
            ->  Result
            ->  Result
  Total Working Memory: N kB
-(6 rows)
+ Total Working Memory Limit: 16384 kB
+(7 rows)
 
 select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
  ?column? 
@@ -612,16 +631,17 @@ select 1 = any (select (select 1) where 1 = any (select 1));
 ');
                          workmem_filter                         
 ----------------------------------------------------------------
- Result  (work_mem=N kB)
+ Result  (work_mem=N kB limit=16384 kB)
    SubPlan 3
-     ->  Result  (work_mem=N kB)
+     ->  Result  (work_mem=N kB limit=8192 kB)
            One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
            InitPlan 1
              ->  Result
            SubPlan 2
              ->  Result
  Total Working Memory: N kB
-(9 rows)
+ Total Working Memory Limit: 24576 kB
+(10 rows)
 
 select 1 = any (select (select 1) where 1 = any (select 1));
  ?column? 
-- 
2.47.1

Attachment: v03_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream)
From 2357a65a9164747c49c0511f8ac6350133ee4c94 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:41:31 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 655 ++++++++++++++++++++++++++
 src/backend/executor/execWorkmem.c   |  37 +-
 src/include/executor/executor.h      |   4 +
 8 files changed, 1717 insertions(+), 10 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..a2c6d3be4d2
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=25600 kB)
+   ->  Sort  (work_mem=N kB limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20480 kB)
+   ->  Sort  (work_mem=N kB limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=102400 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=102399 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                    
+--------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=34133 kB)
+         ->  Sort  (work_mem=N kB limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=34134 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=1024 kB)
+   ->  Sort  (work_mem=N kB limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=819 kB)
+   ->  Sort  (work_mem=N kB limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=820 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=4095 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=1365 kB)
+         ->  Sort  (work_mem=N kB limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=1366 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20 kB)
+   ->  Sort  (work_mem=N kB limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=16 kB)
+   ->  Sort  (work_mem=N kB limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=80 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=78 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                  workmem_filter                                   
+-----------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=26 kB)
+         ->  Sort  (work_mem=N kB limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=27 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+           workmem_filter            
+-------------------------------------
+ Result  (work_mem=N kB limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..e6dbc35bf10
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from the one in explain.sql. We can't reuse that
+-- function directly, since this test runs in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..eb62a65d7bc
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,655 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Distribute workmem.query_work_mem among a query's plan nodes.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to provide to "
+							"each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+/* Compute an Agg's working memory estimate and limit. */
+typedef struct AggWorkMem
+{
+	uint64		hash_workmem;
+	int		   *hash_limit;
+
+	int			num_sorts;
+	int			max_sort_workmem;
+	int		   *sort_limit;
+}			AggWorkMem;
+
+static void
+workmem_analyze_agg_node(Agg *agg, AggWorkMem * mem,
+						 WorkMemStats * workmem_stats)
+{
+	if (agg->sortWorkMem > 0 || agg->sortWorkMemLimit > 0)
+	{
+		/* Record memory used for input sort buffers. */
+		Target	   *target = make_target(agg->sortWorkMem,
+										 &agg->sortWorkMemLimit,
+										 agg->numSorts);
+
+		add_target(workmem_stats, target);
+	}
+
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			mem->hash_workmem += agg->plan.workmem;
+
+			/* Read hash limit from the first AGG_HASHED node. */
+			if (mem->hash_limit == NULL)
+				mem->hash_limit = &agg->plan.workmem_limit;
+
+			break;
+		case AGG_SORTED:
+
+			++mem->num_sorts;
+
+			mem->max_sort_workmem = Max(mem->max_sort_workmem, agg->plan.workmem);
+
+			/* Read sort limit from the first AGG_SORTED node. */
+			if (mem->sort_limit == NULL)
+				mem->sort_limit = &agg->plan.workmem_limit;
+
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze_agg(Agg *agg, int num_workers, WorkMemStats * workmem_stats)
+{
+	AggWorkMem	mem;
+
+	memset(&mem, 0, sizeof(mem));
+
+	/* Analyze main Agg node. */
+	workmem_analyze_agg_node(agg, &mem, workmem_stats);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		workmem_analyze_agg_node(aggnode, &mem, workmem_stats);
+
+	/*
+	 * Working memory for hash tables, if needed. All hash tables share the
+	 * same limit:
+	 */
+	if (mem.hash_workmem > 0 || mem.hash_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.hash_workmem, mem.hash_limit,
+						1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	/*
+	 * Working memory for (output) sort buffers, if needed. We'll need at most
+	 * 2 sort buffers:
+	 */
+	if (mem.max_sort_workmem > 0 || mem.sort_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.max_sort_workmem, mem.sort_limit,
+						Min(mem.num_sorts, 2) * (1 + num_workers));
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_subplan(SubPlan *subplan, int num_workers,
+						WorkMemStats * workmem_stats)
+{
+	if (subplan->hashtab_workmem > 0 || subplan->hashtab_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash table */
+		Target	   *target = make_target(subplan->hashtab_workmem,
+										 &subplan->hashtab_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	if (subplan->hashnul_workmem > 0 || subplan->hashnul_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash-NULL table */
+		Target	   *target = make_target(subplan->hashnul_workmem,
+										 &subplan->hashnul_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_plan(Plan *plan, int num_workers, WorkMemStats * workmem_stats)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Analyze this node's SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	if (IsA(plan, Gather) || IsA(plan, GatherMerge))
+	{
+		/*
+		 * Parallel query apparently does not run InitPlans in parallel.
+		 * Currently, Gather and GatherMerge Plan nodes don't contain any
+		 * quals, so they can't contain SubPlans at all; perhaps this check
+		 * should move below the SubPlan-analysis loop, as well. For now, to
+		 * maintain consistency with explain.c, we'll leave it here.
+		 */
+		Assert(num_workers == 0);
+
+		if (IsA(plan, Gather))
+			num_workers = ((Gather *) plan)->num_workers;
+		else
+			num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	/* Analyze this node's working memory. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 1 + num_workers);
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_Agg:
+			workmem_analyze_agg((Agg *) plan, num_workers, workmem_stats);
+			break;
+		case T_FunctionScan:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				int			nfuncs =
+					list_length(((FunctionScan *) plan)->functions);
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 nfuncs * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_IncrementalSort:
+			if (plan->workmem > 0 || plan->workmem_limit > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 2 * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+				Target	   *target;
+
+				/* working memory for two tuplestores */
+				target = make_target(plan->workmem, &plan->workmem_limit,
+									 2 * (1 + num_workers));
+				add_target(workmem_stats, target);
+
+				/* working memory for a hash table, if needed */
+				if (runion->hashWorkMem > 0 || runion->hashWorkMemLimit > 0)
+				{
+					target = make_target(runion->hashWorkMem,
+										 &runion->hashWorkMemLimit,
+										 1 + num_workers);
+					add_target(workmem_stats, target);
+				}
+			}
+			break;
+		default:
+			Assert(plan->workmem == 0);
+			Assert(plan->workmem_limit == 0);
+			break;
+	}
+
+	/* Now analyze this Plan's children. */
+	if (outerPlan(plan))
+		workmem_analyze_plan(outerPlan(plan), num_workers, workmem_stats);
+
+	if (innerPlan(plan))
+		workmem_analyze_plan(innerPlan(plan), num_workers, workmem_stats);
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_SubqueryScan:
+			workmem_analyze_plan(((SubqueryScan *) plan)->subplan,
+								 num_workers, workmem_stats);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	/* Analyze the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			workmem_analyze_plan(plan, 0 /* num_workers */ , workmem_stats);
+	}
+
+	/* Analyze the main Plan tree itself. */
+	workmem_analyze_plan(plannedstmt->planTree, 0 /* num_workers */ ,
+						 workmem_stats);
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit <= target->priority, meaning we restricted this
+			 * target to less memory than (we estimate) it needs, then the
+			 * target_ratio will stay the same, since, letting A = remaining,
+			 * B = limit, and R = ratio, we'll have:
+			 *
+			 * R=A/B <=> A=R*B <=> A-R*X = R*B - R*X <=> A-R*X = R * (B-X) <=>
+			 * R = (A-R*X) / (B-X)
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if target_limit > target->priority, so we didn't need to
+			 * restrict this target beyond its priority estimate, then the
+			 * target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	/*
+	 * Start by assigning default working memory to all of this query's Plan
+	 * nodes.
+	 */
+	standard_ExecAssignWorkMem(plannedstmt);
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) a->ptr_value)->workmem -
+			((Target *) b->ptr_value)->workmem;
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) b->ptr_value)->workmem -
+			((Target *) a->ptr_value)->workmem;
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index c513b90fc77..8a3e52c8968 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -57,6 +57,9 @@
 #include "optimizer/cost.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
 /* decls for local routines only used within this module */
 static void assign_workmem_subplan(SubPlan *subplan);
 static void assign_workmem_plan(Plan *plan);
@@ -81,16 +84,32 @@ static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We bail out here, in case the hook wants to re-assign memory on
+		 * parallel workers, and maybe wants to call
+		 * standard_ExecAssignWorkMem() first, as well.
+		 */
+		if (IsParallelWorker())
+			return;
 
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
 	/* Assign working memory to the Plans referred to by SubPlan objects. */
 	foreach_ptr(Plan, plan, plannedstmt->subplans)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

#18Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#15)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

I didn't look at the details yet. But from:

/messages/by-id/CAJVSvF7x_DLj7-JrXvMB4_j+jzuvjG_7iXNjx5KmLBTXHPNdGA@mail.gmail.com

I expected something much smaller in scope, where we just add a
"plan_work_mem" field to the Plan struct, copy the work_mem global GUC
to that field when we construct a Plan node, and then reference the
plan_work_mem instead of the GUC directly.

Can you give a bit more context about why we need so many changes,
including test changes?

Regards,
Jeff Davis

#19James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#18)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, Feb 24, 2025 at 6:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

I didn't look at the details yet. But from:

/messages/by-id/CAJVSvF7x_DLj7-JrXvMB4_j+jzuvjG_7iXNjx5KmLBTXHPNdGA@mail.gmail.com

I expected something much smaller in scope, where we just add a
"plan_work_mem" field to the Plan struct, copy the work_mem global GUC
to that field when we construct a Plan node, and then reference the
plan_work_mem instead of the GUC directly.

What you describe is basically Patch 3: it copies the work_mem and/or
work_mem * hash_mem_multiplier global GUCs onto a "workmem_limit"
field on the Plan struct, and then references that field instead of
the GUC.

Patch 3 basically consists of a new file that copies these GUCs to the
new field, along with small changes to all relevant execution nodes to
reach that field instead of the global GUC. Excluding test changes, it
adds 281 lines to new file "execWorkmem.c", and modifies 263 other
lines across 29 other files; most files have < 5 lines modified.

Can you give a bit more context about why we need so many changes,
including test changes?

So Patch 3 is what you describe, above. By itself, this does very
little, so Patch 4 serves as a PoC or demo showing how a cloud service
provider might use Patch 3's framework to provide better memory
management.

I don't think Patch 4 needs to go into core PostgreSQL, but I find it
helpful in demonstrating how the "workmem" framework could be used. It
adds a hook to allow an extension to override the "copy" function
added in Patch 3. The hook stuff itself is pretty small. And then, to
show how a useful extension could be written using that hook, Patch 4
includes a basic extension that implements the hook.

However, the ability to override a field via hook is useful only if
that hook has enough information to make an intelligent decision! So,
we need Patch 1, which just copies the existing "workmem" *estimates*,
from existing planner logic, onto a second Path / Plan field, this one
just called "workmem". It could be renamed "workmem_estimate," or
anything else -- the important thing is that this field is on the
Plan, so the hook can look at it when deciding how much working memory
to assign to that Plan node. The "workmem" field added by Patch 1 is
analogous to the "cost" field the planner already exposes.

Patch 1 adds ~200 lines to explain.c, to display workmem stats; and
modifies around 600 lines in costsize.c and createplan.c, to copy the
existing "workmem" estimate onto the Path and Plan fields. I could
omit populating the Path field, and copy only to the Plan field, but
it seemed like a good time to fill in the Path as well, in case future
logic wants to make use of it. (None of the other 3 patches use the
Path's "workmem" field.) So, Patch 1 is around 900 lines of code,
total, but none of the changes are very serious, since they just copy
existing estimates onto a field.

So that's Patches 1, 3, and 4; and Patch 2 is just some local
infrastructure that lets Patches 3 and 4 find all the query's SubPlans
and assign working memory to them.

In summary:

* Patch 1 copies the planner's "workmem" *estimate* to a new field on the Plan;
* Patch 2 keeps track of SubPlans, so Patches 3 and 4 can assign
workmem to them;
* Patch 3 copies the "work_mem" *limit* GUC to a new field on the Plan; and
* Patch 4 is a demo / PoC / example of how an extension can override
the Plan's "workmem_limit" field, so it doesn't need to go into core
PostgreSQL.

We don't need test changes, but the code is pretty easy to test, so,
for completeness, I added a new unit test, which is basically just a
collection of queries copied from existing unit tests, displayed using
the new "EXPLAIN (work_mem on)" option. And the "test changes" in
Patch 3 just add "work mem limit" to the output -- it's the same test
as Patch 1, just showing more information thanks to Patch 3.

Overall, the changes made to PostgreSQL are minimal, but they are
spread across multiple files... because (a) costing is spread across
multiple files (costsize.c, pathnode.c, and createplan.c -- and also
subselect.c, etc.); and (b) query execution is spread across multiple
files (e.g., nodeSort.c, nodeHash.c, etc.).

Every time we cost a Path or Plan that uses workmem, Patch 1 needs to
copy the Path/Plan's workmem estimate onto the new field. And every
time an exec node uses workmem, Patch 3 needs to read the workmem
limit off the new field. So looking at the number of files touched
overstates the size of the patches.

Thanks,
James

#20James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#19)
5 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Mon, Feb 24, 2025 at 9:55 PM James Hunter <james.hunter.pg@gmail.com> wrote:

On Mon, Feb 24, 2025 at 6:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:

Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:

I didn't look at the details yet. But from:

/messages/by-id/CAJVSvF7x_DLj7-JrXvMB4_j+jzuvjG_7iXNjx5KmLBTXHPNdGA@mail.gmail.com

I expected something much smaller in scope, where we just add a
"plan_work_mem" field to the Plan struct, copy the work_mem global GUC
to that field when we construct a Plan node, and then reference the
plan_work_mem instead of the GUC directly.

Attaching a new refactoring, which splits the code changes into
patches by functionality. This refactoring yields 5 patches, each of
which is relatively localized. I hope that the result will be more
focused and more feasible to review.

* Patch 1: modifies file setrefs.c, to track all (regular) SubPlan
objects that occur inside of Plan node (qual) expressions, on a new
Plan.subPlan list (parallel to the existing Plan.initPlan list, which
is for SubPlans that have been turned into init plans).

[Patch 1 has no visible side effects, since it just populates a list
on the Plan object.]

* Patch 2: copies the work_mem [* hash_mem_multiplier] GUC(s) to a new
Plan field, Plan.workmem_limit; and modifies existing exec nodes to
read the limit from this field instead of the GUCs. Adds a new file,
"execWorkmem.c", that does the GUC copying, and modifies existing exec
nodes to read the new field(s).

[Patch 2 has no visible side effects, since it just refactors code, to
store the GUCs on a field and then read those fields instead of the
GUCs.]

* Patch 3: stores the optimizer's estimate of how much working memory
a given Path / Plan node will use, on the Path / Plan, in a new field,
"workmem". (I used "workmem" for the estimate, vs. "workmem_limit," in
Patch 2, for the limit. This is to try to be parallel with the
existing "rows" and "cost" estimates.) Involves a significant amount
of code in costsize.c and createplan.c, because sometimes this
estimate is not readily available.

(What I mean is: while Patch 2 just reads the workmem_limit from a
GUC, Patch 3 has to estimate the actual workmem by basically
multiplying (width * rows). But not all Paths / Plans cost the
possibility of spilling, so sometimes I have to copy this formula from
the corresponding exec node, etc. The logical changes in Patch 3 are
simple, but the physical LoC is larger.)

[Patch 3 has no visible side effect, since it just stores an estimate
on the Plan object.]

* Patch 4: modifies file explain.c to implement a "work_mem on" option
to the EXPLAIN command. Also adds a unit test that shows that this
"work_mem on" option works as expected.

[Patch 4 is pure visible side effect -- all it does is add a new
option to display workmem stats to the customer. But it doesn't change
any existing behavior: it just adds a new EXPLAIN option.]

* Patch 5: adds a sample extension / hook that shows how Patches 2 and
3 can be used -- without much effort! -- to implement a per-query
working-memory limit, that gives more working memory to exec nodes
that (are estimated to) need it, while taking working memory away, if
necessary, from exec nodes that (we estimate) don't need it.

The refactored patch set should be more feasible to review, since each
patch is now localized to a single piece of functionality.

Note that Patch 5 isn't essential to merge into core PostgreSQL, since
it's mostly a proof-of-concept for how a "work_mem.query_work_mem" GUC
could be implemented. But Patches 2 and 3 are needed, since they
expose the limit and estimate, on the Plan, on which Patch 5 (or any
similar working-memory extension) relies.

Thanks again,
James Hunter

Attachments:

0001-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch (application/octet-stream)
From 0c05e29424247c737379805ffebba99045fe0d3d Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Thu, 20 Feb 2025 17:33:48 +0000
Subject: [PATCH 1/5] Store non-init-plan SubPlan objects in Plan list

We currently track SubPlan objects, on Plans, via either the plan->initPlan
list, for init plans; or via whatever expression contains the SubPlan, for
regular sub plans.

A SubPlan object can itself use working memory, if it uses a hash table.
This hash table is associated with the SubPlan itself, and not with the
Plan to which the SubPlan points.

To allow us to assign working memory to an individual SubPlan, this commit
stores a link to the regular SubPlan, inside a new plan->subPlan list,
when we finalize the (parent) Plan whose expression contains the regular
SubPlan.

Unlike the existing plan->initPlan list, we will not use the new plan->
subPlan list to initialize SubPlan nodes -- that must be done when we
initialize the expression that contains the SubPlan. Instead, we will use
it, during InitPlan() but before ExecInitNode(), to assign a working-
memory limit to the SubPlan.
---
 src/backend/optimizer/plan/setrefs.c | 284 +++++++++++++++++----------
 src/include/nodes/plannodes.h        |   2 +
 2 files changed, 177 insertions(+), 109 deletions(-)

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 999a5a8ab5a..8a4e77baa90 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -58,6 +58,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	int			rtoffset;
 	double		num_exec;
 } fix_scan_expr_context;
@@ -65,6 +66,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *outer_itlist;
 	indexed_tlist *inner_itlist;
 	Index		acceptable_rel;
@@ -76,6 +78,7 @@ typedef struct
 typedef struct
 {
 	PlannerInfo *root;
+	Plan	   *plan;
 	indexed_tlist *subplan_itlist;
 	int			newvarno;
 	int			rtoffset;
@@ -127,8 +130,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, plan, lst, rtoffset, num_exec) \
+	((List *) fix_scan_expr(root, plan, (Node *) (lst), rtoffset, num_exec))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -157,7 +160,7 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 										int rtoffset);
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
-static Node *fix_scan_expr(PlannerInfo *root, Node *node,
+static Node *fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node,
 						   int rtoffset, double num_exec);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
@@ -183,7 +186,7 @@ static Var *search_indexed_tlist_for_sortgroupref(Expr *node,
 												  Index sortgroupref,
 												  indexed_tlist *itlist,
 												  int newvarno);
-static List *fix_join_expr(PlannerInfo *root,
+static List *fix_join_expr(PlannerInfo *root, Plan *plan,
 						   List *clauses,
 						   indexed_tlist *outer_itlist,
 						   indexed_tlist *inner_itlist,
@@ -193,7 +196,7 @@ static List *fix_join_expr(PlannerInfo *root,
 						   double num_exec);
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
-static Node *fix_upper_expr(PlannerInfo *root,
+static Node *fix_upper_expr(PlannerInfo *root, Plan *plan,
 							Node *node,
 							indexed_tlist *subplan_itlist,
 							int newvarno,
@@ -202,7 +205,7 @@ static Node *fix_upper_expr(PlannerInfo *root,
 							double num_exec);
 static Node *fix_upper_expr_mutator(Node *node,
 									fix_upper_expr_context *context);
-static List *set_returning_clause_references(PlannerInfo *root,
+static List *set_returning_clause_references(PlannerInfo *root, Plan *plan,
 											 List *rlist,
 											 Plan *topplan,
 											 Index resultRelation,
@@ -633,10 +636,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -646,13 +649,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablesample = (TableSampleClause *)
-					fix_scan_expr(root, (Node *) splan->tablesample,
+					fix_scan_expr(root, plan, (Node *) splan->tablesample,
 								  rtoffset, 1);
 			}
 			break;
@@ -662,22 +665,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual,
+					fix_scan_list(root, plan, splan->indexqual,
 								  rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->indexorderby =
-					fix_scan_list(root, splan->indexorderby,
+					fix_scan_list(root, plan, splan->indexorderby,
 								  rtoffset, 1);
 				splan->indexorderbyorig =
-					fix_scan_list(root, splan->indexorderbyorig,
+					fix_scan_list(root, plan, splan->indexorderbyorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -697,9 +700,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, plan, splan->indexqual, rtoffset, 1);
 				splan->indexqualorig =
-					fix_scan_list(root, splan->indexqualorig,
+					fix_scan_list(root, plan, splan->indexqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -709,13 +712,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->bitmapqualorig =
-					fix_scan_list(root, splan->bitmapqualorig,
+					fix_scan_list(root, plan, splan->bitmapqualorig,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -725,13 +728,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidquals =
-					fix_scan_list(root, splan->tidquals,
+					fix_scan_list(root, plan, splan->tidquals,
 								  rtoffset, 1);
 			}
 			break;
@@ -741,13 +744,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tidrangequals =
-					fix_scan_list(root, splan->tidrangequals,
+					fix_scan_list(root, plan, splan->tidrangequals,
 								  rtoffset, 1);
 			}
 			break;
@@ -762,13 +765,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, plan, splan->functions, rtoffset, 1);
 			}
 			break;
 		case T_TableFuncScan:
@@ -777,13 +780,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->tablefunc = (TableFunc *)
-					fix_scan_expr(root, (Node *) splan->tablefunc,
+					fix_scan_expr(root, plan, (Node *) splan->tablefunc,
 								  rtoffset, 1);
 			}
 			break;
@@ -793,13 +796,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 				splan->values_lists =
-					fix_scan_list(root, splan->values_lists,
+					fix_scan_list(root, plan, splan->values_lists,
 								  rtoffset, 1);
 			}
 			break;
@@ -809,10 +812,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -822,10 +825,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -835,10 +838,10 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
-					fix_scan_list(root, splan->scan.plan.targetlist,
+					fix_scan_list(root, plan, splan->scan.plan.targetlist,
 								  rtoffset, NUM_EXEC_TLIST(plan));
 				splan->scan.plan.qual =
-					fix_scan_list(root, splan->scan.plan.qual,
+					fix_scan_list(root, plan, splan->scan.plan.qual,
 								  rtoffset, NUM_EXEC_QUAL(plan));
 			}
 			break;
@@ -877,7 +880,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 */
 				set_dummy_tlist_references(plan, rtoffset);
 
-				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
+				mplan->param_exprs = fix_scan_list(root, plan, mplan->param_exprs,
 												   rtoffset,
 												   NUM_EXEC_TLIST(plan));
 				break;
@@ -939,9 +942,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitOffset, rtoffset, 1);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->limitCount, rtoffset, 1);
 			}
 			break;
 		case T_Agg:
@@ -994,14 +997,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, plan, wplan->startOffset, rtoffset, 1);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
-				wplan->runCondition = fix_scan_list(root,
+					fix_scan_expr(root, plan, wplan->endOffset, rtoffset, 1);
+				wplan->runCondition = fix_scan_list(root, plan,
 													wplan->runCondition,
 													rtoffset,
 													NUM_EXEC_TLIST(plan));
-				wplan->runConditionOrig = fix_scan_list(root,
+				wplan->runConditionOrig = fix_scan_list(root, plan,
 														wplan->runConditionOrig,
 														rtoffset,
 														NUM_EXEC_TLIST(plan));
@@ -1043,15 +1046,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					}
 
 					splan->plan.targetlist =
-						fix_scan_list(root, splan->plan.targetlist,
+						fix_scan_list(root, plan, splan->plan.targetlist,
 									  rtoffset, NUM_EXEC_TLIST(plan));
 					splan->plan.qual =
-						fix_scan_list(root, splan->plan.qual,
+						fix_scan_list(root, plan, splan->plan.qual,
 									  rtoffset, NUM_EXEC_QUAL(plan));
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, plan, splan->resconstantqual, rtoffset,
+								  1);
 			}
 			break;
 		case T_ProjectSet:
@@ -1066,7 +1070,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->withCheckOptionLists =
-					fix_scan_list(root, splan->withCheckOptionLists,
+					fix_scan_list(root, plan, splan->withCheckOptionLists,
 								  rtoffset, 1);
 
 				if (splan->returningLists)
@@ -1086,7 +1090,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						List	   *rlist = (List *) lfirst(lcrl);
 						Index		resultrel = lfirst_int(lcrr);
 
-						rlist = set_returning_clause_references(root,
+						rlist = set_returning_clause_references(root, plan,
 																rlist,
 																subplan,
 																resultrel,
@@ -1121,13 +1125,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					itlist = build_tlist_index(splan->exclRelTlist);
 
 					splan->onConflictSet =
-						fix_join_expr(root, splan->onConflictSet,
+						fix_join_expr(root, plan, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
 
 					splan->onConflictWhere = (Node *)
-						fix_join_expr(root, (List *) splan->onConflictWhere,
+						fix_join_expr(root, plan, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
 									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
@@ -1135,7 +1139,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, plan, splan->exclRelTlist, rtoffset, 1);
 				}
 
 				/*
@@ -1186,7 +1190,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 							MergeAction *action = (MergeAction *) lfirst(l);
 
 							/* Fix targetList of each action. */
-							action->targetList = fix_join_expr(root,
+							action->targetList = fix_join_expr(root, plan,
 															   action->targetList,
 															   NULL, itlist,
 															   resultrel,
@@ -1195,7 +1199,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   NUM_EXEC_TLIST(plan));
 
 							/* Fix quals too. */
-							action->qual = (Node *) fix_join_expr(root,
+							action->qual = (Node *) fix_join_expr(root, plan,
 																  (List *) action->qual,
 																  NULL, itlist,
 																  resultrel,
@@ -1206,7 +1210,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 						/* Fix join condition too. */
 						mergeJoinCondition = (Node *)
-							fix_join_expr(root,
+							fix_join_expr(root, plan,
 										  (List *) mergeJoinCondition,
 										  NULL, itlist,
 										  resultrel,
@@ -1353,7 +1357,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 
 	plan->scan.scanrelid += rtoffset;
 	plan->scan.plan.targetlist = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.targetlist,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1361,7 +1365,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_TLIST((Plan *) plan));
 	plan->scan.plan.qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->scan.plan.qual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1369,7 +1373,7 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	plan->recheckqual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, (Plan *) plan,
 					   (Node *) plan->recheckqual,
 					   index_itlist,
 					   INDEX_VAR,
@@ -1377,13 +1381,13 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NRM_EQUAL,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
-	plan->indexqual = fix_scan_list(root, plan->indexqual,
+	plan->indexqual = fix_scan_list(root, (Plan *) plan, plan->indexqual,
 									rtoffset, 1);
 	/* indexorderby is already transformed to reference index columns */
-	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
+	plan->indexorderby = fix_scan_list(root, (Plan *) plan, plan->indexorderby,
 									   rtoffset, 1);
 	/* indextlist must NOT be transformed to reference index columns */
-	plan->indextlist = fix_scan_list(root, plan->indextlist,
+	plan->indextlist = fix_scan_list(root, (Plan *) plan, plan->indextlist,
 									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 
 	pfree(index_itlist);
@@ -1430,10 +1434,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		 */
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
-			fix_scan_list(root, plan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
 		plan->scan.plan.qual =
-			fix_scan_list(root, plan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) plan, plan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
 
 		result = (Plan *) plan;
@@ -1599,7 +1603,7 @@ set_foreignscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(fscan->fdw_scan_tlist);
 
 		fscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1607,7 +1611,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1615,7 +1619,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1623,7 +1627,7 @@ set_foreignscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) fscan,
 						   (Node *) fscan->fdw_recheck_quals,
 						   itlist,
 						   INDEX_VAR,
@@ -1633,7 +1637,7 @@ set_foreignscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
-			fix_scan_list(root, fscan->fdw_scan_tlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 	}
 	else
@@ -1643,16 +1647,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 * way
 		 */
 		fscan->scan.plan.targetlist =
-			fix_scan_list(root, fscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
 		fscan->scan.plan.qual =
-			fix_scan_list(root, fscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) fscan, fscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_exprs =
-			fix_scan_list(root, fscan->fdw_exprs,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 		fscan->fdw_recheck_quals =
-			fix_scan_list(root, fscan->fdw_recheck_quals,
+			fix_scan_list(root, (Plan *) fscan, fscan->fdw_recheck_quals,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
 	}
 
@@ -1685,7 +1689,7 @@ set_customscan_references(PlannerInfo *root,
 		indexed_tlist *itlist = build_tlist_index(cscan->custom_scan_tlist);
 
 		cscan->scan.plan.targetlist = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.targetlist,
 						   itlist,
 						   INDEX_VAR,
@@ -1693,7 +1697,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->scan.plan.qual,
 						   itlist,
 						   INDEX_VAR,
@@ -1701,7 +1705,7 @@ set_customscan_references(PlannerInfo *root,
 						   NRM_EQUAL,
 						   NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs = (List *)
-			fix_upper_expr(root,
+			fix_upper_expr(root, (Plan *) cscan,
 						   (Node *) cscan->custom_exprs,
 						   itlist,
 						   INDEX_VAR,
@@ -1711,20 +1715,20 @@ set_customscan_references(PlannerInfo *root,
 		pfree(itlist);
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
-			fix_scan_list(root, cscan->custom_scan_tlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_scan_tlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
-			fix_scan_list(root, cscan->scan.plan.targetlist,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.targetlist,
 						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
 		cscan->scan.plan.qual =
-			fix_scan_list(root, cscan->scan.plan.qual,
+			fix_scan_list(root, (Plan *) cscan, cscan->scan.plan.qual,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 		cscan->custom_exprs =
-			fix_scan_list(root, cscan->custom_exprs,
+			fix_scan_list(root, (Plan *) cscan, cscan->custom_exprs,
 						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
 	}
 
@@ -1752,7 +1756,8 @@ set_customscan_references(PlannerInfo *root,
  * startup time.
  */
 static int
-register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
+register_partpruneinfo(PlannerInfo *root, Plan *plan, int part_prune_index,
+					   int rtoffset)
 {
 	PlannerGlobal *glob = root->glob;
 	PartitionPruneInfo *pinfo;
@@ -1776,10 +1781,10 @@ register_partpruneinfo(PlannerInfo *root, int part_prune_index, int rtoffset)
 
 			prelinfo->rtindex += rtoffset;
 			prelinfo->initial_pruning_steps =
-				fix_scan_list(root, prelinfo->initial_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->initial_pruning_steps,
 							  rtoffset, 1);
 			prelinfo->exec_pruning_steps =
-				fix_scan_list(root, prelinfo->exec_pruning_steps,
+				fix_scan_list(root, plan, prelinfo->exec_pruning_steps,
 							  rtoffset, 1);
 
 			for (i = 0; i < prelinfo->nparts; i++)
@@ -1863,7 +1868,8 @@ set_append_references(PlannerInfo *root,
 	 */
 	if (aplan->part_prune_index >= 0)
 		aplan->part_prune_index =
-			register_partpruneinfo(root, aplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) aplan,
+								   aplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(aplan->plan.lefttree == NULL);
@@ -1931,7 +1937,8 @@ set_mergeappend_references(PlannerInfo *root,
 	 */
 	if (mplan->part_prune_index >= 0)
 		mplan->part_prune_index =
-			register_partpruneinfo(root, mplan->part_prune_index, rtoffset);
+			register_partpruneinfo(root, (Plan *) mplan,
+								   mplan->part_prune_index, rtoffset);
 
 	/* We don't need to recurse to lefttree or righttree ... */
 	Assert(mplan->plan.lefttree == NULL);
@@ -1958,7 +1965,7 @@ set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	 */
 	outer_itlist = build_tlist_index(outer_plan->targetlist);
 	hplan->hashkeys = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) hplan->hashkeys,
 					   outer_itlist,
 					   OUTER_VAR,
@@ -2194,7 +2201,8 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * replacing Aggref nodes that should be replaced by initplan output Params,
  * choosing the best implementation for AlternativeSubPlans,
  * looking up operator opcode info for OpExpr and related nodes,
- * and adding OIDs from regclass Const nodes into root->glob->relationOids.
+ * adding OIDs from regclass Const nodes into root->glob->relationOids, and
+ * recording Subplans that use hash tables.
  *
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
@@ -2204,11 +2212,13 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Plan *plan, Node *node, int rtoffset,
+			  double num_exec)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
 
@@ -2299,8 +2309,21 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 															 (AlternativeSubPlan *) node,
 															 context->num_exec),
 									 context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_scan_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 static bool
@@ -2312,6 +2335,17 @@ fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this SubPlan so that we can assign working memory to it (if
+		 * needed).
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
 	return expression_tree_walker(node, fix_scan_expr_walker, context);
 }
 
@@ -2341,7 +2375,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * NestLoopParams now, because those couldn't refer to nullable
 	 * subexpressions.
 	 */
-	join->joinqual = fix_join_expr(root,
+	join->joinqual = fix_join_expr(root, (Plan *) join,
 								   join->joinqual,
 								   outer_itlist,
 								   inner_itlist,
@@ -2371,7 +2405,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 			 * make things match up perfectly seems well out of proportion to
 			 * the value.
 			 */
-			nlp->paramval = (Var *) fix_upper_expr(root,
+			nlp->paramval = (Var *) fix_upper_expr(root, (Plan *) join,
 												   (Node *) nlp->paramval,
 												   outer_itlist,
 												   OUTER_VAR,
@@ -2388,7 +2422,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		MergeJoin  *mj = (MergeJoin *) join;
 
-		mj->mergeclauses = fix_join_expr(root,
+		mj->mergeclauses = fix_join_expr(root, (Plan *) join,
 										 mj->mergeclauses,
 										 outer_itlist,
 										 inner_itlist,
@@ -2401,7 +2435,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	{
 		HashJoin   *hj = (HashJoin *) join;
 
-		hj->hashclauses = fix_join_expr(root,
+		hj->hashclauses = fix_join_expr(root, (Plan *) join,
 										hj->hashclauses,
 										outer_itlist,
 										inner_itlist,
@@ -2414,7 +2448,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 		 * HashJoin's hashkeys are used to look for matching tuples from its
 		 * outer plan (not the Hash node!) in the hashtable.
 		 */
-		hj->hashkeys = (List *) fix_upper_expr(root,
+		hj->hashkeys = (List *) fix_upper_expr(root, (Plan *) join,
 											   (Node *) hj->hashkeys,
 											   outer_itlist,
 											   OUTER_VAR,
@@ -2433,7 +2467,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 	 * be, so we just tell fix_join_expr to accept superset nullingrels
 	 * matches instead of exact ones.
 	 */
-	join->plan.targetlist = fix_join_expr(root,
+	join->plan.targetlist = fix_join_expr(root, (Plan *) join,
 										  join->plan.targetlist,
 										  outer_itlist,
 										  inner_itlist,
@@ -2441,7 +2475,7 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
 										  NUM_EXEC_TLIST((Plan *) join));
-	join->plan.qual = fix_join_expr(root,
+	join->plan.qual = fix_join_expr(root, (Plan *) join,
 									join->plan.qual,
 									outer_itlist,
 									inner_itlist,
@@ -2519,7 +2553,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 													  subplan_itlist,
 													  OUTER_VAR);
 			if (!newexpr)
-				newexpr = fix_upper_expr(root,
+				newexpr = fix_upper_expr(root, plan,
 										 (Node *) tle->expr,
 										 subplan_itlist,
 										 OUTER_VAR,
@@ -2528,7 +2562,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 										 NUM_EXEC_TLIST(plan));
 		}
 		else
-			newexpr = fix_upper_expr(root,
+			newexpr = fix_upper_expr(root, plan,
 									 (Node *) tle->expr,
 									 subplan_itlist,
 									 OUTER_VAR,
@@ -2542,7 +2576,7 @@ set_upper_references(PlannerInfo *root, Plan *plan, int rtoffset)
 	plan->targetlist = output_targetlist;
 
 	plan->qual = (List *)
-		fix_upper_expr(root,
+		fix_upper_expr(root, plan,
 					   (Node *) plan->qual,
 					   subplan_itlist,
 					   OUTER_VAR,
@@ -3081,6 +3115,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *    the source relation elements, outer_itlist = NULL and acceptable_rel
  *    the target relation.
  *
+ * 'plan' is the Plan node to which the clauses belong
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
  *		or NULL
@@ -3097,6 +3132,7 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  */
 static List *
 fix_join_expr(PlannerInfo *root,
+			  Plan *plan,
 			  List *clauses,
 			  indexed_tlist *outer_itlist,
 			  indexed_tlist *inner_itlist,
@@ -3108,6 +3144,7 @@ fix_join_expr(PlannerInfo *root,
 	fix_join_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.outer_itlist = outer_itlist;
 	context.inner_itlist = inner_itlist;
 	context.acceptable_rel = acceptable_rel;
@@ -3234,7 +3271,19 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 															 context->num_exec),
 									 context);
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_join_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_join_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3258,6 +3307,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * expensive, so we don't want to try it in the common case where the
  * subplan tlist is just a flattened list of Vars.)
  *
+ * 'plan': the Plan node to which the expression belongs
  * 'node': the tree to be fixed (a target item or qual)
  * 'subplan_itlist': indexed target list for subplan (or index)
  * 'newvarno': varno to use for Vars referencing tlist elements
@@ -3271,6 +3321,7 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  */
 static Node *
 fix_upper_expr(PlannerInfo *root,
+			   Plan *plan,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
 			   int newvarno,
@@ -3281,6 +3332,7 @@ fix_upper_expr(PlannerInfo *root,
 	fix_upper_expr_context context;
 
 	context.root = root;
+	context.plan = plan;
 	context.subplan_itlist = subplan_itlist;
 	context.newvarno = newvarno;
 	context.rtoffset = rtoffset;
@@ -3358,8 +3410,21 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
 															  (AlternativeSubPlan *) node,
 															  context->num_exec),
 									  context);
+
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_upper_expr_mutator, context);
+	node = expression_tree_mutator(node, fix_upper_expr_mutator, context);
+
+	if (IsA(node, SubPlan))
+	{
+		/*
+		 * Track this (mutated) SubPlan so that we can assign working memory
+		 * to it, if needed.
+		 */
+		if (context->plan)
+			context->plan->subPlan = lappend(context->plan->subPlan, node);
+	}
+
+	return node;
 }
 
 /*
@@ -3377,9 +3442,10 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * We also must perform opcode lookup and add regclass OIDs to
  * root->glob->relationOids.
  *
+ * 'plan': the ModifyTable node itself
  * 'rlist': the RETURNING targetlist to be fixed
  * 'topplan': the top subplan node that will be just below the ModifyTable
- *		node (note it's not yet passed through set_plan_refs)
+ *		node
  * 'resultRelation': RT index of the associated result relation
  * 'rtoffset': how much to increment varnos by
  *
@@ -3391,7 +3457,7 @@ fix_upper_expr_mutator(Node *node, fix_upper_expr_context *context)
  * Note: resultRelation is not yet adjusted by rtoffset.
  */
 static List *
-set_returning_clause_references(PlannerInfo *root,
+set_returning_clause_references(PlannerInfo *root, Plan *plan,
 								List *rlist,
 								Plan *topplan,
 								Index resultRelation,
@@ -3415,7 +3481,7 @@ set_returning_clause_references(PlannerInfo *root,
 	 */
 	itlist = build_tlist_index_other_vars(topplan->targetlist, resultRelation);
 
-	rlist = fix_join_expr(root,
+	rlist = fix_join_expr(root, plan,
 						  rlist,
 						  itlist,
 						  NULL,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..39471466a9a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -204,6 +204,8 @@ typedef struct Plan
 	struct Plan *righttree;
 	/* Init Plan nodes (un-correlated expr subselects) */
 	List	   *initPlan;
+	/* Regular Sub Plan nodes (cf. "initPlan", above) */
+	List	   *subPlan;
 
 	/*
 	 * Information for management of parameter-change-driven rescanning
-- 
2.47.1

Attachment: 0002-Store-working-memory-limit-on-Plan-field-rather-than.patch (application/octet-stream)
From c1860e6f46643d8bae1bfeb1a45c9f3311034925 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 25 Feb 2025 22:44:01 +0000
Subject: [PATCH 2/5] Store working memory limit on Plan field, rather than in
 GUC

This commit moves the working-memory limit that an executor node checks, at
runtime, from the "work_mem" and "hash_mem_multiplier" GUCs, to a new
field, "workmem_limit", added to the Plan node. To preserve backward
compatibility, it also copies the "work_mem", etc., values from these GUCs
to the new field. This means that this commit is just a refactoring, and
doesn't change any behavior.

This field is on the Plan node, instead of the PlanState, because it needs
to be set before we can call ExecInitNode(). Many PlanStates look at their
working-memory limit when creating their data structures, during
initialization. So the field is on the Plan node, but set between planning
and execution phases.
---
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         | 277 +++++++++++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  69 +++--
 src/backend/executor/nodeBitmapIndexscan.c |   3 +-
 src/backend/executor/nodeBitmapOr.c        |   3 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  23 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  12 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |   2 +
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |   4 +-
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   5 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/plannodes.h              |  14 +-
 src/include/nodes/primnodes.h              |   3 +
 28 files changed, 423 insertions(+), 53 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..5ec176d1355
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,277 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		what working-memory limit was imposed on it.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off its
+ *		corresponding Plan node's "workmem_limit" field. And we populate that
+ *		field by calling ExecAssignWorkMem(), from InitPlan(), before we
+ *		initialize the PlanState nodes.
+ *
+ * 		The "workmem_limit" field is a limit "per data structure," rather than
+ *		"per PlanState". This is needed because some SQL operators (e.g.,
+ *		RecursiveUnion and Agg) require multiple data structures, and sometimes
+ *		the data structures don't all share the same memory requirement. So we
+ *		cannot always just divide a "per PlanState" limit among individual data
+ *		structures. Instead, we maintain the limits on the data structures (and
+ *		EXPLAIN, for example, sums them up into a single, human-readable
+ *		number).
+ *
+ *		Note that the *Path's* "workmem" estimate is per SQL operator, but when
+ *		we convert that Path to a Plan we also break its "workmem" estimate
+ *		down into per-data structure estimates. Some operators therefore
+ *		require additional "limit" fields, which we add to the corresponding
+ *		Plan.
+ *
+ *		We store the "workmem_limit" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "limit" fields on the Plan, but
+ *		set them when we start execution, as part of InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+
+
+/* decls for local routines only used within this module */
+static void assign_workmem_subplan(SubPlan *subplan);
+static void assign_workmem_plan(Plan *plan);
+static void assign_workmem_agg(Agg *agg);
+static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+									bool *is_first_sort);
+
+/* end of local decls */
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Recursively assigns working memory to any Plans or SubPlans that need
+ *		it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	/* Assign working memory to the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			assign_workmem_plan(plan);
+	}
+
+	/* And assign working memory to the main Plan tree. */
+	assign_workmem_plan(plannedstmt->planTree);
+}
+
+static void
+assign_workmem_subplan(SubPlan *subplan)
+{
+	subplan->hashtab_workmem_limit = subplan->useHashTable ?
+		get_hash_memory_limit() / 1024 : 0;
+
+	subplan->hashnul_workmem_limit =
+		subplan->useHashTable && !subplan->unknownEqFalse ?
+		get_hash_memory_limit() / 1024 : 0;
+}
+
+static void
+assign_workmem_plan(Plan *plan)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Assign working memory to this node's (hashed) SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		assign_workmem_subplan(subplan);
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		assign_workmem_subplan(subplan);
+
+	/* Assign working memory to this node. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_FunctionScan:
+		case T_IncrementalSort:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+			plan->workmem_limit = work_mem;
+			break;
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			plan->workmem_limit = get_hash_memory_limit() / 1024;
+			break;
+		case T_Agg:
+			assign_workmem_agg((Agg *) plan);
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+				plan->workmem_limit = work_mem;
+
+				if (runion->numCols > 0)
+				{
+					/* Also include memory for hash table. */
+					runion->hashWorkMemLimit = get_hash_memory_limit() / 1024;
+				}
+
+				break;
+			}
+		default:
+			Assert(plan->workmem == 0);
+			plan->workmem_limit = 0;
+			break;
+	}
+
+	/*
+	 * Assign working memory to this node's children. (Logic copied from
+	 * ExplainNode().)
+	 */
+	if (outerPlan(plan))
+		assign_workmem_plan(outerPlan(plan));
+
+	if (innerPlan(plan))
+		assign_workmem_plan(innerPlan(plan));
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				assign_workmem_plan(child);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				assign_workmem_plan(child);
+			break;
+		case T_SubqueryScan:
+			assign_workmem_plan(((SubqueryScan *) plan)->subplan);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				assign_workmem_plan(child);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+assign_workmem_agg(Agg *agg)
+{
+	bool		is_first_sort = true;
+
+	/* Assign working memory to the main Agg node. */
+	assign_workmem_agg_node(agg,
+							true /* is_first */ ,
+							agg->chain == NULL /* is_last */ ,
+							&is_first_sort);
+
+	/* Assign working memory to any other grouping sets. */
+	foreach_node(Agg, aggnode, agg->chain)
+	{
+		assign_workmem_agg_node(aggnode,
+								false /* is_first */ ,
+								foreach_current_index(aggnode) ==
+								list_length(agg->chain) - 1 /* is_last */ ,
+								&is_first_sort);
+	}
+}
+
+static void
+assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
+						bool *is_first_sort)
+{
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * Because nodeAgg.c will combine all AGG_HASHED nodes into a
+			 * single phase, it's easier to store the hash working-memory
+			 * limit on the first AGG_{HASHED,MIXED} node, and set it to zero
+			 * for all subsequent AGG_HASHED nodes.
+			 */
+			agg->plan.workmem_limit = is_first ?
+				get_hash_memory_limit() / 1024 : 0;
+			break;
+		case AGG_SORTED:
+
+			/*
+			 * Also store the sort-output working-memory limit on the first
+			 * AGG_SORTED node, and set it to zero for all subsequent
+			 * AGG_SORTED nodes.
+			 *
+			 * We'll need working memory to hold the "sort_out" only if this
+			 * isn't the last Agg node (if it is the last, there's no later
+			 * phase that needs our output re-sorted).
+			 */
+			agg->plan.workmem_limit = *is_first_sort && !is_last ?
+				work_mem : 0;
+
+			*is_first_sort = false;
+			break;
+		default:
+			break;
+	}
+
+	/* Also include memory needed to sort the input: */
+	if (agg->numSorts > 0)
+	{
+		Assert(agg->sortWorkMem > 0);
+
+		agg->sortWorkMemLimit = work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..9e5bcf7ada4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -431,9 +434,10 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 									   int64 input_tuples, double input_card,
 									   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
-static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
-							   int used_bits, double input_groups,
-							   double hashentrysize);
+static void hashagg_spill_init(HashAggSpill *spill,
+							   LogicalTapeSet *tapeset, int used_bits,
+							   double input_groups, double hashentrysize,
+							   Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +525,14 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem_limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit = aggstate->phases[1].aggnode->plan.workmem_limit;
+		Assert(workmem_limit > 0);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -577,7 +589,7 @@ fetch_input_tuple(AggState *aggstate)
  */
 static void
 initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
-					 AggStatePerGroup pergroupstate)
+					 AggStatePerGroup pergroupstate, size_t workMem)
 {
 	/*
 	 * Start a fresh sort operation for each DISTINCT/ORDER BY aggregate.
@@ -591,6 +603,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		Assert(workMem > 0);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +619,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workMem, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +629,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workMem, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -687,7 +700,8 @@ initialize_aggregates(AggState *aggstate,
 			AggStatePerTrans pertrans = &transstates[transno];
 			AggStatePerGroup pergroupstate = &pergroup[transno];
 
-			initialize_aggregate(aggstate, pertrans, pergroupstate);
+			initialize_aggregate(aggstate, pertrans, pergroupstate,
+								 aggstate->phase->aggnode->sortWorkMemLimit);
 		}
 	}
 }
@@ -1498,7 +1512,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1522,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1552,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1821,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1845,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1943,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 		}
 	}
 }
@@ -2014,9 +2031,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2095,7 +2112,8 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
 		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
 		AggStatePerGroup pergroupstate = &pergroup[transno];
 
-		initialize_aggregate(aggstate, pertrans, pergroupstate);
+		initialize_aggregate(aggstate, pertrans, pergroupstate,
+							 aggstate->phase->aggnode->sortWorkMemLimit);
 	}
 }
 
@@ -2156,7 +2174,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2649,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2739,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) aggstate->ss.ps.plan->workmem_limit * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2938,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3673,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) aggstate->ss.ps.plan->workmem_limit * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..5e006baa88d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		Assert(node->ss.ps.plan->workmem_limit > 0);
+		tbm = tbm_create((Size) node->ss.ps.plan->workmem_limit * 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..4ba32639f7d 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,8 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				Assert(subnode->plan->workmem_limit > 0);
+				result = tbm_create((Size) subnode->plan->workmem_limit * 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..2272185dce7 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, node->scan.plan.workmem_limit);
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..bbb93a8dd58 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											node->ss.ps.plan->workmem_limit,
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..aee3c9ea67c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,6 +37,7 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
@@ -448,6 +449,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -471,11 +473,15 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) node->plan.workmem_limit * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
+							worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs);
 
@@ -599,6 +605,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,7 +665,8 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t worker_space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs)
@@ -687,9 +695,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -706,7 +714,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -808,7 +816,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
 									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs);
@@ -929,7 +938,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
@@ -1235,7 +1244,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..503d75e364b 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												plannode->sort.plan.workmem_limit,
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  plannode->sort.plan.workmem_limit,
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..10f764c1bd5 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, node->ss.ps.plan->workmem_limit);
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..a3fc37745ca 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) node->plan.workmem_limit * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..96dc8d53db3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -52,6 +52,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) node->hashWorkMemLimit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +203,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
+	rustate->intermediate_table =
+		tuplestore_begin_heap(false, false, node->plan.workmem_limit);
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..7b71adf05dc 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) node->plan.workmem_limit * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..1da77ab1d6a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   plannode->plan.workmem_limit,
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  plannode->plan.workmem_limit,
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..73214501238 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -546,6 +546,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) subplan->hashtab_workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
@@ -575,6 +576,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) subplan->hashnul_workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..8a9e534a743 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 tstate->ss.ps.plan->workmem_limit);
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..76819d140ba 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 node->plan.workmem_limit);
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..04360f45760 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2802,7 +2802,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4224,6 +4225,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
+							get_hash_memory_limit(),
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* might be shared with other workers */
+	size_t		worker_space_allowed;	/* exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..728006b3ff5 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,8 +329,9 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
-								uint64 *ngroups_limit, int *num_partitions);
+								int used_bits, Size hash_mem_limit,
+								Size *mem_limit, uint64 *ngroups_limit,
+								int *num_partitions);
 
 /* parallel instrumentation support */
 extern void ExecAggEstimate(AggState *node, ParallelContext *pcxt);
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..e4e9e0d1de1 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,8 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t worker_space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs);
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 39471466a9a..396f7881420 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -168,6 +168,9 @@ typedef struct Plan
 	/* total cost (assuming all tuples fetched) */
 	Cost		total_cost;
 
+	/* (runtime) working memory limit (in KB) */
+	int			workmem_limit;
+
 	/*
 	 * planner's estimate of result size of this plan step
 	 */
@@ -235,7 +238,7 @@ typedef struct Plan
 
 /* ----------------
  *	 Result node -
- *		If no outer plan, evaluate a variable-free targetlist.
+ *		If no outer plan, evaluate a variable-free targetlist.
  *		If outer plan, return tuples from outer plan (after a level of
  *		projection as shown by targetlist).
  *
@@ -428,6 +431,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* working-memory limit for hash table (in KB) */
+	int			hashWorkMemLimit;
 } RecursiveUnion;
 
 /* ----------------
@@ -1147,6 +1153,12 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
+
+	/* work_mem limit to sort one input (in KB) */
+	int			sortWorkMemLimit;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d0576da3e25..b932168237c 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,9 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	/* (Runtime) working-memory limits (in KB): */
+	int			hashtab_workmem_limit;	/* limit for hashtable */
+	int			hashnul_workmem_limit;	/* limit for hashnulls */
 } SubPlan;
 
 /*
-- 
2.47.1

0003-Add-workmem-estimate-to-Path-and-Plan-nodes.patchapplication/octet-stream; name=0003-Add-workmem-estimate-to-Path-and-Plan-nodes.patchDownload
From b62d0d39ceda5b8ea60da800318e81ff62611071 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 26 Feb 2025 00:58:28 +0000
Subject: [PATCH 3/5] Add "workmem" estimate to Path and Plan nodes

To allow for future optimizers to make decisions at Path time, this commit
aggregates the Path's total working memory onto the Path's "workmem" field,
normalized to a minimum of 64 KB and rounded up to the next whole KB.

To allow future hooks to override ExecAssignWorkMem(), this commit then
breaks that total working memory into per-data-structure working memory
on the Plan.
---
 src/backend/executor/nodeHash.c         |  13 +-
 src/backend/nodes/tidbitmap.c           |  18 ++
 src/backend/optimizer/path/costsize.c   | 387 ++++++++++++++++++++++--
 src/backend/optimizer/plan/createplan.c | 215 +++++++++++--
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 +++-
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/pathnodes.h           |   5 +
 src/include/nodes/plannodes.h           |   7 +-
 src/include/nodes/primnodes.h           |   3 +
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  12 +-
 src/include/optimizer/planmain.h        |   2 +-
 13 files changed, 672 insertions(+), 59 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index aee3c9ea67c..3f60f6305bd 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/guc.h"
@@ -454,6 +455,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -483,7 +485,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state->nparticipants - 1 : 0,
 							worker_space_allowed,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -669,7 +671,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -800,6 +803,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	*workmem = normalize_workmem(inner_rel_bytes + bucket_bytes);
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -820,7 +826,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									total_space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 3d835024caa..ac4c6b67350 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1554,6 +1554,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 04360f45760..b455721fcb7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -104,6 +104,7 @@
 #include "optimizer/plancat.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +201,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1118,18 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1605,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1672,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1778,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_workmem(relation_byte_size(path->rows,
+											 path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1864,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1891,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(nrterm->rows,
+												 nrterm->pathtarget->width),
+							  relation_byte_size(rterm->rows,
+												 rterm->pathtarget->width))
+						  * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_workmem(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1961,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1981,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2039,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2069,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2144,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2185,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full"), and one that sorts only suffix columns ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_workmem(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2217,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2230,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2522,6 +2591,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_workmem(nbytes);
 }
 
 /*
@@ -2592,6 +2662,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_workmem(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2867,6 +2940,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSorts > 0)
+	{
+		path->workmem += (double) aggcosts->numSorts *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3101,7 +3187,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3183,6 +3269,11 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_workmem(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3337,6 +3428,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3800,6 +3892,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4171,6 +4271,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4229,7 +4330,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4260,6 +4362,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4268,8 +4371,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4286,6 +4389,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4512,6 +4616,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4534,6 +4639,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4543,6 +4651,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Estimate working memory needed for the hashtable (and hashnulls, if
+		 * needed). The logic below MUST match the logic in buildSubPlanHash()
+		 * and ExecInitSubPlan().
+		 */
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem =
+			normalize_workmem((double) nbuckets * hashentrysize);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem =
+				normalize_workmem((double) nbuckets * hashentrysize);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
@@ -6426,7 +6565,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6605,3 +6744,197 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for aggregate's input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they need to sort their
+ * input, before they can process it. We need one sort buffer per such
+ * aggregate, and this function returns that sort buffer's (estimated) size (in
+ * KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	/* Account for size of one buffer needed to sort the input. */
+	return normalize_workmem(input_tuples * input_width);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_workmem(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_workmem(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total work_mem needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_workmem
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes to a minimum of 64 (KB), rounding up to the nearest whole KB.
+ */
+int
+normalize_workmem(double nbytes)
+{
+	double		workmem;
+
+	/*
+	 * We'll assign working-memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nbytes / 1024.0);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..973b86371ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -1853,6 +1854,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 groupCollations,
 								 NIL,
 								 NIL,
+								 0, /* numSorts */
 								 best_path->path.rows,
 								 0,
 								 subplan);
@@ -1911,6 +1913,15 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 	/* Copy cost data from Path to Plan */
 	copy_generic_path_info(plan, &best_path->path);
 
+	if (IsA(plan, Unique))
+	{
+		/*
+		 * We assigned "workmem" to the Sort subplan. Clear it from the top-
+		 * level Unique node, to avoid double-counting.
+		 */
+		plan->workmem = 0;
+	}
+
 	return plan;
 }
 
@@ -2228,6 +2239,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	/*
+	 * IncrementalSort creates two sort buffers; the Path's "workmem"
+	 * estimate covers both combined. Split it evenly between them now.
+	 * (The value is already in KB, so don't re-normalize it.)
+	 */
+	plan->sort.plan.workmem =
+		Max(64, best_path->spath.path.workmem / 2);
+
 	return plan;
 }
 
@@ -2333,12 +2351,29 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 												subplan->targetlist),
 					NIL,
 					NIL,
+					best_path->numSorts,
 					best_path->numGroups,
 					best_path->transitionSpace,
 					subplan);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the overall workmem estimate that we copied from the Path
+	 * with finer-grained estimates.
+	 */
+	plan->plan.workmem =
+		compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+								   plan->transitionSpace, subplan->plan_rows,
+								   subplan->plan_width, false /* cost_sort */ );
+
+	/* Also include estimated memory needed to sort the input: */
+	if (plan->numSorts > 0)
+	{
+		plan->sortWorkMem = compute_agg_input_workmem(subplan->plan_rows,
+													  subplan->plan_width);
+	}
+
 	return plan;
 }
 
@@ -2457,8 +2492,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2480,19 +2516,20 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								best_path->numSorts,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,7 +2540,36 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
-			chain = lappend(chain, agg_plan);
+			/*
+			 * If we're an AGG_SORTED, but not the last, we need to cost
+			 * working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/*
+			 * Although this side node doesn't need accurate cost estimates,
+			 * it does need an accurate *memory* estimate, since we'll use
+			 * that estimate to distribute working memory to this side node,
+			 * at runtime.
+			 */
+
+			/* Estimated memory needed to hold the output: */
+			agg_plan->plan.workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width, cost_sort);
+
+			/* Also include estimated memory needed to sort the input: */
+			if (agg_plan->numSorts > 0)
+			{
+				agg_plan->sortWorkMem =
+					compute_agg_input_workmem(subplan->plan_rows,
+											  subplan->plan_width);
+			}
+
+			chain = lappend(chain, (Plan *) agg_plan);
 		}
 	}
 
@@ -2514,6 +2580,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2529,12 +2596,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 						extract_grouping_collations(rollup->groupClause, subplan->targetlist),
 						rollup->gsets,
 						chain,
+						best_path->numSorts,
 						rollup->numGroups,
 						best_path->transitionSpace,
 						subplan);
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * If we're an AGG_SORTED, but not the last, we need to cost working
+		 * memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates.
+		 */
+		plan->plan.workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy, plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows, subplan->plan_width,
+									   cost_sort);
+
+		/* Also include estimated memory needed to sort the input: */
+		if (plan->numSorts > 0)
+		{
+			plan->sortWorkMem =
+				compute_agg_input_workmem(subplan->plan_rows,
+										  subplan->plan_width);
+		}
 	}
 
 	return (Plan *) plan;
@@ -2783,6 +2875,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem =
+		normalize_workmem(
+						  Max(relation_byte_size(leftplan->plan_rows,
+												 leftplan->plan_width),
+							  relation_byte_size(rightplan->plan_rows,
+												 rightplan->plan_width)));
+
+	if (plan->numCols > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		entrysize;
+
+		entrysize = sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMem =
+			normalize_workmem(plan->numGroups * entrysize);
+	}
+
 	return plan;
 }
 
@@ -3223,6 +3347,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3309,6 +3434,12 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&scan_plan->scan.plan, &best_path->path);
 
+	/*
+	 * We assigned "workmem" to the "bitmapqualplan" subplan. Clear it from
+	 * the top-level BitmapHeapScan node, to avoid double-counting.
+	 */
+	scan_plan->scan.plan.workmem = 0;
+
 	return scan_plan;
 }
 
@@ -3334,9 +3465,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all of its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3362,6 +3508,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3373,8 +3521,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3409,6 +3556,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3437,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3484,8 +3632,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->workmem =
+			normalize_workmem(tbm_calculate_bytes(max_ancestor_rows));
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
@@ -3796,6 +3945,14 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem =
+		normalize_workmem(relation_byte_size(scan_plan->scan.plan.plan_rows,
+											 scan_plan->scan.plan.plan_width));
+
 	return scan_plan;
 }
 
@@ -4615,6 +4772,9 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
+		matplan->workmem =
+			normalize_workmem(relation_byte_size(matplan->plan_rows,
+												 matplan->plan_width));
 
 		inner_plan = matplan;
 	}
@@ -4961,6 +5121,10 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Display "workmem" on the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem = join_plan->join.plan.workmem;
+	join_plan->join.plan.workmem = 0;
+
 	return join_plan;
 }
 
@@ -5458,6 +5622,7 @@ copy_generic_path_info(Plan *dest, Path *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = (int) Min(src->workmem, (double) MAX_KILOBYTES);
 	dest->plan_rows = src->rows;
 	dest->plan_width = src->pathtarget->width;
 	dest->parallel_aware = src->parallel_aware;
@@ -5474,6 +5639,7 @@ copy_plan_costsize(Plan *dest, Plan *src)
 	dest->disabled_nodes = src->disabled_nodes;
 	dest->startup_cost = src->startup_cost;
 	dest->total_cost = src->total_cost;
+	dest->workmem = src->workmem;
 	dest->plan_rows = src->plan_rows;
 	dest->plan_width = src->plan_width;
 	/* Assume the inserted node is not parallel-aware. */
@@ -5509,6 +5675,7 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 			  limit_tuples);
 	plan->plan.startup_cost = sort_path.startup_cost;
 	plan->plan.total_cost = sort_path.total_cost;
+	plan->plan.workmem = (int) Min(sort_path.workmem, (double) MAX_KILOBYTES);
 	plan->plan.plan_rows = lefttree->plan_rows;
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
@@ -5540,6 +5707,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 						  limit_tuples);
 	plan->sort.plan.startup_cost = sort_path.startup_cost;
 	plan->sort.plan.total_cost = sort_path.total_cost;
+	plan->sort.plan.workmem = (int) Min(sort_path.workmem,
+										(double) MAX_KILOBYTES);
 	plan->sort.plan.plan_rows = lefttree->plan_rows;
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
@@ -6673,7 +6842,7 @@ Agg *
 make_agg(List *tlist, List *qual,
 		 AggStrategy aggstrategy, AggSplit aggsplit,
 		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-		 List *groupingSets, List *chain, double dNumGroups,
+		 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 		 Size transitionSpace, Plan *lefttree)
 {
 	Agg		   *node = makeNode(Agg);
@@ -6689,6 +6858,8 @@ make_agg(List *tlist, List *qual,
 	node->grpColIdx = grpColIdx;
 	node->grpOperators = grpOperators;
 	node->grpCollations = grpCollations;
+	node->numSorts = numSorts;
+	node->sortWorkMem = 0;		/* caller will fill this */
 	node->numGroups = numGroups;
 	node->transitionSpace = transitionSpace;
 	node->aggParams = NULL;		/* SS_finalize_plan() will fill this */
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..3eba364484d 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSorts;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..c533bfb9a58 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem at hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_workmem(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSorts = aggcosts ? aggcosts->numSorts : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSorts = agg_costs ? agg_costs->numSorts : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_workmem(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4e9e0d1de1..6cd9bffbee5 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -63,7 +63,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..2285544396d 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSorts;		/* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -1696,6 +1697,7 @@ typedef struct Path
 	int			disabled_nodes; /* count of disabled nodes */
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
@@ -2290,6 +2292,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSorts;		/* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2331,6 +2334,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSorts;		/* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3374,6 +3378,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 396f7881420..e2a7a12d2a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -167,6 +167,8 @@ typedef struct Plan
 	Cost		startup_cost;
 	/* total cost (assuming all tuples fetched) */
 	Cost		total_cost;
+	/* estimated working memory (in KB) */
+	int			workmem;
 
 	/* (runtime) working memory limit (in KB) */
 	int			workmem_limit;
@@ -432,6 +434,8 @@ typedef struct RecursiveUnion
 	/* estimated number of groups in input */
 	long		numGroups;
 
+	/* estimated work_mem for hash table (in KB) */
+	int			hashWorkMem;
 	/* work_mem reserved for hash table */
 	int			hashWorkMemLimit;
 } RecursiveUnion;
@@ -1155,7 +1159,8 @@ typedef struct Agg
 
 	/* number of inputs that require sorting */
 	int			numSorts;
-
+	/* estimated work_mem needed to sort each input (in KB) */
+	int			sortWorkMem;
 	/* work_mem limit to sort one input (in KB) */
 	int			sortWorkMemLimit;
 
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index b932168237c..5e2e804f455 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,9 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	/* Estimated working memory (in KB): */
+	int			hashtab_workmem;	/* estimate for hashtable */
+	int			hashnul_workmem;	/* estimate for hashnull */
 	/* (Runtime) working-memory limits (in KB): */
 	int			hashtab_workmem_limit;	/* limit for hashtable */
 	int			hashnul_workmem_limit;	/* limit for hashnulls */
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index e185635c10b..b5c98a39af7 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -108,6 +108,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..737c553a409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,17 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_workmem(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..cf3694a744f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
 extern Agg *make_agg(List *tlist, List *qual,
 					 AggStrategy aggstrategy, AggSplit aggsplit,
 					 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
-					 List *groupingSets, List *chain, double dNumGroups,
+					 List *groupingSets, List *chain, int numSorts, double dNumGroups,
 					 Size transitionSpace, Plan *lefttree);
 extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
 						 LimitOption limitOption, int uniqNumCols,
-- 
2.47.1

Attachment: 0004-Add-EXPLAIN-work_mem-on-command-option.patch (application/octet-stream)
From 3860b0435225518f927e51d6c25cf6e132148d04 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 26 Feb 2025 01:02:19 +0000
Subject: [PATCH 4/5] Add EXPLAIN (work_mem on) command option

So that users can see how much working memory a query is likely to use, as
well as how much memory it will be limited to, this commit adds an
EXPLAIN (work_mem on) command option that displays the "workmem" and
"workmem_limit" Plan fields, added in the previous two commits, as part
of the plan.
---
 src/backend/commands/explain.c        | 289 ++++++++++++
 src/backend/executor/nodeHash.c       |   7 +-
 src/backend/optimizer/path/costsize.c |   4 +-
 src/include/commands/explain.h        |   4 +
 src/include/executor/nodeHash.h       |   2 +-
 src/test/regress/expected/workmem.out | 653 ++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   2 +-
 src/test/regress/sql/workmem.sql      | 307 ++++++++++++
 8 files changed, 1262 insertions(+), 6 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c0d614866a9..2b893b4e50f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -18,6 +18,8 @@
 #include "commands/createas.h"
 #include "commands/defrem.h"
 #include "commands/prepare.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -25,6 +27,7 @@
 #include "nodes/extensible.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteHandler.h"
@@ -180,6 +183,8 @@ static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 static SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+static void compute_subplan_workmem(List *plans, double *workmem, double *limit);
+static void compute_agg_workmem(Agg *agg, double *workmem, double *limit);
 
 
 
@@ -235,6 +240,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -835,6 +842,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory", "kB",
+							 es->total_workmem, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1970,6 +1985,135 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_workmem = 0.0;
+		double		plan_limit = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_workmem, &plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_workmem, &plan_limit);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem((Agg *) plan, &plan_workmem, &plan_limit);
+				break;
+			case T_FunctionScan:
+				{
+					FunctionScan *fscan = (FunctionScan *) plan;
+
+					plan_workmem += (double) plan->workmem *
+						list_length(fscan->functions);
+					plan_limit += (double) plan->workmem_limit *
+						list_length(fscan->functions);
+					break;
+				}
+			case T_Hash:
+				{
+					Hash	   *hash = (Hash *) plan;
+					HashState  *hstate = (HashState *) planstate;
+					Plan	   *outerNode = outerPlan(plan);
+					double		rows;
+					size_t		nbytes;
+					size_t		total_space_allowed;	/* ignored */
+					int			nbuckets;	/* ignored */
+					int			nbatch;
+					int			num_skew_mcvs;	/* ignored */
+					int			workmem;	/* ignored */
+
+					/*
+					 * For Hash Joins, we currently don't count per-batch
+					 * metadata against the "workmem_limit", but we can at
+					 * least estimate it for display with the Plan.
+					 */
+					rows = plan->parallel_aware ? hash->rows_total :
+						outerNode->plan_rows;
+					nbytes = (size_t) plan->workmem_limit * 1024;
+
+					ExecChooseHashTableSize(rows, outerNode->plan_width,
+											OidIsValid(hash->skewTable),
+											hstate->parallel_state != NULL,
+											hstate->parallel_state != NULL ?
+											hstate->parallel_state->nparticipants - 1 : 0,
+											&nbytes,
+											&total_space_allowed,
+											&nbuckets, &nbatch, &num_skew_mcvs,
+											&workmem);
+
+					/*
+					 * Include space for per-batch metadata, if any: 2 blocks
+					 * per batch.
+					 */
+					if (nbatch > 1)
+						nbytes += nbatch * 2 * BLCKSZ;
+
+					plan_workmem += plan->workmem;
+					plan_limit += (double) normalize_workmem(nbytes);
+					break;
+				}
+			case T_IncrementalSort:
+
+				/*
+				 * IncrementalSort creates two Tuplestores, each of
+				 * (estimated) size workmem.
+				 */
+				plan_workmem += (double) plan->workmem * 2;
+				plan_limit += (double) plan->workmem_limit * 2;
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					/*
+					 * RecursiveUnion creates two Tuplestores, each of
+					 * (estimated) size workmem, plus (possibly) a hash table
+					 * of size hashWorkMem.
+					 */
+					plan_workmem += (double) plan->workmem * 2 +
+						runion->hashWorkMem;
+					plan_limit += (double) plan->workmem_limit * 2 +
+						runion->hashWorkMemLimit;
+					break;
+				}
+			default:
+				if (plan->workmem > 0)
+				{
+					plan_workmem += plan->workmem;
+					plan_limit += plan->workmem_limit;
+				}
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_workmem *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
+
+		es->total_workmem += plan_workmem;
+		es->total_workmem_limit += plan_limit;
+
+		if (plan_workmem > 0.0 || plan_limit > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "  (work_mem=%.0f kB limit=%.0f kB)",
+								 plan_workmem, plan_limit);
+			else
+			{
+				ExplainPropertyFloat("Working Memory", "kB",
+									 plan_workmem, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2536,6 +2680,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlan-s, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = ((Gather *) plan)->num_workers;
+		else
+			es->num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2592,6 +2750,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -5952,3 +6116,128 @@ GetSerializationMetrics(DestReceiver *dest)
 
 	return empty;
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a list of SubPlans
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments the given work_mem counters to include each SubPlan's working
+ * memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *workmem, double *limit)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem > 0)
+		{
+			*workmem += sp->hashtab_workmem;
+			*limit += sp->hashtab_workmem_limit;
+		}
+
+		if (sp->hashnul_workmem > 0)
+		{
+			*workmem += sp->hashnul_workmem;
+			*limit += sp->hashnul_workmem_limit;
+		}
+	}
+}
+
+/* Accumulator for an Agg node's working-memory estimate and limit. */
+typedef struct AggWorkMem
+{
+	double		input_sort_workmem;
+	double		input_sort_limit;
+
+	double		output_hash_workmem;
+	double		output_hash_limit;
+
+	int			num_sort_nodes;
+
+	double		max_output_sort_workmem;
+	double		output_sort_limit;
+}			AggWorkMem;
+
+static void
+compute_agg_workmem_node(Agg *agg, AggWorkMem * mem)
+{
+	/* Record memory used for input sort buffers. */
+	mem->input_sort_workmem += (double) agg->numSorts * agg->sortWorkMem;
+	mem->input_sort_limit += (double) agg->numSorts * agg->sortWorkMemLimit;
+
+	/* Record memory used for output data structures. */
+	switch (agg->aggstrategy)
+	{
+		case AGG_SORTED:
+
+			/* We'll have at most two sort buffers alive, at any time. */
+			mem->max_output_sort_workmem =
+				Max(mem->max_output_sort_workmem, agg->plan.workmem);
+
+			if (mem->output_sort_limit == 0)
+				mem->output_sort_limit = agg->plan.workmem_limit;
+
+			++mem->num_sort_nodes;
+			break;
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			/*
+			 * All hash tables created by "hash" phases are kept for the
+			 * lifetime of the Agg.
+			 */
+			mem->output_hash_workmem += agg->plan.workmem;
+			mem->output_hash_limit += agg->plan.workmem_limit;
+			break;
+		default:
+
+			/*
+			 * "Plain" phases don't use working memory (they output a single
+			 * aggregated tuple).
+			 */
+			break;
+	}
+}
+
+/*
+ * compute_agg_workmem - compute total workmem for an Agg node
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node. However, to make life
+ * easier for the executor, we store the estimated working memory ("workmem")
+ * on each individual Agg node.
+ *
+ * This function computes the combined workmem and limit, so that we can
+ * display these values on the main Agg node.
+ */
+static void
+compute_agg_workmem(Agg *agg, double *workmem, double *limit)
+{
+	AggWorkMem	mem;
+	ListCell   *lc;
+
+	memset(&mem, 0, sizeof(mem));
+
+	compute_agg_workmem_node(agg, &mem);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach(lc, agg->chain)
+	{
+		Agg		   *aggnode = (Agg *) lfirst(lc);
+
+		compute_agg_workmem_node(aggnode, &mem);
+	}
+
+	*workmem = mem.input_sort_workmem + mem.output_hash_workmem;
+	*limit = mem.input_sort_limit + mem.output_hash_limit;
+
+	/* We'll have at most two sort buffers alive at any time. */
+	*workmem += mem.num_sort_nodes > 1 ?
+		mem.max_output_sort_workmem * 2.0 :
+		mem.max_output_sort_workmem;
+	*limit += mem.num_sort_nodes > 1 ?
+		mem.output_sort_limit * 2.0 :
+		mem.output_sort_limit;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3f60f6305bd..ee867712732 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -483,7 +483,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
-							worker_space_allowed,
+							&worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
@@ -667,7 +667,7 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t worker_space_allowed,
+						size_t *worker_space_allowed,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
@@ -700,7 +700,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = worker_space_allowed;
+	hash_table_bytes = *worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -945,6 +945,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
+		*worker_space_allowed = (*worker_space_allowed) * 2;
 		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b455721fcb7..6062e402bce 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4276,6 +4276,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			numbuckets;
 	int			numbatches;
 	int			num_skew_mcvs;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;	/* unused */
 
 	/* Count up disabled nodes. */
@@ -4321,12 +4322,13 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	 * XXX at some point it might be interesting to try to account for skew
 	 * optimization in the cost estimate, but for now, we don't.
 	 */
+	worker_space_allowed = get_hash_memory_limit();
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
-							get_hash_memory_limit(),
+							&worker_space_allowed,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 570e7cad1fa..498a1a3a4b6 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,9 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes planned to use */
+	double		total_workmem;	/* total working memory estimate (in bytes) */
+	double		total_workmem_limit;	/* total working-memory limit (in kB) */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 6cd9bffbee5..b346a270b67 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t worker_space_allowed,
+									size_t *worker_space_allowed,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..25e1dbb315b
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,653 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB limit=4096 kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(12 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+                    workmem_filter                     
+-------------------------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB limit=8192 kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB limit=4096 kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(9 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB limit=8192 kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(14 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB limit=4096 kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB limit=4096 kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(15 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=4096 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 20480 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                          workmem_filter                           
+-------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB limit=8192 kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 49152 kB
+(9 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+            QUERY PLAN            
+----------------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory: 0 kB
+ Total Working Memory Limit: 0 kB
+(4 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                             
+------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=12288 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                             workmem_filter                             
+------------------------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on workmem_xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(5 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+drop table workmem_xmldata;
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory: 0 kB
+ Total Working Memory Limit: 0 kB
+(6 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB limit=8192 kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(6 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB limit=8192 kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB limit=4096 kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB limit=16384 kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 28672 kB
+(15 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB limit=4096 kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(12 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=4096 kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB limit=4096 kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB limit=4096 kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB limit=4096 kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 20480 kB
+(20 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter             
+----------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=8192 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 24576 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..d1cec9eb051
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,307 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+drop table workmem_xmldata;
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

Attachment: 0005-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream)
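Before the patch body: the tests in patches 4 and 5 exercise the feature entirely through SQL. As a quick orientation, a session against a server built with these patches might look like the sketch below. The GUC name (`workmem.query_work_mem`), the `load 'workmem'` step, and the `work_mem` EXPLAIN option are taken directly from the patches; the query itself is illustrative, and exact plan output will vary.

```sql
-- Load the contrib module from patch 5, which redistributes the
-- query-wide budget across the plan's individual nodes.
load 'workmem';

-- Cap the whole query's working memory at 4 MB (value is in kB);
-- the default in the regression tests is 100 MB.
set workmem.query_work_mem = 4096;

-- The new EXPLAIN option annotates each memory-consuming node with
-- its per-node limit, and appends query-wide totals such as
-- "Total Working Memory Limit: 4096 kB".
explain (costs off, work_mem on)
select * from generate_series(1, 2000) g(n) order by n;
```

As the suite-3 expected output shows, setting the budget too low (e.g. 80 kB) does not fail the query; it emits "WARNING: not enough working memory for query" and clamps the per-node limits instead.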
From 11f6858be36b6715ebf0f7f7bceb129f6cfcc3af Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Fri, 21 Feb 2025 00:41:31 +0000
Subject: [PATCH 5/5] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 655 ++++++++++++++++++++++++++
 src/backend/executor/execWorkmem.c   |  37 +-
 src/include/executor/executor.h      |   4 +
 8 files changed, 1717 insertions(+), 10 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..a2c6d3be4d2
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=25600 kB)
+   ->  Sort  (work_mem=N kB limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20480 kB)
+   ->  Sort  (work_mem=N kB limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=102400 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=102399 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                    
+--------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=34133 kB)
+         ->  Sort  (work_mem=N kB limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=34134 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=1024 kB)
+   ->  Sort  (work_mem=N kB limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=819 kB)
+   ->  Sort  (work_mem=N kB limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=820 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=4095 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=1365 kB)
+         ->  Sort  (work_mem=N kB limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=1366 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20 kB)
+   ->  Sort  (work_mem=N kB limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=16 kB)
+   ->  Sort  (work_mem=N kB limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=80 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=78 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                  workmem_filter                                   
+-----------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=26 kB)
+         ->  Sort  (work_mem=N kB limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=27 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+           workmem_filter            
+-------------------------------------
+ Result  (work_mem=N kB limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..e6dbc35bf10
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..b512c1f9f8c
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,655 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to provide each "
+							"query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+/* Compute an Agg's working memory estimate and limit. */
+typedef struct AggWorkMem
+{
+	uint64		hash_workmem;
+	int		   *hash_limit;
+
+	int			num_sorts;
+	int			max_sort_workmem;
+	int		   *sort_limit;
+}			AggWorkMem;
+
+static void
+workmem_analyze_agg_node(Agg *agg, AggWorkMem * mem,
+						 WorkMemStats * workmem_stats)
+{
+	if (agg->sortWorkMem > 0 || agg->sortWorkMemLimit > 0)
+	{
+		/* Record memory used for input sort buffers. */
+		Target	   *target = make_target(agg->sortWorkMem,
+										 &agg->sortWorkMemLimit,
+										 agg->numSorts);
+
+		add_target(workmem_stats, target);
+	}
+
+	switch (agg->aggstrategy)
+	{
+		case AGG_HASHED:
+		case AGG_MIXED:
+
+			mem->hash_workmem += agg->plan.workmem;
+
+			/* Read hash limit from the first AGG_HASHED node. */
+			if (mem->hash_limit == NULL)
+				mem->hash_limit = &agg->plan.workmem_limit;
+
+			break;
+		case AGG_SORTED:
+
+			++mem->num_sorts;
+
+			mem->max_sort_workmem = Max(mem->max_sort_workmem, agg->plan.workmem);
+
+			/* Read sort limit from the first AGG_SORTED node. */
+			if (mem->sort_limit == NULL)
+				mem->sort_limit = &agg->plan.workmem_limit;
+
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze_agg(Agg *agg, int num_workers, WorkMemStats * workmem_stats)
+{
+	AggWorkMem	mem;
+
+	memset(&mem, 0, sizeof(mem));
+
+	/* Analyze main Agg node. */
+	workmem_analyze_agg_node(agg, &mem, workmem_stats);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		workmem_analyze_agg_node(aggnode, &mem, workmem_stats);
+
+	/*
+	 * Working memory for hash tables, if needed. All hash tables share the
+	 * same limit:
+	 */
+	if (mem.hash_workmem > 0 || mem.hash_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.hash_workmem, mem.hash_limit,
+						1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	/*
+	 * Working memory for (output) sort buffers, if needed. We'll need at most
+	 * 2 sort buffers:
+	 */
+	if (mem.max_sort_workmem > 0 || mem.sort_limit != NULL)
+	{
+		Target	   *target =
+			make_target(mem.max_sort_workmem, mem.sort_limit,
+						Min(mem.num_sorts, 2) * (1 + num_workers));
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_subplan(SubPlan *subplan, int num_workers,
+						WorkMemStats * workmem_stats)
+{
+	if (subplan->hashtab_workmem > 0 || subplan->hashtab_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash table */
+		Target	   *target = make_target(subplan->hashtab_workmem,
+										 &subplan->hashtab_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+
+	if (subplan->hashnul_workmem > 0 || subplan->hashnul_workmem_limit > 0)
+	{
+		/* working memory for SubPlan's hash-NULL table */
+		Target	   *target = make_target(subplan->hashnul_workmem,
+										 &subplan->hashnul_workmem_limit,
+										 1 + num_workers);
+
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_analyze_plan(Plan *plan, int num_workers, WorkMemStats * workmem_stats)
+{
+	/* Make sure there's enough stack available. */
+	check_stack_depth();
+
+	/* Analyze this node's SubPlans. */
+	foreach_node(SubPlan, subplan, plan->initPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	if (IsA(plan, Gather) || IsA(plan, GatherMerge))
+	{
+		/*
+		 * Parallel query apparently does not run InitPlans in parallel. Well,
+		 * currently, Gather and GatherMerge Plan nodes don't contain any
+		 * quals, so they can't contain SubPlans at all; so maybe we should
+		 * move this below the SubPlan-analysis loop, as well? For now, to
+		 * maintain consistency with explain.c, we'll just leave this here.
+		 */
+		Assert(num_workers == 0);
+
+		if (IsA(plan, Gather))
+			num_workers = ((Gather *) plan)->num_workers;
+		else
+			num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
+	foreach_node(SubPlan, subplan, plan->subPlan)
+		workmem_analyze_subplan(subplan, num_workers, workmem_stats);
+
+	/* Analyze this node's working memory. */
+	switch (nodeTag(plan))
+	{
+		case T_BitmapIndexScan:
+		case T_CteScan:
+		case T_Material:
+		case T_Sort:
+		case T_TableFuncScan:
+		case T_WindowAgg:
+		case T_Hash:
+		case T_Memoize:
+		case T_SetOp:
+			if (plan->workmem > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 1 + num_workers);
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_Agg:
+			workmem_analyze_agg((Agg *) plan, num_workers, workmem_stats);
+			break;
+		case T_FunctionScan:
+			if (plan->workmem > 0)
+			{
+				int			nfuncs =
+					list_length(((FunctionScan *) plan)->functions);
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 nfuncs * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_IncrementalSort:
+			if (plan->workmem > 0)
+			{
+				Target	   *target = make_target(plan->workmem,
+												 &plan->workmem_limit,
+												 2 * (1 + num_workers));
+
+				add_target(workmem_stats, target);
+			}
+			break;
+		case T_RecursiveUnion:
+			{
+				RecursiveUnion *runion = (RecursiveUnion *) plan;
+				Target	   *target;
+
+				/* working memory for two tuplestores */
+				target = make_target(plan->workmem, &plan->workmem_limit,
+									 2 * (1 + num_workers));
+				add_target(workmem_stats, target);
+
+				/* working memory for a hash table, if needed */
+				if (runion->hashWorkMem > 0)
+				{
+					target = make_target(runion->hashWorkMem,
+										 &runion->hashWorkMem,
+										 1 + num_workers);
+					add_target(workmem_stats, target);
+				}
+			}
+			break;
+		default:
+			Assert(plan->workmem == 0);
+			Assert(plan->workmem_limit == 0);
+			break;
+	}
+
+	/* Now analyze this Plan's children. */
+	if (outerPlan(plan))
+		workmem_analyze_plan(outerPlan(plan), num_workers, workmem_stats);
+
+	if (innerPlan(plan))
+		workmem_analyze_plan(innerPlan(plan), num_workers, workmem_stats);
+
+	switch (nodeTag(plan))
+	{
+		case T_Append:
+			foreach_ptr(Plan, child, ((Append *) plan)->appendplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_MergeAppend:
+			foreach_ptr(Plan, child, ((MergeAppend *) plan)->mergeplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapAnd:
+			foreach_ptr(Plan, child, ((BitmapAnd *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_BitmapOr:
+			foreach_ptr(Plan, child, ((BitmapOr *) plan)->bitmapplans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		case T_SubqueryScan:
+			workmem_analyze_plan(((SubqueryScan *) plan)->subplan,
+								 num_workers, workmem_stats);
+			break;
+		case T_CustomScan:
+			foreach_ptr(Plan, child, ((CustomScan *) plan)->custom_plans)
+				workmem_analyze_plan(child, num_workers, workmem_stats);
+			break;
+		default:
+			break;
+	}
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	/* Analyze the Plans referred to by SubPlan objects. */
+	foreach_ptr(Plan, plan, plannedstmt->subplans)
+	{
+		if (plan)
+			workmem_analyze_plan(plan, 0 /* num_workers */ , workmem_stats);
+	}
+
+	/* Analyze the main Plan tree itself. */
+	workmem_analyze_plan(plannedstmt->planTree, 0 /* num_workers */ ,
+						 workmem_stats);
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit < target->priority -- that is, we restricted
+			 * this target to less memory than (we estimate) it needs -- then
+			 * the target_ratio will stay the same, since, letting A =
+			 * remaining, B = limit, and R = ratio, we'll have:
+			 *
+			 * R=A/B <=> A=R*B <=> A-R*X = R*B - R*X <=> A-R*X = R * (B-X) <=>
+			 * R = (A-R*X) / (B-X)
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if target_limit = target->priority -- that is, we didn't
+			 * need to restrict this target beyond its priority estimate --
+			 * then the target_ratio will increase. This means more memory for
+			 * the remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	/*
+	 * Start by assigning default working memory to all of this query's Plan
+	 * nodes.
+	 */
+	standard_ExecAssignWorkMem(plannedstmt);
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) a->ptr_value)->workmem -
+			((Target *) b->ptr_value)->workmem;
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) b->ptr_value)->workmem -
+			((Target *) a->ptr_value)->workmem;
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index 5ec176d1355..f4e9557f015 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -57,6 +57,9 @@
 #include "optimizer/cost.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
 /* decls for local routines only used within this module */
 static void assign_workmem_subplan(SubPlan *subplan);
 static void assign_workmem_plan(Plan *plan);
@@ -81,16 +84,32 @@ static void assign_workmem_agg_node(Agg *agg, bool is_first, bool is_last,
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We bail out here, in case the hook wants to re-assign memory on
+		 * parallel workers, and maybe wants to call
+		 * standard_ExecAssignWorkMem() first, as well.
+		 */
+		if (IsParallelWorker())
+			return;
 
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
 	/* Assign working memory to the Plans referred to by SubPlan objects. */
 	foreach_ptr(Plan, plan, plannedstmt->subplans)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

#21Jeff Davis
pgsql@j-davis.com
In reply to: James Hunter (#20)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Wed, 2025-02-26 at 13:27 -0800, James Hunter wrote:

Attaching a new refactoring, which splits the code changes into
patches by functionality. This refactoring yields 5 patches, each of
which is relatively localized. I hope that the result will be more
focused and more feasible to review.

Thank you, yes, this is helpful.

Taking a step back:

My idea was that we'd attach work_mem to each Path node and Plan node
at create time. For example, in create_agg_path() it could do:

pathnode->path.node_work_mem = work_mem;

And then add to copy_generic_path_info():

dest->node_work_mem = src->node_work_mem;

(and similarly for other nodes; at least those that care about
work_mem)

Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
work_mem.

Similarly, we could track the node_mem_wanted, which would be the
estimated amount of memory the node would use if unlimited memory were
available. I believe that's already known (or a simple calculation) at
costing time, and we can propagate it from the Path to the Plan the
same way.

(A variant of this approach could carry the values into the PlanState
nodes as well, and the executor would use that value instead.)

Extensions like yours could have a GUC like ext.query_work_mem and use
planner_hook to modify plan after standard_planner is done, walking the
tree and adjusting each Plan node's node_work_mem to obey
ext.query_work_mem. Another extension might hook in at path generation
time with set_rel_pathlist_hook or set_join_pathlist_hook to create
paths with lower node_work_mem settings that total up to
ext.query_work_mem. Either way, the node_work_mem settings would end up
being enforced by the executor by tracking memory usage and spilling
when it exceeds node->ps.ss.plan.node_work_mem.

Your patch 0001 looks like it makes it easier to find all the relevant
Plan nodes, so that seems like a reasonable step (didn't look at the
details).

But your patch 0002 and 0003 are entirely different from what I
expected. I just don't understand why we need anything as complex or
specific as ExecAssignWorkMem(). If we just add it at the time the Path
is created, and then propagate it to the plan with
copy_generic_path_info(), that would be a lot less code. What am I
missing?

Regards,
Jeff Davis

#22James Hunter
james.hunter.pg@gmail.com
In reply to: Jeff Davis (#21)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Wed, Feb 26, 2025 at 4:09 PM Jeff Davis <pgsql@j-davis.com> wrote:

My idea was that we'd attach work_mem to each Path node and Plan node
at create time. For example, in create_agg_path() it could do:

pathnode->path.node_work_mem = work_mem;

And then add to copy_generic_path_info():

dest->node_work_mem = src->node_work_mem;

(and similarly for other nodes; at least those that care about
work_mem)

Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
work_mem.

This is essentially what patches 2 and 3 do. Some comments:

First, there's no need to set the workmem_limit at Path time, since
it's not needed until the Plan is init-ted into a PlanState. So I set
it on the Plan but not on the Path.

Second, the logic to assign a workmem_limit to the Agg node is a bit
more complicated than in your example, because the Agg could be either
a Hash or a Sort. If it's a Hash, it gets work_mem *
hash_mem_multiplier; and if it's a Sort, it gets either 0 or work_mem.

We can adjust the logic so that it gets work_mem instead of 0, by
pushing the complexity out of the original workmem_limit assignment
and into later code blocks -- e.g., in an extension -- but we still
need to decide whether the Agg is a Hash or a Sort. This is why Patch
2 does:

switch (agg->aggstrategy)
{
case AGG_HASHED:
case AGG_MIXED:

/*
* Because nodeAgg.c will combine all AGG_HASHED nodes into a
* single phase, it's easier to store the hash working-memory
* limit on the first AGG_{HASHED,MIXED} node, and set it to zero
* for all subsequent AGG_HASHED nodes.
*/
agg->plan.workmem_limit = is_first ?
get_hash_memory_limit() / 1024 : 0;
break;
case AGG_SORTED:

/*
* Also store the sort-output working-memory limit on the first
* AGG_SORTED node, and set it to zero for all subsequent
* AGG_SORTED nodes.
*
* We'll need working-memory to hold the "sort_out" only if this
* isn't the last Agg node (in which case there's no one to sort
* our output).
*/
agg->plan.workmem_limit = *is_first_sort && !is_last ?
work_mem : 0;

*is_first_sort = false;
break;
default:
break;
}

Notice that the logic also sets the limit to 0 on certain Agg nodes --
this can be avoided, at the cost of added complexity later. The added
complexity arises because, for example, all hash Aggs share the same
overall workmem_limit. So any workmem_limit set on subsequent hash Agg
nodes would be ignored, which means that setting such a limit adds
complexity by obscuring the underlying logic.

Similarly, we could track the node_mem_wanted, which would be the
estimated amount of memory the node would use if unlimited memory were
available. I believe that's already known (or a simple calculation) at
costing time, and we can propagate it from the Path to the Plan the
same way.

And this is exactly what Patch 3 does. As you point out, the estimate
is already known or, if not, is a simple calculation.

(A variant of this approach could carry the values into the PlanState
nodes as well, and the executor would use that value instead.)

That's not needed, though, and would violate existing PG conventions:
we don't copy anything from Plan to PlanState, because the assumption
is that the PlanState always has access to its corresponding Plan.
(The reason we copy from Path to Plan, I believe, is that we drop all
Paths, to save memory; because we generally have many more Paths than
Plans.)

Extensions like yours could have a GUC like ext.query_work_mem and use
planner_hook to modify plan after standard_planner is done, walking the
tree and adjusting each Plan node's node_work_mem to obey
ext.query_work_mem. Another extension might hook in at path generation
time with set_rel_pathlist_hook or set_join_pathlist_hook to create
paths with lower node_work_mem settings that total up to
ext.query_work_mem.

I don't set the workmem_limit at Path time, because it's not needed
there; but I do set the workmem (estimate) at Path time, exactly so
that future optimizer hooks can make use of a Path's workmem
(estimate) to decide between different Paths.

Patch 3 sets workmem (estimate) on the Path and copies it to the Plan.
As you know, there's deliberately not a 1-1 correspondence between
Path and Plan (the way there is generally a 1-1 correspondence between
Plan and PlanState), so Patch 3 has to do some additional work to
propagate the workmem (estimate) from Path to Plan. You can see
existing examples of similar work inside file createplan.c. Creating a
Plan from a Path is not generally as simple as just copying the Path's
fields over; there are lots of special cases.

Although Patch 3 sets workmem (estimate) on the Plan, inside
createplan.c, Patch 2 doesn't set workmem_limit inside createplan.c.
An earlier draft of the patchset *did* set it there, but because of
all the special casing in createplan.c, this ended up becoming
difficult to understand and maintain.

Either way, the node_work_mem settings would end up
being enforced by the executor by tracking memory usage and spilling
when it exceeds node->ps.ss.plan.node_work_mem.

More precisely: the settings are enforced by the executor, by having
each PlanState's ExecInitNode() override refer to the
Plan.workmem_limit field, rather than the corresponding GUC(s). This
means that the final workmem_limit needs to be set before
ExecInitNode() is called.

But your patch 0002 and 0003 are entirely different from what I
expected. I just don't understand why we need anything as complex or
specific as ExecAssignWorkMem(). If we just add it at the time the Path
is created, and then propagate it to the plan with
copy_generic_path_info(), that would be a lot less code. What am I
missing?

Patches 2 and 3 are as you described above. I have been trying to
understand what you mean by "a lot less code," and I think two things
about these patches stand out to you:

1. Patch 2 performs its own Plan tree traversal, in
ExecAssignWorkMem(), instead of relying on the existing traversal in
function create_plan(). I outlined the reasons for this decision
above. Because of the point immediately below this, embedding Patch
2's logic into create_plan() ended up making the code much harder to
follow, so I broke out the traversal into its own (very simple)
ExecAssignWorkMem() function.

2. Patches 2 and 3 necessarily contain logic for various special
cases, where the workmem (estimate) and workmem_limit are not as
simple as in your example above. (But your example is actually not as
simple as you make it out to be, as discussed above and below...)

To understand (2), it helps to have a sense for how, for example, file
createplan.c has already been extended to handle special cases. We
already have functions like copy_plan_costsize(),
make_unique_from_sortclauses(), make_sort_from_groupcols(), etc. Just
from the function names, it's clear that we routinely generate new
Plan nodes that don't have corresponding Paths. Existing function
create_groupingsets_plan() is 150 Lines of Code, because it turns a
single GroupingSetsPath into a chain of Agg nodes. And, of course,
there's the whole logic around turning AlternativeSubPlans into
SubPlans, inside file setrefs.c!

So, generally, Patches 2 and 3 do exactly what you expect, but they
handle (existing) special cases, which ends up requiring more code. If
PostgreSQL didn't have these special cases, I think the patches would
be as short as you expect. For example, if Agg nodes behaved as in
your example, quoted at the top of this email, then we wouldn't need
Patch 2's additional logic to assign workmem_limit to Aggs (and we
wouldn't need the corresponding logic in Patch 3, to assign workmem
(estimate) to Aggs, either).

But Aggs aren't as simple as in your example -- they have Hash limits
and Sort limits; they have a side-chain of Agg nodes; they have input
sets they need to Sort; etc. And so we need a couple dozen lines of
code to handle them.

Thanks for the feedback,
James Hunter

#23James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#22)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

Attaching a new revision, which substantially reworks the previous revision --

For the previous revision, I ran into problems (exposed by CI tests)
when trying to get my "subPlan" list to work, because that approach
means we have two pointers into a single SubPlan, which breaks both
serialization and copyObject().

This led to a new approach. The former Patch 1 is no longer needed,
because that "subPlan" logic never worked anyway.

Now, I store the workmem info, in Lists, first on the PlannerGlobal,
then transferred to the PlannedStmt. Every [Sub]Plan that needs
working memory now gets a "workmem_id" index into these Lists. Since
it's just an index, it survives serialization and copyObject().

So the workmem info can now be successfully round-tripped. It also
makes it easier (and faster) for an extension to adjust workmem limits
for an entire query, since all of the query's workmem info is
available directly from the PlannedStmt -- without requiring us to
traverse the Plan + Expr trees. (My example hook/extension dropped by
a couple hundred LoC, since the previous revision, because now it can
just loop over a List, instead of needing to walk a Plan tree.)

So, now we have:

- Patch 1: adds a workmem limit to the PlannerGlobal, inside
createplan.c, and stores the corresponding workmem_id on the Plan or
SubPlan. The List is copied from the PlannerGlobal to the PlannedStmt,
as normal. We trivially set the workmem limit inside
ExecAssignWorkMem(), called from InitPlan.

This patch is a no-op, since it just copies existing GUC values to the
workmem limit, and then applies that limit inside ExecInitNode().

- Patch 2: copies the planner's workmem estimate to the PlannerGlobal
/ PlannedStmt, to allow an extension to set the workmem limit
intelligently (without needing to traverse to the Plan or SubPlan).

This patch is a no-op, since it just records an estimate on the
PlannerGlobal / PlannedStmt, but doesn't do anything with it (yet).

- Patch 3: displays the workmem info we set in Patches 1 and 2, to a
new EXPLAIN (work_mem on) option. Also adds a unit test.

- Patch 4: adds a hook and extension that show how to override the
default workmem limits, to implement a query_work_mem GUC.

I think this version is pretty close to a finished design proposal:

* top-level list(s) of workmem info;
* Plans and SubPlans that need workmem "registering" themselves
during createplan.c;
* exec nodes reading their workmem limits from the PlannedStmt, via
plan->workmem_id (or variants, in cases where a [Sub]Plan has multiple
data structures of *different* sizes);
* InitPlan() calls a function or hook to fill in the actual workmem limits;
* Workmem info copied / serialized to PQ workers, and stored in Plan
cache (but the limit is always overwritten inside InitPlan()); and
* Hook / extension reads the workmem info and sets a sensible limit,
based on its own heuristic.

Patch 4 shows that we can pretty easily (400 lines, including
comments) propagate a per-query workmem limit to individual
[Sub]Plans' data structures, in a reasonable way.

Compared to the previous revision, this patch set:
- eliminates the Plan traversal in execWorkMem.c and workmem.c;
- removes the "SubPlan" logic from setrefs.c, leaving setrefs unchanged; and
- sets the estimate and reserves a slot for the limit, inside createplan.c.

So, now, the logic to assign workmem limits is just a for- loop in
execWorkMem.c; and it's just 2 for- loops + 1 sort, in the workmem
extension.
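As a rough standalone sketch of what those "2 for- loops + 1 sort" amount to (WorkMemEntry and assign_workmem_limits are invented names for illustration, not the extension's actual code, and the heuristic shown is just one plausible choice): sort the entries by estimate, then grant each one the smaller of its estimate and an equal share of the budget that remains.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for one slot of the PlannedStmt's workmem Lists. */
typedef struct
{
	int			workmem_id;		/* 1-based index into the workmem Lists */
	int			estimate_kb;	/* planner's workmem estimate */
	int			limit_kb;		/* assigned workmem limit (output) */
} WorkMemEntry;

static int
cmp_by_estimate(const void *a, const void *b)
{
	return ((const WorkMemEntry *) a)->estimate_kb -
		((const WorkMemEntry *) b)->estimate_kb;
}

/*
 * Sort entries by estimate, smallest first, then grant each the lesser of
 * its estimate and an equal share of the budget that remains.  Small
 * operators get everything they asked for; big ones split the rest.
 */
static void
assign_workmem_limits(WorkMemEntry *entries, int n, int query_work_mem_kb)
{
	int			remaining = query_work_mem_kb;

	qsort(entries, n, sizeof(WorkMemEntry), cmp_by_estimate);

	for (int i = 0; i < n; i++)
	{
		int			share = remaining / (n - i);
		int			grant = entries[i].estimate_kb < share ?
			entries[i].estimate_kb : share;

		entries[i].limit_kb = grant;
		remaining -= grant;
	}
}
```

With, say, a 1 MB budget and estimates of 100 KB, 200 KB, and 5000 KB, the two small operators get their full estimates and the large one gets the 724 KB left over.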

Questions, comments?

Thanks,
James

Attachments:

0001-Store-working-memory-limit-per-Plan-SubPlan-rather-t.patch
From 058456f2a25f0b030d78b4fcdbb0df9238ad7f12 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 25 Feb 2025 22:44:01 +0000
Subject: [PATCH 1/4] Store working memory limit per Plan/SubPlan, rather than
 in GUC

This commit moves the working-memory limit that an executor node checks, at
runtime, from the "work_mem" and "hash_mem_multiplier" GUCs, to a new
list, "workMemLimits", added to the PlannedStmt node. At runtime, an exec
node checks its limit by looking up the list element corresponding to its
plan->workmem_id field.

Indirecting the workMemLimit via a List index allows us to handle SubPlans
as well as Plans. It also allows a future extension to set limits on
individual Plans/SubPlans, without needing to re-traverse the Plan +
Expr tree.

To preserve backward compatibility, this commit also copies the
"work_mem", etc., values from the existing GUCs to the new field. This
means that this commit is just a refactoring, and doesn't change any
behavior.

This "workmem_id" field is on the Plan node, instead of the corresponding
PlanState, because the workMemLimit needs to be set before we can call
ExecInitNode().
---
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execParallel.c        |   2 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         |  87 ++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  64 ++++++---
 src/backend/executor/nodeBitmapIndexscan.c |   2 +-
 src/backend/executor/nodeBitmapOr.c        |   2 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  22 +++-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  14 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |  16 +++
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  15 ++-
 src/backend/optimizer/plan/createplan.c    | 146 ++++++++++++++++++---
 src/backend/optimizer/plan/planner.c       |   5 +-
 src/backend/optimizer/plan/subselect.c     |   2 +-
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   3 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/execnodes.h              |  13 ++
 src/include/nodes/pathnodes.h              |  11 ++
 src/include/nodes/plannodes.h              |  27 +++-
 src/include/nodes/primnodes.h              |   3 +
 src/include/optimizer/planmain.h           |   4 +-
 35 files changed, 433 insertions(+), 66 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 1bedb808368..97d83bae571 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -213,6 +213,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->utilityStmt = NULL;
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
+	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
 	return nodeToString(pstmt);
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..d8a19a58ebe
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		its working-memory limit.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off the
+ *		PlannedStmt's workMemLimits List, at the (1-based) position indicated
+ *		by the PlanState's Plan node's "workmem_id" field.
+ *
+ *		We assign the workmem_id and expand the workMemLimits List, when
+ *		creating the Plan node; and then we set this limit by calling
+ *		ExecAssignWorkMem(), from InitPlan(), before we initialize the PlanState
+ *		nodes.
+ *
+ * 		The workMemLimit has always applied "per data structure," rather than
+ *		"per PlanState". So a single SQL operator (e.g., RecursiveUnion) can
+ *		use more than the workMemLimit, even though each of its data
+ *		structures is restricted to it.
+ *
+ *		We store the "workmem_id" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "workmem_id" fields on the Plan,
+ *		but set the workMemLimit when we start execution, as part of
+ *		InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "nodes/plannodes.h"
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Assigns working memory to any Plans or SubPlans that need it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	ListCell   *lc_category;
+	ListCell   *lc_limit;
+
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	forboth(lc_category, plannedstmt->workMemCategories,
+			lc_limit, plannedstmt->workMemLimits)
+	{
+		lfirst_int(lc_limit) = lfirst_int(lc_category) == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..b06306d4961 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -433,7 +436,7 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
 							   int used_bits, double input_groups,
-							   double hashentrysize);
+							   double hashentrysize, Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +524,15 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   aggstate->phases[1].aggnode->plan.workmem_id);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -584,6 +596,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 	 */
 	if (pertrans->aggsortrequired)
 	{
+		int			workmem_limit;
+
 		/*
 		 * In case of rescan, maybe there could be an uncompleted sort
 		 * operation?  Clean it up if so.
@@ -591,6 +605,12 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		/*
+		 * Read the sort-input workmem limit off the first Agg node.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   ((Agg *) aggstate->ss.ps.plan)->sortWorkMemId);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +626,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workmem_limit, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +636,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workmem_limit, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -1498,7 +1518,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1528,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1558,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1827,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1851,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1949,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) workMemLimit(aggstate) * 1024);
 		}
 	}
 }
@@ -2014,9 +2037,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2156,7 +2179,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2654,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) workMemLimit(aggstate) * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2744,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2943,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3678,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) workMemLimit(aggstate) * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
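[A note on the recurring "(Size) workMemLimit(node) * 1024" expressions above: the per-node limit is an int in KB, so the cast has to happen before the multiply. A minimal standalone sketch; the `workmem_kb_to_bytes` helper name is mine, not the patch's.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert a work_mem-style limit (an int, in KB) to bytes.  Casting to
 * size_t *before* multiplying, as the patch does, avoids 32-bit int
 * overflow for limits above INT_MAX / 1024 KB (about 2 GB).
 */
size_t
workmem_kb_to_bytes(int workmem_kb)
{
	return (size_t) workmem_kb * 1024;
}
```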
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..0b33a1f4533 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		tbm = tbm_create(workMemLimit(node) * (Size) 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..16d0a164292 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,7 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				result = tbm_create(workMemLimit(subnode) * (Size) 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..08f48f88e65 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, workMemLimit(scanstate));
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..fda42a278b8 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..bb9af08dc5d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -448,6 +448,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -471,11 +472,15 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) workMemLimit(state) * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
+							worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs);
 
@@ -599,6 +604,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,7 +664,8 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t worker_space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs)
@@ -687,9 +694,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -706,7 +713,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -808,7 +815,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
 									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs);
@@ -929,7 +937,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
@@ -1235,7 +1243,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
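[For readers skimming the ExecChooseHashTableSize() change above: the caller now supplies the per-worker budget, and Parallel Hash may scale it up to a combined budget for the one shared hash table. A simplified standalone model of that calculation, not the exact upstream code:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Parallel Hash lets all participants build one shared in-memory hash
 * table, so the total budget is the per-worker budget times the number
 * of participants (workers plus the leader), clamped so it cannot
 * exceed SIZE_MAX.
 */
size_t
combined_hash_table_bytes(size_t worker_space_allowed, int parallel_workers)
{
	double		newlimit;

	newlimit = (double) worker_space_allowed * (double) (parallel_workers + 1);
	if (newlimit > (double) SIZE_MAX)
		newlimit = (double) SIZE_MAX;
	return (size_t) newlimit;
}
```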
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..7a92c1eb2c0 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												workMemLimit(pstate),
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..bf5e921a205 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, workMemLimit(node));
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..4e3da4aab6b 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) workMemLimit(mstate) * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..5ffffd327d2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -33,6 +33,8 @@ build_hash_table(RecursiveUnionState *rustate)
 {
 	RecursiveUnion *node = (RecursiveUnion *) rustate->ps.plan;
 	TupleDesc	desc = ExecGetResultType(outerPlanState(rustate));
+	int			workmem_limit = workMemLimitFromId(rustate,
+												   node->hashWorkMemId);
 
 	Assert(node->numCols > 0);
 	Assert(node->numGroups > 0);
@@ -52,6 +54,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) workmem_limit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +205,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table = tuplestore_begin_heap(false, false,
+												   workMemLimit(rustate));
+	rustate->intermediate_table = tuplestore_begin_heap(false, false,
+														workMemLimit(rustate));
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..2e256f634c8 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) workMemLimit(setopstate) * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..8ec939e25d7 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   workMemLimit(pstate),
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..2d0df165c25 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -536,6 +536,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 	if (node->hashtable)
 		ResetTupleHashTable(node->hashtable);
 	else
+	{
+		int			workmem_limit;
+
+		workmem_limit = workMemLimitFromId(planstate,
+										   subplan->hashtab_workmem_id);
+
 		node->hashtable = BuildTupleHashTable(node->parent,
 											  node->descRight,
 											  &TTSOpsVirtual,
@@ -546,10 +552,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
 											  false);
+	}
 
 	if (!subplan->unknownEqFalse)
 	{
@@ -565,6 +573,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		if (node->hashnulls)
 			ResetTupleHashTable(node->hashnulls);
 		else
+		{
+			int			workmem_limit;
+
+			workmem_limit = workMemLimitFromId(planstate,
+											   subplan->hashnul_workmem_id);
+
 			node->hashnulls = BuildTupleHashTable(node->parent,
 												  node->descRight,
 												  &TTSOpsVirtual,
@@ -575,10 +589,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
 												  false);
+		}
 	}
 	else
 		node->hashnulls = NULL;
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..f679bd67bee 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 workMemLimit(tstate));
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..7660aa626b6 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 workMemLimit(winstate));
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..ca4ab9bd315 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2802,7 +2802,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4224,6 +4225,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
+							get_hash_memory_limit(),
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
@@ -4541,6 +4543,17 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 */
+		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table. */
+			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..97e43d49d1f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1656,6 +1656,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1710,6 +1712,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1856,6 +1860,8 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 best_path->path.rows,
 								 0,
 								 subplan);
+
+		plan->workmem_id = add_hash_workmem(root->glob);
 	}
 	else
 	{
@@ -2202,6 +2208,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2228,6 +2236,8 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2339,6 +2349,12 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	if (plan->aggstrategy == AGG_HASHED)
+		plan->plan.workmem_id = add_hash_workmem(root->glob);
+
+	/* Also include working memory needed to sort the input: */
+	plan->sortWorkMemId = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2392,6 +2408,7 @@ static Plan *
 create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 {
 	Agg		   *plan;
+	Agg		   *first_sort_agg = NULL;
 	Plan	   *subplan;
 	List	   *rollups = best_path->rollups;
 	AttrNumber *grouping_map;
@@ -2457,7 +2474,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
@@ -2480,19 +2497,19 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,6 +2520,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
+			if (agg_plan->aggstrategy == AGG_SORTED && !first_sort_agg)
+			{
+				/* This might be the first Sort agg. */
+				first_sort_agg = agg_plan;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2535,6 +2558,29 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * NOTE: We will place the workmem needed to sort the input (if any)
+		 * on the first agg, the Hash workmem on the first Hash agg, and the
+		 * Sort workmem (if any) on the first Sort agg.
+		 */
+		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
+		{
+			/* All Hash Grouping Sets share the same workmem limit. */
+			plan->plan.workmem_id = add_hash_workmem(root->glob);
+		}
+		else if (plan->aggstrategy == AGG_SORTED)
+		{
+			/* Every Sort Grouping Set gets its own workmem limit. */
+			first_sort_agg = plan;
+		}
+
+		/* Store the workmem limit, for all Sorts, on the first Sort. */
+		if (first_sort_agg)
+			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+
+		/* Also include working memory needed to sort the input: */
+		plan->sortWorkMemId = add_workmem(root->glob);
 	}
 
 	return (Plan *) plan;
@@ -2707,6 +2753,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2747,6 +2795,8 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2783,6 +2833,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
+	/* Also include working memory for hash table. */
+	if (plan->numCols > 0)
+		plan->hashWorkMemId = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -3489,6 +3545,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
+
+		plan->workmem_id = add_workmem(root->glob);
+
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
 		subindexquals = NIL;
@@ -3796,6 +3855,8 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -3839,6 +3900,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -3977,6 +4040,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -4616,6 +4681,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
 
+		matplan->workmem_id = add_workmem(root->glob);
+
 		inner_plan = matplan;
 	}
 
@@ -4961,6 +5028,9 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return join_plan;
 }
 
@@ -5513,6 +5583,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -5544,6 +5616,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -6595,14 +6669,14 @@ make_material(Plan *lefttree)
 
 /*
  * materialize_finished_plan: stick a Material node atop a completed plan
  *
  * There are a couple of places where we want to attach a Material node
  * after completion of create_plan(), without any MaterialPath path.
  * Those places should probably be refactored someday to do this on the
  * Path representation, but it's not worth the trouble yet.
  */
 Plan *
-materialize_finished_plan(Plan *subplan)
+materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 {
 	Plan	   *matplan;
 	Path		matpath;		/* dummy for result of cost_material */
@@ -6641,6 +6715,8 @@ materialize_finished_plan(Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
+	matplan->workmem_id = add_workmem(glob);
+
 	return matplan;
 }
 
@@ -7403,3 +7479,41 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+static int
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+{
+	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	/* the executor will fill this in later: */
+	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
+
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemLimits));
+
+	return list_length(glob->workMemCategories);
+}
+
+/*
+ * add_workmem
+ *	  Add (non-hash) workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem.
+ */
+int
+add_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_NORMAL);
+}
+
+/*
+ * add_hash_workmem
+ *	  Add hash workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem *
+ * hash_mem_multiplier.
+ */
+int
+add_hash_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_HASH);
+}
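[To make the two-phase scheme concrete: the planner only records a category per workmem id; the limits start at 0 and are filled in later by ExecAssignWorkMem(), whose body is not in this hunk. Below is a standalone model using plain arrays in place of PostgreSQL IntLists. The fill-in policy (each NORMAL slot gets work_mem, each HASH slot gets work_mem * hash_mem_multiplier) is my reading of the add_workmem()/add_hash_workmem() comments, and all function names here are illustrative, not from the patch.]

```c
#include <assert.h>

enum WorkMemCategory
{
	WORKMEM_NORMAL,
	WORKMEM_HASH
};

#define MAX_WORKMEM_IDS 64

static enum WorkMemCategory workmem_categories[MAX_WORKMEM_IDS];
static int	workmem_limits[MAX_WORKMEM_IDS];	/* in KB; 0 = unassigned */
static int	workmem_nids = 0;

/* Planner side: register a slot, returning a 1-based id (like the patch). */
int
add_workmem_slot(enum WorkMemCategory category)
{
	workmem_categories[workmem_nids] = category;
	workmem_limits[workmem_nids] = 0;	/* executor fills this in later */
	return ++workmem_nids;
}

/* Executor side: distribute working memory according to category. */
void
assign_workmem(int work_mem_kb, double hash_mem_multiplier)
{
	for (int i = 0; i < workmem_nids; i++)
		workmem_limits[i] = workmem_categories[i] == WORKMEM_HASH ?
			(int) (work_mem_kb * hash_mem_multiplier) : work_mem_kb;
}

/* Executor side: 1-based lookup, mirroring the workMemLimitFromId macro. */
int
workmem_limit_from_id(int id)
{
	return workmem_limits[id - 1];
}
```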
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..56846fdeaab 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -437,7 +437,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	if (cursorOptions & CURSOR_OPT_SCROLL)
 	{
 		if (!ExecSupportsBackwardScan(top_plan))
-			top_plan = materialize_finished_plan(top_plan);
+			top_plan = materialize_finished_plan(glob, top_plan);
 	}
 
 	/*
@@ -573,6 +573,9 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_location = parse->stmt_location;
 	result->stmt_len = parse->stmt_len;
 
+	result->workMemCategories = glob->workMemCategories;
+	result->workMemLimits = glob->workMemLimits;
+
 	result->jitFlags = PGJIT_NONE;
 	if (jit_enabled && jit_above_cost >= 0 &&
 		top_plan->total_cost > jit_above_cost)
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 8230cbea3c3..27ccd04cada 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -533,7 +533,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
 		 */
 		else if (splan->parParam == NIL && enable_material &&
 				 !ExecMaterializesOutput(nodeTag(plan)))
-			plan = materialize_finished_plan(plan);
+			plan = materialize_finished_plan(root->glob, plan);
 
 		result = (Node *) splan;
 		isInitPlan = false;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* total; may be shared with other workers */
+	size_t		worker_space_allowed;	/* exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..dee74d42d13 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,7 +329,8 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
+								int used_bits,
+								Size hash_mem_limit, Size *mem_limit,
 								uint64 *ngroups_limit, int *num_partitions);
 
 /* parallel instrumentation support */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..e4e9e0d1de1 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,8 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t worker_space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a323fa98bbb..461db7a8822 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1265,6 +1265,19 @@ typedef struct PlanState
 			((PlanState *)(node))->instrument->nfiltered2 += (delta); \
 	} while(0)
 
+/* macros for fetching the workmem info associated with a PlanState */
+#define workMemFieldFromId(node, field, id)								\
+	(list_nth_int(((PlanState *)(node))->state->es_plannedstmt->field, \
+				  (id) - 1))
+#define workMemField(node, field)   \
+	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
+
+/* workmem limit: */
+#define workMemLimitFromId(node, id) \
+	(workMemFieldFromId(node, workMemLimits, id))
+#define workMemLimit(node) \
+	(workMemField(node, workMemLimits))
+
 /*
  * EPQState is state for executing an EvalPlanQual recheck on a candidate
  * tuples e.g. in ModifyTable or LockRows.
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..b2901568ceb 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -179,6 +179,17 @@ typedef struct PlannerGlobal
 
 	/* partition descriptors */
 	PartitionDirectory partition_directory pg_node_attr(read_write_ignore);
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannerGlobal;
 
 /* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..9f86f37e6ea 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -133,13 +133,23 @@ typedef struct PlannedStmt
 	ParseLoc	stmt_location;
 	/* length in bytes; 0 means "rest of string" */
 	ParseLoc	stmt_len;
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
 #define exec_subplan_get_plan(plannedstmt, subplan) \
 	((Plan *) list_nth((plannedstmt)->subplans, (subplan)->plan_id - 1))
 
-
 /* ----------------
  *		Plan node
  *
@@ -195,6 +205,8 @@ typedef struct Plan
 	 */
 	/* unique across entire final plan tree */
 	int			plan_node_id;
+	/* 1-based id of workMem to use, or else zero */
+	int			workmem_id;
 	/* target list to be computed at this node */
 	List	   *targetlist;
 	/* implicitly-ANDed qual conditions */
@@ -426,6 +438,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* 1-based id of workMem to use for hash table, or else zero */
+	int			hashWorkMemId;
 } RecursiveUnion;
 
 /* ----------------
@@ -1145,6 +1160,9 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* 1-based id of workMem to use to sort inputs, or else zero */
+	int			sortWorkMemId;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
@@ -1758,4 +1776,11 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+/* different data structures get different working-memory limits */
+typedef enum WorkMemCategory
+{
+	WORKMEM_NORMAL,				/* gets work_mem */
+	WORKMEM_HASH,				/* gets hash_mem_multiplier * work_mem */
+}			WorkMemCategory;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d0576da3e25..2698cf09304 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,9 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	/* 1-based id of workMem to use, or else zero: */
+	int			hashtab_workmem_id; /* for hash table */
+	int			hashnul_workmem_id; /* for NULLs hash table */
 } SubPlan;
 
 /*
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..bf5e89e8415 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -46,9 +46,11 @@ extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
 									 Plan *outer_plan);
 extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 									bool tlist_parallel_safe);
-extern Plan *materialize_finished_plan(Plan *subplan);
+extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
+extern int	add_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.47.1

Attachment: 0002-Add-workmem-estimates-to-Path-node-and-PlannedStmt.patch (application/octet-stream)
From fb957111f9261b759e98691183eaa74213ad0e73 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 4 Mar 2025 23:03:19 +0000
Subject: [PATCH 2/4] Add "workmem" estimates to Path node and PlannedStmt

To allow for future optimizers to make decisions at Path time, this commit
aggregates the Path's total working memory onto the Path's "workmem" field,
normalized to a minimum of 64 KB and rounded up to the next whole KB.

To allow future hooks to override ExecAssignWorkMem(), this commit then
breaks that total working memory into per-data structure working memory,
and stores it, next to the workMemLimit, on the PlannedStmt.
---
 src/backend/executor/execParallel.c     |   2 +
 src/backend/executor/nodeHash.c         |  13 +-
 src/backend/nodes/tidbitmap.c           |  18 ++
 src/backend/optimizer/path/costsize.c   | 407 ++++++++++++++++++++++--
 src/backend/optimizer/plan/createplan.c | 267 +++++++++++++---
 src/backend/optimizer/plan/planner.c    |   2 +
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 ++-
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/execnodes.h           |  12 +
 src/include/nodes/pathnodes.h           |  10 +-
 src/include/nodes/plannodes.h           |   7 +-
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  13 +-
 src/include/optimizer/planmain.h        |   3 +-
 15 files changed, 744 insertions(+), 79 deletions(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 97d83bae571..c247ce1e901 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -214,6 +214,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
 	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemEstimates = estate->es_plannedstmt->workMemEstimates;
+	pstmt->workMemCounts = estate->es_plannedstmt->workMemCounts;
 	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index bb9af08dc5d..f6219df708a 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -453,6 +454,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -482,7 +484,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state->nparticipants - 1 : 0,
 							worker_space_allowed,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -668,7 +670,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -799,6 +802,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	*workmem = normalize_work_bytes(inner_rel_bytes + bucket_bytes);
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -819,7 +825,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									total_space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 3d835024caa..ac4c6b67350 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1554,6 +1554,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ca4ab9bd315..12b1f1d82a9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -102,8 +102,10 @@
 #include "optimizer/paths.h"
 #include "optimizer/placeholder.h"
 #include "optimizer/plancat.h"
+#include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +202,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1119,17 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1606,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1673,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1779,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1865,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1892,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_work_bytes(
+							 Max(relation_byte_size(nrterm->rows,
+													nrterm->pathtarget->width),
+								 relation_byte_size(rterm->rows,
+													rterm->pathtarget->width))
+							 * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_work_bytes(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1962,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1982,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2040,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2070,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2145,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2186,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full") and one that sorts only the suffix ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_work_bytes(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2218,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2231,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2522,6 +2592,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2592,6 +2663,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_work_bytes(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2867,6 +2941,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSortBuffers > 0)
+	{
+		path->workmem += (double) aggcosts->numSortBuffers *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3101,7 +3188,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3183,6 +3270,10 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3337,6 +3429,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3800,6 +3893,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4171,6 +4272,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4229,7 +4331,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4260,6 +4363,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4268,8 +4372,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4286,6 +4390,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4512,6 +4617,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4534,6 +4640,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4545,13 +4654,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 * The logic below MUST match the logic in buildSubPlanHash() and
+		 * ExecInitSubPlan().
 		 */
-		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes((double) nbuckets *
+												  hashentrysize));
 
 		if (!subplan->unknownEqFalse)
 		{
 			/* Also needs a hashnulls table.  */
-			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem_id =
+				add_hash_workmem(root->glob,
+								 normalize_work_bytes((double) nbuckets *
+													  hashentrysize));
 		}
 
 		/*
@@ -6437,7 +6570,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6616,3 +6749,219 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for an aggregate's input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they need to sort their
+ * input, before they can process it. We need one sort buffer per such
+ * aggregate, and this function returns that sort buffer's (estimated) size (in
+ * KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	double		output_tuples;	/* ignored */
+	double		output_bytes;
+
+	/* Account for size of one buffer needed to sort the input. */
+	compute_sort_output_sizes(input_tuples, input_width,
+							  0.0 /* limit_tuples */ ,
+							  &output_tuples, &output_bytes);
+	return normalize_work_bytes(output_bytes);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_work_bytes(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_work_bytes(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total work_mem needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child that child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_work_kb
+ *	  Convert a double, "KB" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes non-zero input to a minimum of 64 (KB), rounding up to the
+ * nearest whole KB.
+ */
+int
+normalize_work_kb(double nkb)
+{
+	double		workmem;
+
+	if (nkb == 0.0)
+		return 0;				/* caller apparently doesn't need any workmem */
+
+	/*
+	 * We'll assign working-memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nkb);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
+
+/*
+ * normalize_work_bytes
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Same as above, but takes input in bytes rather than in KB.
+ */
+int
+normalize_work_bytes(double nbytes)
+{
+	return normalize_work_kb(nbytes / 1024.0);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 97e43d49d1f..263c7e4eb9d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -319,6 +320,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
+static int	add_workmem(PlannerGlobal *glob, int estimate);
+static int	add_workmems(PlannerGlobal *glob, int estimate, int count);
 
 
 /*
@@ -1656,7 +1659,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1712,7 +1716,9 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1861,7 +1867,9 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 0,
 								 subplan);
 
-		plan->workmem_id = add_hash_workmem(root->glob);
+		plan->workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_kb(best_path->path.workmem));
 	}
 	else
 	{
@@ -2208,7 +2216,9 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob,
+					normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2236,7 +2246,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * IncrementalSort creates two sort buffers, whose memory the Path's
+	 * "workmem" estimate combined into one value. Split it back into two.
+	 */
+	plan->sort.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_kb(best_path->spath.path.workmem / 2), 2);
 
 	return plan;
 }
@@ -2349,11 +2365,32 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the AggPath's overall workmem estimate with finer-grained
+	 * estimates.
+	 */
 	if (plan->aggstrategy == AGG_HASHED)
-		plan->plan.workmem_id = add_hash_workmem(root->glob);
+	{
+		int			workmem =
+			compute_agg_output_workmem(root, AGG_HASHED,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   false /* cost_sort */ );
+
+		plan->plan.workmem_id = add_hash_workmem(root->glob, workmem);
+	}
+
+	/* Also include estimated memory needed to sort the input: */
+	if (best_path->numSortBuffers > 0)
+	{
+		int			workmem = compute_agg_input_workmem(subplan->plan_rows,
+														subplan->plan_width);
 
-	/* Also include working memory needed to sort the input: */
-	plan->sortWorkMemId = add_workmem(root->glob);
+		plan->sortWorkMemId =
+			add_workmems(root->glob, workmem, best_path->numSortBuffers);
+	}
 
 	return plan;
 }
@@ -2415,6 +2452,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 	int			maxref;
 	List	   *chain;
 	ListCell   *lc;
+	int			num_sort_aggs = 0;
+	int			max_sort_agg_workmem = 0;
+	double		sum_hash_agg_workmem = 0.0;
 
 	/* Shouldn't get here without grouping sets */
 	Assert(root->parse->groupingSets);
@@ -2476,6 +2516,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			Plan	   *sort_plan = NULL;
 			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
+			int			workmem;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2526,6 +2568,33 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				first_sort_agg = agg_plan;
 			}
 
+			/*
+			 * If we're an AGG_SORTED, but not the last, we need to cost
+			 * working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/* Estimated memory needed to hold the output: */
+			workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width,
+										   cost_sort);
+
+			if (agg_plan->aggstrategy == AGG_HASHED)
+			{
+				/* All Hash Grouping Sets share the same workmem limit. */
+				sum_hash_agg_workmem += workmem;
+			}
+			else if (agg_plan->aggstrategy == AGG_SORTED)
+			{
+				/* Every Sort Grouping Set gets its own workmem limit. */
+				max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+				++num_sort_aggs;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2537,6 +2606,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
+		int			workmem;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2559,6 +2630,27 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
 
+		/*
+		 * If we're an AGG_SORTED, but not the last, we need to cost working
+		 * memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+	/*
+	 * Replace the overall workmem estimate that we copied from the Path
+	 * with finer-grained estimates, computed below for the output and
+	 * (further down) for sorting the input.
+	 */
+
+		/* Estimated memory needed to hold the output: */
+		workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   cost_sort);
+
 		/*
 		 * NOTE: We will place the workmem needed to sort the input (if any)
 		 * on the first agg, the Hash workmem on the first Hash agg, and the
@@ -2567,20 +2659,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
 		{
 			/* All Hash Grouping Sets share the same workmem limit. */
-			plan->plan.workmem_id = add_hash_workmem(root->glob);
+			sum_hash_agg_workmem += workmem;
+			plan->plan.workmem_id = add_hash_workmem(root->glob,
+								normalize_work_kb(sum_hash_agg_workmem));
 		}
 		else if (plan->aggstrategy == AGG_SORTED)
 		{
 			/* Every Sort Grouping Set gets its own workmem limit. */
+			max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+			++num_sort_aggs;
+
 			first_sort_agg = plan;
 		}
 
 		/* Store the workmem limit, for all Sorts, on the first Sort. */
-		if (first_sort_agg)
-			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+		if (num_sort_aggs > 1)
+		{
+			first_sort_agg->plan.workmem_id =
+				add_workmems(root->glob, max_sort_agg_workmem,
+							 num_sort_aggs > 2 ? 2 : 1);
+		}
 
 		/* Also include working memory needed to sort the input: */
-		plan->sortWorkMemId = add_workmem(root->glob);
+		if (best_path->numSortBuffers > 0)
+		{
+			workmem = compute_agg_input_workmem(subplan->plan_rows,
+												subplan->plan_width);
+
+			plan->sortWorkMemId =
+				add_workmems(root->glob, workmem,
+							 best_path->numSortBuffers * list_length(rollups));
+		}
 	}
 
 	return (Plan *) plan;
@@ -2753,7 +2862,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2795,7 +2905,9 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2833,11 +2945,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(Max(relation_byte_size(leftplan->plan_rows,
+																 leftplan->plan_width),
+											  relation_byte_size(rightplan->plan_rows,
+																 rightplan->plan_width))),
+					 2);
 
 	/* Also include working memory for hash table. */
 	if (plan->numCols > 0)
-		plan->hashWorkMemId = add_hash_workmem(root->glob);
+	{
+		Size		entrysize =
+			sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMemId =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes(plan->numGroups * entrysize));
+	}
 
 	return plan;
 }
@@ -3279,6 +3418,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3390,9 +3530,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all of its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3418,6 +3573,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3429,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3465,6 +3621,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3493,8 +3651,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3540,13 +3697,14 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
 
-		plan->workmem_id = add_workmem(root->glob);
+		plan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows)));
 
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
@@ -3855,7 +4013,15 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(relation_byte_size(scan_plan->scan.plan.plan_rows,
+															 scan_plan->scan.plan.plan_width)),
+					 list_length(functions));
 
 	return scan_plan;
 }
@@ -3900,7 +4066,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4040,7 +4207,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4680,8 +4848,10 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
-
-		matplan->workmem_id = add_workmem(root->glob);
+		matplan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(relation_byte_size(matplan->plan_rows,
+																matplan->plan_width)));
 
 		inner_plan = matplan;
 	}
@@ -5029,7 +5199,9 @@ create_hashjoin_plan(PlannerInfo *root,
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
 	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
-	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+	hash_plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->jpath.path.workmem));
 
 	return join_plan;
 }
@@ -5584,7 +5756,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -5617,7 +5790,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	plan->sort.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -6715,7 +6889,8 @@ materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
-	matplan->workmem_id = add_workmem(glob);
+	matplan->workmem_id =
+		add_workmem(glob, normalize_work_kb(matpath.workmem));
 
 	return matplan;
 }
@@ -7481,12 +7656,22 @@ is_projection_capable_plan(Plan *plan)
 }
 
 static int
-add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category,
+					 int estimate, int count)
 {
+	if (estimate == 0 || count == 0)
+		return 0;
+
 	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	glob->workMemEstimates = lappend_int(glob->workMemEstimates, estimate);
+	glob->workMemCounts = lappend_int(glob->workMemCounts, count);
 	/* the executor will fill this in later: */
 	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
 
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemEstimates));
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemCounts));
 	Assert(list_length(glob->workMemCategories) ==
 		   list_length(glob->workMemLimits));
 
@@ -7499,10 +7684,10 @@ add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
  *
  * This data structure will have its working-memory limit set to work_mem.
  */
-int
-add_workmem(PlannerGlobal *glob)
+static int
+add_workmem(PlannerGlobal *glob, int estimate)
 {
-	return add_workmem_internal(glob, WORKMEM_NORMAL);
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, 1);
 }
 
 /*
@@ -7513,7 +7698,13 @@ add_workmem(PlannerGlobal *glob)
  * hash_mem_multiplier.
  */
 int
-add_hash_workmem(PlannerGlobal *glob)
+add_hash_workmem(PlannerGlobal *glob, int estimate)
+{
+	return add_workmem_internal(glob, WORKMEM_HASH, estimate, 1);
+}
+
+static int
+add_workmems(PlannerGlobal *glob, int estimate, int count)
 {
-	return add_workmem_internal(glob, WORKMEM_HASH);
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, count);
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 56846fdeaab..f7606e513b8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -574,6 +574,8 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_len = parse->stmt_len;
 
 	result->workMemCategories = glob->workMemCategories;
+	result->workMemEstimates = glob->workMemEstimates;
+	result->workMemCounts = glob->workMemCounts;
 	result->workMemLimits = glob->workMemLimits;
 
 	result->jitFlags = PGJIT_NONE;
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..0d0fb5cf8ed 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSortBuffers;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..e3242698789 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem to the hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, the same as it does
+	 * for field "est_entries".
+	 */
+	pathnode->path.workmem = normalize_work_bytes(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSortBuffers = aggcosts ? aggcosts->numSortBuffers : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSortBuffers = agg_costs ? agg_costs->numSortBuffers : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_work_bytes(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4e9e0d1de1..6cd9bffbee5 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -63,7 +63,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 461db7a8822..1091e884ef7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1272,6 +1272,18 @@ typedef struct PlanState
 #define workMemField(node, field)   \
 	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
 
+/* workmem estimate: */
+#define workMemEstimateFromId(node, id) \
+	(workMemFieldFromId(node, workMemEstimates, id))
+#define workMemEstimate(node) \
+	(workMemField(node, workMemEstimates))
+
+/* workmem count: */
+#define workMemCountFromId(node, id) \
+	(workMemFieldFromId(node, workMemCounts, id))
+#define workMemCount(node) \
+	(workMemField(node, workMemCounts))
+
 /* workmem limit: */
 #define workMemLimitFromId(node, id) \
 	(workMemFieldFromId(node, workMemLimits, id))
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index b2901568ceb..98a0c1f6778 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSortBuffers; /* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -185,9 +186,12 @@ typedef struct PlannerGlobal
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannerGlobal;
@@ -1707,6 +1711,7 @@ typedef struct Path
 	int			disabled_nodes; /* count of disabled nodes */
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
@@ -2301,6 +2306,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2342,6 +2348,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3385,6 +3392,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9f86f37e6ea..44145a51567 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,9 +139,12 @@ typedef struct PlannedStmt
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannedStmt;
@@ -1160,6 +1163,8 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
 	/* 1-based id of workMem to use to sort inputs, or else zero */
 	int			sortWorkMemId;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index e185635c10b..b5c98a39af7 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -108,6 +108,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..587ea412bda 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,18 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_work_kb(double nkb);
+extern int	normalize_work_bytes(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf5e89e8415..91edfe96e27 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -49,8 +49,7 @@ extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
-extern int	add_workmem(PlannerGlobal *glob);
-extern int	add_hash_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob, int estimate);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.47.1

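As context for the Hash-node accounting in the patch below: when a Hash (Join) spills to multiple batches, compute_hash_workmem() displays the workmem limit plus 2 buffer blocks per batch (`nbytes += nbatch * 2 * BLCKSZ`), since per-batch memory is not currently counted against the limit. A minimal Python sketch of that display-side arithmetic — the helper name and the hard-coded BLCKSZ are illustrative assumptions, not part of the patch:

```python
# Sketch (not part of the patch) of the per-batch memory accounting that
# compute_hash_workmem() performs when displaying a Hash node's limit.

BLCKSZ = 8192  # PostgreSQL's default block size, in bytes (assumption)

def hash_display_bytes(workmem_limit_kb: int, nbatch: int) -> int:
    """Bytes EXPLAIN (work_mem on) would attribute to a Hash node:
    the in-memory limit, plus 2 buffer blocks per batch when batching."""
    nbytes = workmem_limit_kb * 1024
    if nbatch > 1:
        # Each batch needs 2 BLCKSZ-sized buffers (inner and outer spill
        # files), which today is memory over and above the workmem limit.
        nbytes += nbatch * 2 * BLCKSZ
    return nbytes

# A single-batch hash table shows just its limit; a 4-batch one adds
# 4 * 2 * 8 kB = 64 kB of per-batch buffer space on top of it.
print(hash_display_bytes(4096, 1))  # 4194304
print(hash_display_bytes(4096, 4))  # 4259840
```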
Attachment: 0003-Add-EXPLAIN-work_mem-on-command-option.patch (application/octet-stream)
From 6ac597951255bb797d5a9120211d44a019e2416f Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 26 Feb 2025 01:02:19 +0000
Subject: [PATCH 3/4] Add EXPLAIN (work_mem on) command option

So that users can see how much working memory a query is likely to use, as
well as how much memory it will be limited to, this commit adds an
EXPLAIN (work_mem on) command option that displays the workmem estimate
and limit, added in the previous two commits.
---
 src/backend/commands/explain.c        | 228 +++++++++
 src/backend/executor/nodeHash.c       |   7 +-
 src/backend/optimizer/path/costsize.c |   4 +-
 src/include/commands/explain.h        |   4 +
 src/include/executor/nodeHash.h       |   2 +-
 src/test/regress/expected/workmem.out | 653 ++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   2 +-
 src/test/regress/sql/workmem.sql      | 307 ++++++++++++
 8 files changed, 1201 insertions(+), 6 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d8a7232cedb..eec2cb67ddd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,8 @@
 #include "commands/explain_dr.h"
 #include "commands/explain_format.h"
 #include "commands/prepare.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -27,6 +29,7 @@
 #include "nodes/extensible.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteHandler.h"
@@ -154,6 +157,14 @@ static ExplainWorkersState *ExplainCreateWorkersState(int num_workers);
 static void ExplainOpenWorker(int n, ExplainState *es);
 static void ExplainCloseWorker(int n, ExplainState *es);
 static void ExplainFlushWorkersState(ExplainState *es);
+static void compute_subplan_workmem(List *plans, double *sp_estimate,
+									double *sp_limit);
+static void compute_agg_workmem(PlanState *planstate, Agg *agg,
+								double *agg_estimate, double *agg_limit);
+static void compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+								 double *hash_limit);
+static void increment_workmem(PlanState *planstate, int workmem_id,
+							  double *estimate, double *limit);
 
 
 
@@ -209,6 +220,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -809,6 +822,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory", "kB",
+							 es->total_workmem_estimate, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1944,6 +1965,71 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_estimate = 0.0;
+		double		plan_limit = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_estimate,
+								&plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_estimate,
+								&plan_limit);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem(planstate, (Agg *) plan,
+									&plan_estimate, &plan_limit);
+				break;
+			case T_Hash:
+				compute_hash_workmem(planstate, &plan_estimate, &plan_limit);
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					if (runion->hashWorkMemId > 0)
+						increment_workmem(planstate, runion->hashWorkMemId,
+										  &plan_estimate, &plan_limit);
+				}
+				/* FALLTHROUGH */
+			default:
+				if (plan->workmem_id > 0)
+					increment_workmem(planstate, plan->workmem_id,
+									  &plan_estimate, &plan_limit);
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_estimate *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
+
+		es->total_workmem_estimate += plan_estimate;
+		es->total_workmem_limit += plan_limit;
+
+		if (plan_estimate > 0.0 || plan_limit > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str, "  (work_mem=%.0f kB limit=%.0f kB)",
+								 plan_estimate, plan_limit);
+			else
+			{
+				ExplainPropertyFloat("Working Memory", "kB",
+									 plan_estimate, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2488,6 +2574,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlan-s, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = ((Gather *) plan)->num_workers;
+		else
+			es->num_workers = ((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2544,6 +2644,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -4931,3 +5037,125 @@ ExplainFlushWorkersState(ExplainState *es)
 	pfree(wstate->worker_state_save);
 	pfree(wstate);
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a list of SubPlan objects
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments the work_mem counters to include each SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *sp_estimate, double *sp_limit)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashtab_workmem_id,
+							  sp_estimate, sp_limit);
+
+		if (sp->hashnul_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashnul_workmem_id,
+							  sp_estimate, sp_limit);
+	}
+}
+
+static void
+compute_agg_workmem_node(PlanState *planstate, Agg *agg, double *agg_estimate,
+						 double *agg_limit)
+{
+	/* Record memory used for output data structures. */
+	if (agg->plan.workmem_id > 0)
+		increment_workmem(planstate, agg->plan.workmem_id, agg_estimate,
+						  agg_limit);
+
+	/* Record memory used for input sort buffers. */
+	if (agg->sortWorkMemId > 0)
+		increment_workmem(planstate, agg->sortWorkMemId, agg_estimate,
+						  agg_limit);
+}
+
+/*
+ * compute_agg_workmem - compute Agg node's total workmem estimate and limit
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node.
+ */
+static void
+compute_agg_workmem(PlanState *planstate, Agg *agg, double *agg_estimate,
+					double *agg_limit)
+{
+	compute_agg_workmem_node(planstate, agg, agg_estimate, agg_limit);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		compute_agg_workmem_node(planstate, aggnode, agg_estimate, agg_limit);
+}
+
+/*
+ * compute_hash_workmem - compute total workmem for a Hash node
+ *
+ * This function is complicated because we can currently adjust workmem limits
+ * for Hash (Joins) at runtime, and because the memory a Hash (Join) needs
+ * per batch is not currently counted against the workmem limit.
+ *
+ * Here, we try to give a more accurate accounting than we'd get from just
+ * displaying limit * count.
+ */
+static void
+compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+					 double *hash_limit)
+{
+	double		count = workMemCount(planstate);
+	double		estimate = workMemEstimate(planstate);
+	size_t		limit = workMemLimit(planstate);
+	HashState  *hstate = (HashState *) planstate;
+	Plan	   *plan = planstate->plan;
+	Hash	   *hash = (Hash *) plan;
+	Plan	   *outerNode = outerPlan(plan);
+	double		rows;
+	size_t		nbytes;
+	size_t		total_space_allowed;	/* ignored */
+	int			nbuckets;		/* ignored */
+	int			nbatch;
+	int			num_skew_mcvs;	/* ignored */
+	int			workmem_estimate;	/* ignored */
+
+	/*
+	 * For Hash Joins, we currently don't count per-batch memory against the
+	 * "workmem_limit", but we can at least estimate it for display with the
+	 * Plan.
+	 */
+	rows = plan->parallel_aware ? hash->rows_total : outerNode->plan_rows;
+	nbytes = limit * 1024;
+
+	ExecChooseHashTableSize(rows, outerNode->plan_width,
+							OidIsValid(hash->skewTable),
+							hstate->parallel_state != NULL,
+							hstate->parallel_state != NULL ?
+							hstate->parallel_state->nparticipants - 1 : 0,
+							&nbytes, &total_space_allowed,
+							&nbuckets, &nbatch, &num_skew_mcvs,
+							&workmem_estimate);
+
+	/* Include space for per-batch memory, if any: 2 blocks per batch. */
+	if (nbatch > 1)
+		nbytes += nbatch * 2 * BLCKSZ;
+
+	Assert(nbytes >= limit * 1024);
+
+	*hash_estimate += estimate * count;
+	*hash_limit += (double) normalize_work_bytes(nbytes) * count;
+}
+
+static void
+increment_workmem(PlanState *planstate, int workmem_id, double *estimate,
+				  double *limit)
+{
+	double		count = workMemCountFromId(planstate, workmem_id);
+
+	*estimate += workMemEstimateFromId(planstate, workmem_id) * count;
+	*limit += workMemLimitFromId(planstate, workmem_id) * count;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index f6219df708a..fb1aae7cca8 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -482,7 +482,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
-							worker_space_allowed,
+							&worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
@@ -666,7 +666,7 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t worker_space_allowed,
+						size_t *worker_space_allowed,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
@@ -699,7 +699,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = worker_space_allowed;
+	hash_table_bytes = *worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -944,6 +944,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
+		*worker_space_allowed = (*worker_space_allowed) * 2;
 		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 12b1f1d82a9..c1db6f53d10 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4277,6 +4277,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			numbuckets;
 	int			numbatches;
 	int			num_skew_mcvs;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;	/* unused */
 
 	/* Count up disabled nodes. */
@@ -4322,12 +4323,13 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	 * XXX at some point it might be interesting to try to account for skew
 	 * optimization in the cost estimate, but for now, we don't.
 	 */
+	worker_space_allowed = get_hash_memory_limit();
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
-							get_hash_memory_limit(),
+							&worker_space_allowed,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 64547bd9b9c..cd8be1c5bdb 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,9 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes *planned* to use */
+	double		total_workmem_estimate; /* total working memory estimate */
+	double		total_workmem_limit;	/* total working memory limit */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 6cd9bffbee5..b346a270b67 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t worker_space_allowed,
+									size_t *worker_space_allowed,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..25e1dbb315b
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,653 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB limit=4096 kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(12 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+                    workmem_filter                     
+-------------------------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB limit=8192 kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB limit=4096 kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(9 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB limit=8192 kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(14 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB limit=4096 kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB limit=4096 kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(15 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=4096 kB)
+   ->  Sort  (work_mem=N kB limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=4096 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 20480 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                          workmem_filter                           
+-------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB limit=8192 kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 49152 kB
+(9 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+            QUERY PLAN            
+----------------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory: 0 kB
+ Total Working Memory Limit: 0 kB
+(4 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                             
+------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=12288 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                             workmem_filter                             
+------------------------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on workmem_xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(5 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+drop table workmem_xmldata;
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory: 0 kB
+ Total Working Memory Limit: 0 kB
+(6 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB limit=8192 kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory: N kB
+ Total Working Memory Limit: 8192 kB
+(6 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB limit=8192 kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB limit=4096 kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB limit=16384 kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 28672 kB
+(15 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB limit=8192 kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB limit=4096 kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(12 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=4096 kB)
+         ->  Sort  (work_mem=N kB limit=4096 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 12288 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB limit=4096 kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB limit=4096 kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB limit=4096 kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB limit=4096 kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 20480 kB
+(20 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter             
+----------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 16384 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=16384 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=8192 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 24576 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..d1cec9eb051
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,307 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+drop table workmem_xmldata;
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

Attachment: 0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream)
From 368bd6a1af562c88d4682f1a96b8919b5eb1a641 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 5 Mar 2025 01:21:20 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 408 ++++++++++++++++
 src/backend/executor/execWorkmem.c   |  40 +-
 src/include/executor/executor.h      |   4 +
 8 files changed, 1472 insertions(+), 11 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..cd684f5fe04
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=25600 kB)
+   ->  Sort  (work_mem=N kB limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20480 kB)
+   ->  Sort  (work_mem=N kB limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=102400 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=102399 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                    
+--------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=34134 kB)
+         ->  Sort  (work_mem=N kB limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=34133 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=1024 kB)
+   ->  Sort  (work_mem=N kB limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=820 kB)
+   ->  Sort  (work_mem=N kB limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=819 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=4096 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=4095 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=1366 kB)
+         ->  Sort  (work_mem=N kB limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=1365 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=20 kB)
+   ->  Sort  (work_mem=N kB limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB limit=16 kB)
+   ->  Sort  (work_mem=N kB limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB limit=80 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB limit=78 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                  workmem_filter                                   
+-----------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB limit=27 kB)
+         ->  Sort  (work_mem=N kB limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB limit=26 kB)
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+           workmem_filter            
+-------------------------------------
+ Result  (work_mem=N kB limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..e6dbc35bf10
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from file explain.sql. We can't reuse that
+-- function, because this test runs in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory: \d+\M', 'Memory: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..c5b792b2f8d
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,408 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Distribute a query-wide working-memory budget (the
+ *	  workmem.query_work_mem GUC) across a plan's individual nodes.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to make "
+							"available to each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			idx;
+
+	for (idx = 0; idx < list_length(plannedstmt->workMemCategories); ++idx)
+	{
+		WorkMemCategory category;
+		int			count;
+		int			estimate;
+		ListCell   *limit_cell;
+		int			limit;
+		Target	   *target;
+
+		category =
+			(WorkMemCategory) list_nth_int(plannedstmt->workMemCategories, idx);
+		count = list_nth_int(plannedstmt->workMemCounts, idx);
+		estimate = list_nth_int(plannedstmt->workMemEstimates, idx);
+
+		limit = category == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+		limit_cell = list_nth_cell(plannedstmt->workMemLimits, idx);
+		lfirst_int(limit_cell) = limit;
+
+		target = make_target(estimate, &lfirst_int(limit_cell), count);
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit < target->priority, meaning we restricted this
+			 * target to less memory than (we estimate) it needs, then the
+			 * target_ratio will stay the same, since, letting A = remaining,
+			 * B = limit, and R = ratio, we'll have:
+			 *
+			 * R=A/B <=> A=R*B <=> A-R*X = R*B - R*X <=> A-R*X = R * (B-X) <=>
+			 * R = (A-R*X) / (B-X)
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if target_limit = target->priority, so we didn't need to
+			 * restrict this target beyond its priority estimate, then the
+			 * target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 *
+	 * We could re-assign working-memory limits on the parallel worker, to
+	 * only those Plan nodes that got sent to the worker, but for now we don't
+	 * bother.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) a->ptr_value)->workmem -
+			((Target *) b->ptr_value)->workmem;
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return ((Target *) b->ptr_value)->workmem -
+			((Target *) a->ptr_value)->workmem;
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index d8a19a58ebe..37420666065 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -52,6 +52,10 @@
 #include "nodes/plannodes.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
+
 /* ------------------------------------------------------------------------
  *		ExecAssignWorkMem
  *
@@ -64,20 +68,36 @@
  */
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We check this here, rather than in standard_ExecAssignWorkMem(),
+		 * so that a hook remains free to re-assign memory on parallel
+		 * workers, possibly calling standard_ExecAssignWorkMem() first.
+		 */
+		if (IsParallelWorker())
+			return;
+
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
 	ListCell   *lc_category;
 	ListCell   *lc_limit;
 
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
-
 	forboth(lc_category, plannedstmt->workMemCategories,
 			lc_limit, plannedstmt->workMemLimits)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

#24James Hunter
james.hunter.pg@gmail.com
In reply to: James Hunter (#23)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, Mar 4, 2025 at 5:47 PM James Hunter <james.hunter.pg@gmail.com> wrote:

> Attaching a new revision, which substantially reworks the previous revision --

Attaching a rebased revision, with some minor changes.

Also, some context for why this change is especially useful for cloud
variants of PostgreSQL -- if you compare PostgreSQL's guidance for
buffer pool size [1] to Amazon Aurora's [2], PostgreSQL recommends that
the buffer pool be sized to 25% of system memory, while Aurora
recommends sizing it to ~ 70%. PostgreSQL explicitly relies on the OS
filesystem cache, effectively to extend the buffer pool, while the
Aurora docs don't mention this at all.

Accordingly, Aurora PostgreSQL queries have less memory to work with
than ordinary PostgreSQL queries, making per-Node memory limits more
important.

Questions, comments?

Thanks,
James

[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS
[2]: https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-postgresql-parameters/shared-buffers.html

Attachments:

0003-Add-EXPLAIN-work_mem-on-command-option.patch
From f2494d3b33405e2af8838b876cf97d2bb06666fb Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 26 Feb 2025 01:02:19 +0000
Subject: [PATCH 3/4] Add EXPLAIN (work_mem on) command option

So that users can see how much working memory a query is likely to use, as
well as how much memory it will be limited to, this commit adds an
EXPLAIN (work_mem on) command option that displays the workmem estimate
and limit, added in the previous two commits.
---
 src/backend/commands/explain.c        | 233 +++++++++
 src/backend/executor/nodeHash.c       |   7 +-
 src/backend/optimizer/path/costsize.c |   4 +-
 src/include/commands/explain.h        |   4 +
 src/include/executor/nodeHash.h       |   2 +-
 src/test/regress/expected/workmem.out | 653 ++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   2 +-
 src/test/regress/sql/workmem.sql      | 307 ++++++++++++
 8 files changed, 1206 insertions(+), 6 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d8a7232cedb..bc8e68e7be1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,8 @@
 #include "commands/explain_dr.h"
 #include "commands/explain_format.h"
 #include "commands/prepare.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -27,6 +29,7 @@
 #include "nodes/extensible.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteHandler.h"
@@ -154,6 +157,14 @@ static ExplainWorkersState *ExplainCreateWorkersState(int num_workers);
 static void ExplainOpenWorker(int n, ExplainState *es);
 static void ExplainCloseWorker(int n, ExplainState *es);
 static void ExplainFlushWorkersState(ExplainState *es);
+static void compute_subplan_workmem(List *plans, double *sp_estimate,
+									double *sp_limit);
+static void compute_agg_workmem(PlanState *planstate, Agg *agg,
+								double *agg_estimate, double *agg_limit);
+static void compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+								 double *hash_limit);
+static void increment_workmem(PlanState *planstate, int workmem_id,
+							  double *estimate, double *limit);
 
 
 
@@ -209,6 +220,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
@@ -809,6 +822,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory Estimate", "kB",
+							 es->total_workmem_estimate, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1944,6 +1965,72 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_estimate = 0.0;
+		double		plan_limit = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_estimate,
+								&plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_estimate,
+								&plan_limit);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem(planstate, (Agg *) plan,
+									&plan_estimate, &plan_limit);
+				break;
+			case T_Hash:
+				compute_hash_workmem(planstate, &plan_estimate, &plan_limit);
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					if (runion->hashWorkMemId > 0)
+						increment_workmem(planstate, runion->hashWorkMemId,
+										  &plan_estimate, &plan_limit);
+				}
+				/* FALLTHROUGH */
+			default:
+				if (plan->workmem_id > 0)
+					increment_workmem(planstate, plan->workmem_id,
+									  &plan_estimate, &plan_limit);
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_estimate *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
+
+		es->total_workmem_estimate += plan_estimate;
+		es->total_workmem_limit += plan_limit;
+
+		if (plan_estimate > 0.0 || plan_limit > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str,
+								 "  (work_mem=%.0f kB) (limit=%.0f kB)",
+								 plan_estimate, plan_limit);
+			else
+			{
+				ExplainPropertyFloat("Working Memory Estimate", "kB",
+									 plan_estimate, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2488,6 +2575,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlan-s, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = es->analyze ?
+				((GatherState *) planstate)->nworkers_launched :
+				((Gather *) plan)->num_workers;
+		else
+			es->num_workers = es->analyze ?
+				((GatherMergeState *) planstate)->nworkers_launched :
+				((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2544,6 +2649,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -4931,3 +5042,125 @@ ExplainFlushWorkersState(ExplainState *es)
 	pfree(wstate->worker_state_save);
 	pfree(wstate);
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a SubPlan object
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments work_mem counters to include the SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *sp_estimate, double *sp_limit)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashtab_workmem_id,
+							  sp_estimate, sp_limit);
+
+		if (sp->hashnul_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashnul_workmem_id,
+							  sp_estimate, sp_limit);
+	}
+}
+
+static void
+compute_agg_workmem_node(PlanState *planstate, Agg *agg, double *agg_estimate,
+						 double *agg_limit)
+{
+	/* Record memory used for output data structures. */
+	if (agg->plan.workmem_id > 0)
+		increment_workmem(planstate, agg->plan.workmem_id, agg_estimate,
+						  agg_limit);
+
+	/* Record memory used for input sort buffers. */
+	if (agg->sortWorkMemId > 0)
+		increment_workmem(planstate, agg->sortWorkMemId, agg_estimate,
+						  agg_limit);
+}
+
+/*
+ * compute_agg_workmem - compute Agg node's total workmem estimate and limit
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node.
+ */
+static void
+compute_agg_workmem(PlanState *planstate, Agg *agg, double *agg_estimate,
+					double *agg_limit)
+{
+	compute_agg_workmem_node(planstate, agg, agg_estimate, agg_limit);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		compute_agg_workmem_node(planstate, aggnode, agg_estimate, agg_limit);
+}
+
+/*
+ * compute_hash_workmem - compute total workmem for a Hash node
+ *
+ * This function is complicated because we can currently adjust workmem
+ * limits for Hash (Joins) at runtime, and because the memory a Hash (Join)
+ * needs per batch is not currently counted against the workmem limit.
+ *
+ * Here, we try to give a more accurate accounting than we'd get from just
+ * displaying limit * count.
+ */
+static void
+compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+					 double *hash_limit)
+{
+	double		count = workMemCount(planstate);
+	double		estimate = workMemEstimate(planstate);
+	size_t		limit = workMemLimit(planstate);
+	HashState  *hstate = (HashState *) planstate;
+	Plan	   *plan = planstate->plan;
+	Hash	   *hash = (Hash *) plan;
+	Plan	   *outerNode = outerPlan(plan);
+	double		rows;
+	size_t		nbytes;
+	size_t		total_space_allowed;	/* ignored */
+	int			nbuckets;		/* ignored */
+	int			nbatch;
+	int			num_skew_mcvs;	/* ignored */
+	int			workmem_estimate;	/* ignored */
+
+	/*
+	 * For Hash Joins, we currently don't count per-batch memory against the
+	 * "workmem_limit", but we can at least estimate it for display with the
+	 * Plan.
+	 */
+	rows = plan->parallel_aware ? hash->rows_total : outerNode->plan_rows;
+	nbytes = limit * 1024;
+
+	ExecChooseHashTableSize(rows, outerNode->plan_width,
+							OidIsValid(hash->skewTable),
+							hstate->parallel_state != NULL,
+							hstate->parallel_state != NULL ?
+							hstate->parallel_state->nparticipants - 1 : 0,
+							&nbytes, &total_space_allowed,
+							&nbuckets, &nbatch, &num_skew_mcvs,
+							&workmem_estimate);
+
+	/* Include space for per-batch memory, if any: 2 blocks per batch. */
+	if (nbatch > 1)
+		nbytes += nbatch * 2 * BLCKSZ;
+
+	Assert(nbytes >= limit * 1024);
+
+	*hash_estimate += estimate * count;
+	*hash_limit += (double) normalize_work_bytes(nbytes) * count;
+}
+
+static void
+increment_workmem(PlanState *planstate, int workmem_id, double *estimate,
+				  double *limit)
+{
+	double		count = workMemCountFromId(planstate, workmem_id);
+
+	*estimate += workMemEstimateFromId(planstate, workmem_id) * count;
+	*limit += workMemLimitFromId(planstate, workmem_id) * count;
+}
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 7d09ac8b5a3..6ae3d649be6 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -482,7 +482,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
-							worker_space_allowed,
+							&worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
@@ -666,7 +666,7 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t worker_space_allowed,
+						size_t *worker_space_allowed,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
@@ -699,7 +699,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = worker_space_allowed;
+	hash_table_bytes = *worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -963,6 +963,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
+		*worker_space_allowed = (*worker_space_allowed) * 2;
 		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 12b1f1d82a9..c1db6f53d10 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4277,6 +4277,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			numbuckets;
 	int			numbatches;
 	int			num_skew_mcvs;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;	/* unused */
 
 	/* Count up disabled nodes. */
@@ -4322,12 +4323,13 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	 * XXX at some point it might be interesting to try to account for skew
 	 * optimization in the cost estimate, but for now, we don't.
 	 */
+	worker_space_allowed = get_hash_memory_limit();
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
-							get_hash_memory_limit(),
+							&worker_space_allowed,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 64547bd9b9c..cd8be1c5bdb 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -53,6 +53,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	bool		memory;			/* print planner's memory usage information */
+	bool		work_mem;		/* print work_mem estimates per node */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
@@ -69,6 +70,9 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes *planned* to use */
+	double		total_workmem_estimate; /* total working memory estimate */
+	double		total_workmem_limit;	/* total working memory limit */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 6cd9bffbee5..b346a270b67 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t worker_space_allowed,
+									size_t *worker_space_allowed,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..ca8edde6d5f
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,653 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB) (limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB) (limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB) (limit=4096 kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(12 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+                     workmem_filter                      
+---------------------------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB) (limit=8192 kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB) (limit=4096 kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(9 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB) (limit=8192 kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(14 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                               workmem_filter                               
+----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(15 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=4096 kB)
+   ->  Sort  (work_mem=N kB) (limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=4096 kB)
+   ->  Sort  (work_mem=N kB) (limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=4096 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 20480 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB) (limit=8192 kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB) (limit=40960 kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 49152 kB
+(9 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+             QUERY PLAN              
+-------------------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory Estimate: 0 kB
+ Total Working Memory Limit: 0 kB
+(4 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=12288 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on workmem_xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(5 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+drop table workmem_xmldata;
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory Estimate: 0 kB
+ Total Working Memory Limit: 0 kB
+(6 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB) (limit=8192 kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(6 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                               workmem_filter                                
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB) (limit=8192 kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB) (limit=4096 kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB) (limit=16384 kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 28672 kB
+(15 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                           workmem_filter                           
+--------------------------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB) (limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB) (limit=4096 kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(12 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                     
+---------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=4096 kB)
+         ->  Sort  (work_mem=N kB) (limit=4096 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB) (limit=4096 kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB) (limit=4096 kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB) (limit=4096 kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB) (limit=4096 kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 20480 kB
+(20 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+              workmem_filter              
+------------------------------------------
+ Result  (work_mem=N kB) (limit=16384 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=16384 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=8192 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 24576 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 37b6d21e1f9..1089e3bdf96 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..2de22be0427
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,307 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+drop table workmem_xmldata;
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.47.1

Attachment: 0002-Add-workmem-estimates-to-Path-node-and-PlannedStmt.patch
From 04e7532bb9c387e748b38103c5ff57975898b84f Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 4 Mar 2025 23:03:19 +0000
Subject: [PATCH 2/4] Add "workmem" estimates to Path node and PlannedStmt

To allow future optimizers to make decisions at Path-creation time, this
commit aggregates each Path's total working memory into its "workmem"
field, normalized to a minimum of 64 KB and rounded up to the next whole
KB.

To allow future hooks to override ExecAssignWorkMem(), this commit then
breaks that total working memory down into per-data-structure amounts and
stores them, alongside the workMemLimit, on the PlannedStmt.
---
 src/backend/executor/execParallel.c     |   2 +
 src/backend/executor/nodeHash.c         |  32 +-
 src/backend/nodes/tidbitmap.c           |  18 ++
 src/backend/optimizer/path/costsize.c   | 407 ++++++++++++++++++++++--
 src/backend/optimizer/plan/createplan.c | 267 +++++++++++++---
 src/backend/optimizer/plan/planner.c    |   2 +
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 ++-
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/execnodes.h           |  12 +
 src/include/nodes/pathnodes.h           |  10 +-
 src/include/nodes/plannodes.h           |   7 +-
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  13 +-
 src/include/optimizer/planmain.h        |   3 +-
 15 files changed, 763 insertions(+), 79 deletions(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 97d83bae571..c247ce1e901 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -214,6 +214,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
 	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemEstimates = estate->es_plannedstmt->workMemEstimates;
+	pstmt->workMemCounts = estate->es_plannedstmt->workMemCounts;
 	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index bb9af08dc5d..7d09ac8b5a3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -453,6 +454,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -482,7 +484,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state->nparticipants - 1 : 0,
 							worker_space_allowed,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -668,7 +670,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -769,6 +772,27 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		*num_skew_mcvs = 0;
 
 	/*
+	 * Set "workmem" to the amount of memory needed to hold the inner rel in a
+	 * single batch. So this calculation doesn't care about "max_pointers".
+	 */
+	dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
+	nbuckets = (int) dbuckets;
+	/* don't let nbuckets be really small, though ... */
+	nbuckets = Max(nbuckets, 1024);
+	/* ... and force it to be a power of 2. */
+	nbuckets = pg_nextpower2_32(nbuckets);
+	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/* Don't forget the 2% overhead reserved for skew buckets! */
+	*workmem = useskew ?
+		normalize_work_bytes((inner_rel_bytes + bucket_bytes) *
+							 100.0 / (100.0 - SKEW_HASH_MEM_PERCENT)) :
+		normalize_work_bytes(inner_rel_bytes + bucket_bytes);
+
+	/*
+	 * Now redo the nbuckets and bucket_bytes calculations, taking memory
+	 * limits into account.
+	 *
 	 * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
 	 * memory is filled, assuming a single batch; but limit the value so that
 	 * the pointer arrays we'll try to allocate do not exceed hash_table_bytes
@@ -799,6 +823,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -819,7 +844,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									total_space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 3d835024caa..ac4c6b67350 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1554,6 +1554,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ca4ab9bd315..12b1f1d82a9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -102,8 +102,10 @@
 #include "optimizer/paths.h"
 #include "optimizer/placeholder.h"
 #include "optimizer/plancat.h"
+#include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -200,9 +202,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1112,6 +1119,18 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1587,6 +1606,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1644,6 +1673,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1740,6 +1779,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1823,7 +1865,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1850,12 +1892,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_work_bytes(
+							 Max(relation_byte_size(nrterm->rows,
+													nrterm->pathtarget->width),
+								 relation_byte_size(rterm->rows,
+													rterm->pathtarget->width))
+							 * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_work_bytes(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1895,7 +1962,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1915,17 +1982,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1982,6 +2040,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2011,6 +2070,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2085,7 +2145,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2126,6 +2186,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full"), and one that sorts only the suffix columns
+	 * ("prefix"). We'll assume they're both around the same size: large
+	 * enough to hold one sort group.
+	 */
+	path->workmem = normalize_work_bytes(group_nbytes * 2.0);
 }
 
 /*
@@ -2150,8 +2218,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2162,6 +2231,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2522,6 +2592,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2592,6 +2663,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
 		ndistinct = calls;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_work_bytes(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2867,6 +2941,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSortBuffers > 0)
+	{
+		path->workmem += (double) aggcosts->numSortBuffers *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3101,7 +3188,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3183,6 +3270,11 @@ cost_windowagg(Path *path, PlannerInfo *root,
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3337,6 +3429,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3800,6 +3893,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4171,6 +4272,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4229,7 +4331,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4260,6 +4363,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4268,8 +4372,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4286,6 +4390,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4512,6 +4617,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4534,6 +4640,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4545,13 +4654,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 * The logic below MUST match the logic in buildSubPlanHash() and
+		 * ExecInitSubPlan().
 		 */
-		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes((double) nbuckets *
+												  hashentrysize));
 
 		if (!subplan->unknownEqFalse)
 		{
 			/* Also needs a hashnulls table.  */
-			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem_id =
+				add_hash_workmem(root->glob,
+								 normalize_work_bytes((double) nbuckets *
+													  hashentrysize));
 		}
 
 		/*
@@ -6437,7 +6570,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6616,3 +6749,219 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for an aggregate's
+ *	  input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they must sort their input
+ * before they can process it. We need one sort buffer per such aggregate, and
+ * this function returns that sort buffer's estimated size (in KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	double		output_tuples;	/* ignored */
+	double		output_bytes;
+
+	/* Account for size of one buffer needed to sort the input. */
+	compute_sort_output_sizes(input_tuples, input_width,
+							  0.0 /* limit_tuples */ ,
+							  &output_tuples, &output_bytes);
+	return normalize_work_bytes(output_bytes);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_work_bytes(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_work_bytes(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total working memory needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its ancestors'
+	 * results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_work_kb
+ *	  Convert a double, "KB" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes non-zero input to a minimum of 64 (KB), rounding up to the
+ * nearest whole KB.
+ */
+int
+normalize_work_kb(double nkb)
+{
+	double		workmem;
+
+	if (nkb == 0.0)
+		return 0;				/* caller apparently doesn't need any workmem */
+
+	/*
+	 * We'll assign working-memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nkb);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
+
+/*
+ * normalize_work_bytes
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Same as above, but takes input in bytes rather than in KB.
+ */
+int
+normalize_work_bytes(double nbytes)
+{
+	return normalize_work_kb(nbytes / 1024.0);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 97e43d49d1f..263c7e4eb9d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -319,6 +320,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
+static int	add_workmem(PlannerGlobal *glob, int estimate);
+static int	add_workmems(PlannerGlobal *glob, int estimate, int count);
 
 
 /*
@@ -1656,7 +1659,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1712,7 +1716,9 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1861,7 +1867,9 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 0,
 								 subplan);
 
-		plan->workmem_id = add_hash_workmem(root->glob);
+		plan->workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_kb(best_path->path.workmem));
 	}
 	else
 	{
@@ -2208,7 +2216,9 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob,
+					normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2236,7 +2246,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * IncrementalSort creates two sort buffers, whose sizes the Path's
+	 * "workmem" estimate combined into a single value. Split it into two now.
+	 */
+	plan->sort.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_kb(best_path->spath.path.workmem / 2), 2);
 
 	return plan;
 }
@@ -2349,11 +2365,32 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the AggPath's overall workmem estimate with finer-grained
+	 * estimates.
+	 */
 	if (plan->aggstrategy == AGG_HASHED)
-		plan->plan.workmem_id = add_hash_workmem(root->glob);
+	{
+		int			workmem =
+			compute_agg_output_workmem(root, AGG_HASHED,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   false /* cost_sort */ );
+
+		plan->plan.workmem_id = add_hash_workmem(root->glob, workmem);
+	}
+
+	/* Also include estimated memory needed to sort the input: */
+	if (best_path->numSortBuffers > 0)
+	{
+		int			workmem = compute_agg_input_workmem(subplan->plan_rows,
+														subplan->plan_width);
 
-	/* Also include working memory needed to sort the input: */
-	plan->sortWorkMemId = add_workmem(root->glob);
+		plan->sortWorkMemId =
+			add_workmems(root->glob, workmem, best_path->numSortBuffers);
+	}
 
 	return plan;
 }
@@ -2415,6 +2452,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 	int			maxref;
 	List	   *chain;
 	ListCell   *lc;
+	int			num_sort_aggs = 0;
+	int			max_sort_agg_workmem = 0;
+	double		sum_hash_agg_workmem = 0.0;
 
 	/* Shouldn't get here without grouping sets */
 	Assert(root->parse->groupingSets);
@@ -2476,6 +2516,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			Plan	   *sort_plan = NULL;
 			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
+			int			workmem;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2526,6 +2568,33 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				first_sort_agg = agg_plan;
 			}
 
+			/*
+			 * If we're an AGG_SORTED, but not the last, we must account for
+			 * the working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/* Estimated memory needed to hold the output: */
+			workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width,
+										   cost_sort);
+
+			if (agg_plan->aggstrategy == AGG_HASHED)
+			{
+				/* All Hash Grouping Sets share the same workmem limit. */
+				sum_hash_agg_workmem += workmem;
+			}
+			else if (agg_plan->aggstrategy == AGG_SORTED)
+			{
+				/* Every Sort Grouping Set gets its own workmem limit. */
+				max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+				++num_sort_aggs;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2537,6 +2606,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
+		int			workmem;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2559,6 +2630,27 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
 
+		/*
+		 * If we're an AGG_SORTED, but not the last, we must account for the
+		 * working memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates.
+	 */
+
+		/* Estimated memory needed to hold the output: */
+		workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   cost_sort);
+
 		/*
 		 * NOTE: We will place the workmem needed to sort the input (if any)
 		 * on the first agg, the Hash workmem on the first Hash agg, and the
@@ -2567,20 +2659,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
 		{
 			/* All Hash Grouping Sets share the same workmem limit. */
-			plan->plan.workmem_id = add_hash_workmem(root->glob);
+			sum_hash_agg_workmem += workmem;
+			plan->plan.workmem_id = add_hash_workmem(root->glob,
+													 sum_hash_agg_workmem);
 		}
 		else if (plan->aggstrategy == AGG_SORTED)
 		{
 			/* Every Sort Grouping Set gets its own workmem limit. */
+			max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+			++num_sort_aggs;
+
 			first_sort_agg = plan;
 		}
 
 		/* Store the workmem limit, for all Sorts, on the first Sort. */
-		if (first_sort_agg)
-			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+		if (num_sort_aggs > 1)
+		{
+			first_sort_agg->plan.workmem_id =
+				add_workmems(root->glob, max_sort_agg_workmem,
+							 num_sort_aggs > 2 ? 2 : 1);
+		}
 
 		/* Also include working memory needed to sort the input: */
-		plan->sortWorkMemId = add_workmem(root->glob);
+		if (best_path->numSortBuffers > 0)
+		{
+			workmem = compute_agg_input_workmem(subplan->plan_rows,
+												subplan->plan_width);
+
+			plan->sortWorkMemId =
+				add_workmems(root->glob, workmem,
+							 best_path->numSortBuffers * list_length(rollups));
+		}
 	}
 
 	return (Plan *) plan;
@@ -2753,7 +2862,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2795,7 +2905,9 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2833,11 +2945,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(Max(relation_byte_size(leftplan->plan_rows,
+																 leftplan->plan_width),
+											  relation_byte_size(rightplan->plan_rows,
+																 rightplan->plan_width))),
+					 2);
 
 	/* Also include working memory for hash table. */
 	if (plan->numCols > 0)
-		plan->hashWorkMemId = add_hash_workmem(root->glob);
+	{
+		Size		entrysize =
+			sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMemId =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes(plan->numGroups * entrysize));
+	}
 
 	return plan;
 }
@@ -3279,6 +3418,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3390,9 +3530,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its ancestors'
+	 * results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3418,6 +3573,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3429,8 +3586,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3465,6 +3621,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3493,8 +3651,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3540,13 +3697,14 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
 
-		plan->workmem_id = add_workmem(root->glob);
+		plan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows)));
 
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
@@ -3855,7 +4013,15 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(relation_byte_size(scan_plan->scan.plan.plan_rows,
+															 scan_plan->scan.plan.plan_width)),
+					 list_length(functions));
 
 	return scan_plan;
 }
@@ -3900,7 +4066,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4040,7 +4207,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4680,8 +4848,10 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
-
-		matplan->workmem_id = add_workmem(root->glob);
+		matplan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(relation_byte_size(matplan->plan_rows,
+																matplan->plan_width)));
 
 		inner_plan = matplan;
 	}
@@ -5029,7 +5199,9 @@ create_hashjoin_plan(PlannerInfo *root,
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
 	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
-	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+	hash_plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->jpath.path.workmem));
 
 	return join_plan;
 }
@@ -5584,7 +5756,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -5617,7 +5790,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	plan->sort.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -6715,7 +6889,8 @@ materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
-	matplan->workmem_id = add_workmem(glob);
+	matplan->workmem_id =
+		add_workmem(glob, normalize_work_kb(matpath.workmem));
 
 	return matplan;
 }
@@ -7481,12 +7656,22 @@ is_projection_capable_plan(Plan *plan)
 }
 
 static int
-add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category,
+					 int estimate, int count)
 {
+	if (estimate == 0 || count == 0)
+		return 0;
+
 	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	glob->workMemEstimates = lappend_int(glob->workMemEstimates, estimate);
+	glob->workMemCounts = lappend_int(glob->workMemCounts, count);
 	/* the executor will fill this in later: */
 	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
 
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemEstimates));
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemCounts));
 	Assert(list_length(glob->workMemCategories) ==
 		   list_length(glob->workMemLimits));
 
@@ -7499,10 +7684,10 @@ add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
  *
  * This data structure will have its working-memory limit set to work_mem.
  */
-int
-add_workmem(PlannerGlobal *glob)
+static int
+add_workmem(PlannerGlobal *glob, int estimate)
 {
-	return add_workmem_internal(glob, WORKMEM_NORMAL);
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, 1);
 }
 
 /*
@@ -7513,7 +7698,13 @@ add_workmem(PlannerGlobal *glob)
  * hash_mem_multiplier.
  */
 int
-add_hash_workmem(PlannerGlobal *glob)
+add_hash_workmem(PlannerGlobal *glob, int estimate)
+{
+	return add_workmem_internal(glob, WORKMEM_HASH, estimate, 1);
+}
+
+static int
+add_workmems(PlannerGlobal *glob, int estimate, int count)
 {
-	return add_workmem_internal(glob, WORKMEM_HASH);
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, count);
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 56846fdeaab..f7606e513b8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -574,6 +574,8 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_len = parse->stmt_len;
 
 	result->workMemCategories = glob->workMemCategories;
+	result->workMemEstimates = glob->workMemEstimates;
+	result->workMemCounts = glob->workMemCounts;
 	result->workMemLimits = glob->workMemLimits;
 
 	result->jitFlags = PGJIT_NONE;
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..0d0fb5cf8ed 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSortBuffers;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..e3242698789 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1709,6 +1709,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem at hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_work_bytes(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1937,12 +1944,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2289,6 +2298,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3283,6 +3299,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSortBuffers = aggcosts ? aggcosts->numSortBuffers : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3333,6 +3350,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3369,6 +3388,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSortBuffers = agg_costs ? agg_costs->numSortBuffers : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3432,6 +3452,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3444,6 +3466,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3457,12 +3485,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3470,6 +3503,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3619,7 +3663,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3744,7 +3789,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_work_bytes(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3795,7 +3844,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4e9e0d1de1..6cd9bffbee5 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -63,7 +63,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 461db7a8822..1091e884ef7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1272,6 +1272,18 @@ typedef struct PlanState
 #define workMemField(node, field)   \
 	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
 
+/* workmem estimate: */
+#define workMemEstimateFromId(node, id) \
+	(workMemFieldFromId(node, workMemEstimates, id))
+#define workMemEstimate(node) \
+	(workMemField(node, workMemEstimates))
+
+/* workmem count: */
+#define workMemCountFromId(node, id) \
+	(workMemFieldFromId(node, workMemCounts, id))
+#define workMemCount(node) \
+	(workMemField(node, workMemCounts))
+
 /* workmem limit: */
 #define workMemLimitFromId(node, id) \
 	(workMemFieldFromId(node, workMemLimits, id))
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index b2901568ceb..98a0c1f6778 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSortBuffers; /* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -185,9 +186,12 @@ typedef struct PlannerGlobal
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannerGlobal;
@@ -1707,6 +1711,7 @@ typedef struct Path
 	int			disabled_nodes; /* count of disabled nodes */
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
@@ -2301,6 +2306,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2342,6 +2348,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3385,6 +3392,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9f86f37e6ea..44145a51567 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,9 +139,12 @@ typedef struct PlannedStmt
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannedStmt;
@@ -1160,6 +1163,8 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
 	/* 1-based id of workMem to use to sort inputs, or else zero */
 	int			sortWorkMemId;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index e185635c10b..b5c98a39af7 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -108,6 +108,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 3aa3c16e442..587ea412bda 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -217,9 +217,18 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_work_kb(double nkb);
+extern int	normalize_work_bytes(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index bf5e89e8415..91edfe96e27 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -49,8 +49,7 @@ extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
-extern int	add_workmem(PlannerGlobal *glob);
-extern int	add_hash_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob, int estimate);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.47.1

Attachment: 0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream)
From 35654b8c4bbf19a877089353af277bfe9a9c8d5c Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 5 Mar 2025 01:21:20 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/meson.build                  |   1 +
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 409 ++++++++++++++++
 src/backend/executor/execWorkmem.c   |  40 +-
 src/include/executor/executor.h      |   4 +
 9 files changed, 1474 insertions(+), 11 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 952855d9b61..b4880ab7067 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,7 +50,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/meson.build b/contrib/meson.build
index 1ba73ebd67a..fa596ef426f 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -69,4 +69,5 @@ subdir('tsm_system_time')
 subdir('unaccent')
 subdir('uuid-ossp')
 subdir('vacuumlo')
+subdir('workmem')
 subdir('xml2')
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..f69883b0005
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=25600 kB)
+   ->  Sort  (work_mem=N kB) (limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=20480 kB)
+   ->  Sort  (work_mem=N kB) (limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                               workmem_filter                                
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=102400 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=102399 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                     workmem_filter                                     
+----------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=34134 kB)
+         ->  Sort  (work_mem=N kB) (limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=34133 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+              workmem_filter               
+-------------------------------------------
+ Result  (work_mem=N kB) (limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=1024 kB)
+   ->  Sort  (work_mem=N kB) (limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=820 kB)
+   ->  Sort  (work_mem=N kB) (limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=819 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=4095 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                     
+---------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=1366 kB)
+         ->  Sort  (work_mem=N kB) (limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=1365 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB) (limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=20 kB)
+   ->  Sort  (work_mem=N kB) (limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=16 kB)
+   ->  Sort  (work_mem=N kB) (limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=80 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=78 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=27 kB)
+         ->  Sort  (work_mem=N kB) (limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=26 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB) (limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..4e1ec056b80
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from the one in explain.sql. We can't reuse that
+-- function directly, since this test runs in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem and memory estimates, since they might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..d78f60c7d8d
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Extension that adjusts each plan node's work_mem limit so that the
+ *	  query's total working memory stays within workmem.query_work_mem.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "common/int.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to provide to "
+							"each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							64,
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			idx;
+
+	for (idx = 0; idx < list_length(plannedstmt->workMemCategories); ++idx)
+	{
+		WorkMemCategory category;
+		int			count;
+		int			estimate;
+		ListCell   *limit_cell;
+		int			limit;
+		Target	   *target;
+
+		category =
+			(WorkMemCategory) list_nth_int(plannedstmt->workMemCategories, idx);
+		count = list_nth_int(plannedstmt->workMemCounts, idx);
+		estimate = list_nth_int(plannedstmt->workMemEstimates, idx);
+
+		limit = category == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+		limit_cell = list_nth_cell(plannedstmt->workMemLimits, idx);
+		lfirst_int(limit_cell) = limit;
+
+		target = make_target(estimate, &lfirst_int(limit_cell), count);
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit < target->priority -- that is, we restricted
+			 * this target to less memory than (we estimate) it needs -- then
+			 * the target_ratio stays the same. Letting A = remaining,
+			 * B = limit, R = ratio, and X = this target's share of "limit",
+			 * we assigned R*X, so the new ratio is:
+			 *
+			 * (A - R*X) / (B - X) = (R*B - R*X) / (B - X)
+			 *                     = R * (B - X) / (B - X) = R
+			 *
+			 * And if target_limit == target->priority, so we didn't need to
+			 * restrict this target beyond its priority estimate, then the
+			 * target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 *
+	 * We could re-assign working-memory limits on the parallel worker, to
+	 * only those Plan nodes that got sent to the worker, but for now we don't
+	 * bother.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return pg_cmp_s32(((Target *) a->ptr_value)->workmem,
+						  ((Target *) b->ptr_value)->workmem);
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return pg_cmp_s32(((Target *) b->ptr_value)->workmem,
+						  ((Target *) a->ptr_value)->workmem);
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index d8a19a58ebe..37420666065 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -52,6 +52,10 @@
 #include "nodes/plannodes.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
+
 /* ------------------------------------------------------------------------
  *		ExecAssignWorkMem
  *
@@ -64,20 +68,36 @@
  */
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We bail out here, rather than in standard_ExecAssignWorkMem(), in
+		 * case the hook wants to re-assign memory on parallel workers --
+		 * perhaps calling standard_ExecAssignWorkMem() first, as well.
+		 */
+		if (IsParallelWorker())
+			return;
+
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
 	ListCell   *lc_category;
 	ListCell   *lc_limit;
 
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
-
 	forboth(lc_category, plannedstmt->workMemCategories,
 			lc_limit, plannedstmt->workMemLimits)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4147876d55..c12625d2061 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -730,5 +733,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.47.1

Attachment: 0001-Store-working-memory-limit-per-Plan-SubPlan-rather-t.patch (application/octet-stream)
From e944b71823db20666ea691d38c7d354a7130cd1b Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 25 Feb 2025 22:44:01 +0000
Subject: [PATCH 1/4] Store working memory limit per Plan/SubPlan, rather than
 in GUC

This commit moves the working-memory limit that an executor node checks, at
runtime, from the "work_mem" and "hash_mem_multiplier" GUCs, to a new
list, "workMemLimits", added to the PlannedStmt node. At runtime, an exec
node checks its limit by looking up the list element corresponding to its
plan->workmem_id field.

Indirecting the workMemLimit via a List index allows us to handle SubPlans
as well as Plans. It also allows a future extension to set limits on
individual Plans/SubPlans, without needing to re-traverse the Plan +
Expr tree.

To preserve backward compatibility, this commit also copies the "work_mem",
etc., values
from the existing GUCs to the new field. This means that this commit is
just a refactoring, and doesn't change any behavior.

This "workmem_id" field is on the Plan node, instead of the corresponding
PlanState, because the workMemLimit needs to be set before we can call
ExecInitNode().
---
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execParallel.c        |   2 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         |  87 ++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  64 ++++++---
 src/backend/executor/nodeBitmapIndexscan.c |   2 +-
 src/backend/executor/nodeBitmapOr.c        |   2 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  22 +++-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  14 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |  16 +++
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  15 ++-
 src/backend/optimizer/plan/createplan.c    | 146 ++++++++++++++++++---
 src/backend/optimizer/plan/planner.c       |   5 +-
 src/backend/optimizer/plan/subselect.c     |   2 +-
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   3 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/execnodes.h              |  13 ++
 src/include/nodes/pathnodes.h              |  11 ++
 src/include/nodes/plannodes.h              |  27 +++-
 src/include/nodes/primnodes.h              |   3 +
 src/include/optimizer/planmain.h           |   4 +-
 35 files changed, 433 insertions(+), 66 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 33b124fbb0a..bcd1822da80 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,15 +176,18 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
 
 	Assert(nbuckets > 0);
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0493b7d5365..78fd887a84d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1050,6 +1050,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 1bedb808368..97d83bae571 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -213,6 +213,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->utilityStmt = NULL;
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
+	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
 	return nodeToString(pstmt);
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..d8a19a58ebe
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		its working-memory limit.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off the
+ *		PlannedStmt's workMemLimits List, at the (1-based) position indicated
+ *		by the PlanState's Plan node's "workmem_id" field.
+ *
+ *		We assign the workmem_id and expand the workMemLimits List, when
+ *		creating the Plan node; and then we set this limit by calling
+ *		ExecAssignWorkMem(), from InitPlan(), before we initialize the PlanState
+ *		nodes.
+ *
+ * 		The workMemLimit has always applied "per data structure," rather than
+ *		"per PlanState". So a single SQL operator (e.g., RecursiveUnion) can
+ *		use more than the workMemLimit, even though each of its data
+ *		structures is restricted to it.
+ *
+ *		We store the "workmem_id" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "workmem_id" fields on the Plan,
+ *		but set the workMemLimit when we start execution, as part of
+ *		InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "nodes/plannodes.h"
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Assigns working memory to any Plans or SubPlans that need it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	ListCell   *lc_category;
+	ListCell   *lc_limit;
+
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	forboth(lc_category, plannedstmt->workMemCategories,
+			lc_limit, plannedstmt->workMemLimits)
+	{
+		lfirst_int(lc_limit) = lfirst_int(lc_category) == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ceb8c8a8039..b06306d4961 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static long hash_choose_num_buckets(double hashentrysize,
@@ -411,6 +413,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -433,7 +436,7 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
 							   int used_bits, double input_groups,
-							   double hashentrysize);
+							   double hashentrysize, Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -521,6 +524,15 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   aggstate->phases[1].aggnode->plan.workmem_id);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -528,7 +540,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -584,6 +596,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 	 */
 	if (pertrans->aggsortrequired)
 	{
+		int			workmem_limit;
+
 		/*
 		 * In case of rescan, maybe there could be an uncompleted sort
 		 * operation?  Clean it up if so.
@@ -591,6 +605,12 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		/*
+		 * Read the sort-input workmem limit off the first Agg node.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   ((Agg *) aggstate->ss.ps.plan)->sortWorkMemId);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -606,7 +626,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workmem_limit, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -616,7 +636,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workmem_limit, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -1498,7 +1518,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1508,7 +1528,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1537,6 +1558,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 hashcxt,
 											 tmpcxt,
@@ -1805,12 +1827,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1830,6 +1851,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1927,7 +1949,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) workMemLimit(aggstate) * 1024);
 		}
 	}
 }
@@ -2014,9 +2037,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2156,7 +2179,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2630,7 +2654,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) workMemLimit(aggstate) * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2718,7 +2744,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2916,13 +2943,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3649,6 +3678,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) workMemLimit(aggstate) * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 0b32c3a022f..0b33a1f4533 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		tbm = tbm_create(workMemLimit(node) * (Size) 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..16d0a164292 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,7 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				result = tbm_create(workMemLimit(subnode) * (Size) 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..08f48f88e65 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, workMemLimit(scanstate));
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..fda42a278b8 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..bb9af08dc5d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -448,6 +448,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -471,11 +472,15 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) workMemLimit(state) * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
+							worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs);
 
@@ -599,6 +604,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,7 +664,8 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t worker_space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs)
@@ -687,9 +694,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -706,7 +713,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -808,7 +815,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
 									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs);
@@ -929,7 +937,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
@@ -1235,7 +1243,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..7a92c1eb2c0 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												workMemLimit(pstate),
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..bf5e921a205 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, workMemLimit(node));
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..4e3da4aab6b 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) workMemLimit(mstate) * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..5ffffd327d2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -33,6 +33,8 @@ build_hash_table(RecursiveUnionState *rustate)
 {
 	RecursiveUnion *node = (RecursiveUnion *) rustate->ps.plan;
 	TupleDesc	desc = ExecGetResultType(outerPlanState(rustate));
+	int			workmem_limit = workMemLimitFromId(rustate,
+												   node->hashWorkMemId);
 
 	Assert(node->numCols > 0);
 	Assert(node->numGroups > 0);
@@ -52,6 +54,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) workmem_limit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +205,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table = tuplestore_begin_heap(false, false,
+												   workMemLimit(rustate));
+	rustate->intermediate_table = tuplestore_begin_heap(false, false,
+														workMemLimit(rustate));
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 5b7ff9c3748..2e256f634c8 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) workMemLimit(setopstate) * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..8ec939e25d7 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   workMemLimit(pstate),
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 49767ed6a52..2d0df165c25 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -536,6 +536,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 	if (node->hashtable)
 		ResetTupleHashTable(node->hashtable);
 	else
+	{
+		int			workmem_limit;
+
+		workmem_limit = workMemLimitFromId(planstate,
+										   subplan->hashtab_workmem_id);
+
 		node->hashtable = BuildTupleHashTable(node->parent,
 											  node->descRight,
 											  &TTSOpsVirtual,
@@ -546,10 +552,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
 											  false);
+	}
 
 	if (!subplan->unknownEqFalse)
 	{
@@ -565,6 +573,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		if (node->hashnulls)
 			ResetTupleHashTable(node->hashnulls);
 		else
+		{
+			int			workmem_limit;
+
+			workmem_limit = workMemLimitFromId(planstate,
+											   subplan->hashnul_workmem_id);
+
 			node->hashnulls = BuildTupleHashTable(node->parent,
 												  node->descRight,
 												  &TTSOpsVirtual,
@@ -575,10 +589,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
 												  false);
+		}
 	}
 	else
 		node->hashnulls = NULL;
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..f679bd67bee 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 workMemLimit(tstate));
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..7660aa626b6 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 workMemLimit(winstate));
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 73d78617009..ca4ab9bd315 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2802,7 +2802,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4224,6 +4225,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
+							get_hash_memory_limit(),
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
@@ -4541,6 +4543,17 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 */
+		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 816a2b2a576..97e43d49d1f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1656,6 +1656,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1710,6 +1712,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1856,6 +1860,8 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 best_path->path.rows,
 								 0,
 								 subplan);
+
+		plan->workmem_id = add_hash_workmem(root->glob);
 	}
 	else
 	{
@@ -2202,6 +2208,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2228,6 +2236,8 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2339,6 +2349,12 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	if (plan->aggstrategy == AGG_HASHED)
+		plan->plan.workmem_id = add_hash_workmem(root->glob);
+
+	/* Also include working memory needed to sort the input: */
+	plan->sortWorkMemId = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2392,6 +2408,7 @@ static Plan *
 create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 {
 	Agg		   *plan;
+	Agg		   *first_sort_agg = NULL;
 	Plan	   *subplan;
 	List	   *rollups = best_path->rollups;
 	AttrNumber *grouping_map;
@@ -2457,7 +2474,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
@@ -2480,19 +2497,19 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2503,6 +2520,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
+			if (agg_plan->aggstrategy == AGG_SORTED && !first_sort_agg)
+			{
+				/* This might be the first Sort agg. */
+				first_sort_agg = agg_plan;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2535,6 +2558,29 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * NOTE: We will place the workmem needed to sort the input (if any)
+		 * on the first agg, the Hash workmem on the first Hash agg, and the
+		 * Sort workmem (if any) on the first Sort agg.
+		 */
+		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
+		{
+			/* All Hash Grouping Sets share the same workmem limit. */
+			plan->plan.workmem_id = add_hash_workmem(root->glob);
+		}
+		else if (plan->aggstrategy == AGG_SORTED)
+		{
+			/* Every Sort Grouping Set gets its own workmem limit. */
+			first_sort_agg = plan;
+		}
+
+		/* Store the workmem limit, for all Sorts, on the first Sort. */
+		if (first_sort_agg)
+			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+
+		/* Also include working memory needed to sort the input: */
+		plan->sortWorkMemId = add_workmem(root->glob);
 	}
 
 	return (Plan *) plan;
@@ -2707,6 +2753,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2747,6 +2795,8 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2783,6 +2833,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
+	/* Also include working memory for hash table. */
+	if (plan->numCols > 0)
+		plan->hashWorkMemId = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -3489,6 +3545,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
+
+		plan->workmem_id = add_workmem(root->glob);
+
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
 		subindexquals = NIL;
@@ -3796,6 +3855,8 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -3839,6 +3900,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -3977,6 +4040,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -4616,6 +4681,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
 
+		matplan->workmem_id = add_workmem(root->glob);
+
 		inner_plan = matplan;
 	}
 
@@ -4961,6 +5028,9 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return join_plan;
 }
 
@@ -5513,6 +5583,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -5544,6 +5616,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -6595,14 +6669,14 @@ make_material(Plan *lefttree)
 
 /*
  * materialize_finished_plan: stick a Material node atop a completed plan
  *
  * There are a couple of places where we want to attach a Material node
  * after completion of create_plan(), without any MaterialPath path.
  * Those places should probably be refactored someday to do this on the
  * Path representation, but it's not worth the trouble yet.
  */
 Plan *
-materialize_finished_plan(Plan *subplan)
+materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 {
 	Plan	   *matplan;
 	Path		matpath;		/* dummy for result of cost_material */
@@ -6641,6 +6715,8 @@ materialize_finished_plan(Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
+	matplan->workmem_id = add_workmem(glob);
+
 	return matplan;
 }
 
@@ -7403,3 +7479,41 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+static int
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+{
+	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	/* the executor will fill this in later: */
+	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
+
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemLimits));
+
+	return list_length(glob->workMemCategories);
+}
+
+/*
+ * add_workmem
+ *	  Add (non-hash) workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem.
+ */
+int
+add_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_NORMAL);
+}
+
+/*
+ * add_hash_workmem
+ *	  Add hash workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem *
+ * hash_mem_multiplier.
+ */
+int
+add_hash_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_HASH);
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..56846fdeaab 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -437,7 +437,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	if (cursorOptions & CURSOR_OPT_SCROLL)
 	{
 		if (!ExecSupportsBackwardScan(top_plan))
-			top_plan = materialize_finished_plan(top_plan);
+			top_plan = materialize_finished_plan(glob, top_plan);
 	}
 
 	/*
@@ -573,6 +573,9 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_location = parse->stmt_location;
 	result->stmt_len = parse->stmt_len;
 
+	result->workMemCategories = glob->workMemCategories;
+	result->workMemLimits = glob->workMemLimits;
+
 	result->jitFlags = PGJIT_NONE;
 	if (jit_enabled && jit_above_cost >= 0 &&
 		top_plan->total_cost > jit_above_cost)
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 8230cbea3c3..27ccd04cada 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -533,7 +533,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
 		 */
 		else if (splan->parParam == NIL && enable_material &&
 				 !ExecMaterializesOutput(nodeTag(plan)))
-			plan = materialize_finished_plan(plan);
+			plan = materialize_finished_plan(root->glob, plan);
 
 		result = (Node *) splan;
 		isInitPlan = false;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d12e3f451d2..c4147876d55 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -499,6 +500,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -724,4 +726,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* -- might be shared with other workers */
+	size_t		worker_space_allowed;	/* -- exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 34b82d0f5d1..dee74d42d13 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,7 +329,8 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
+								int used_bits,
+								Size hash_mem_limit, Size *mem_limit,
 								uint64 *ngroups_limit, int *num_partitions);
 
 /* parallel instrumentation support */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..e4e9e0d1de1 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,8 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t worker_space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a323fa98bbb..461db7a8822 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1265,6 +1265,19 @@ typedef struct PlanState
 			((PlanState *)(node))->instrument->nfiltered2 += (delta); \
 	} while(0)
 
+/* macros for fetching the workmem info associated with a PlanState */
+#define workMemFieldFromId(node, field, id)								\
+	(list_nth_int(((PlanState *)(node))->state->es_plannedstmt->field, \
+				  (id) - 1))
+#define workMemField(node, field)   \
+	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
+
+/* workmem limit: */
+#define workMemLimitFromId(node, id) \
+	(workMemFieldFromId(node, workMemLimits, id))
+#define workMemLimit(node) \
+	(workMemField(node, workMemLimits))
+
 /*
  * EPQState is state for executing an EvalPlanQual recheck on a candidate
  * tuples e.g. in ModifyTable or LockRows.
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index fbf05322c75..b2901568ceb 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -179,6 +179,17 @@ typedef struct PlannerGlobal
 
 	/* partition descriptors */
 	PartitionDirectory partition_directory pg_node_attr(read_write_ignore);
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannerGlobal;
 
 /* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index bf1f25c0dba..9f86f37e6ea 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -133,13 +133,23 @@ typedef struct PlannedStmt
 	ParseLoc	stmt_location;
 	/* length in bytes; 0 means "rest of string" */
 	ParseLoc	stmt_len;
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
 #define exec_subplan_get_plan(plannedstmt, subplan) \
 	((Plan *) list_nth((plannedstmt)->subplans, (subplan)->plan_id - 1))
 
-
 /* ----------------
  *		Plan node
  *
@@ -195,6 +205,8 @@ typedef struct Plan
 	 */
 	/* unique across entire final plan tree */
 	int			plan_node_id;
+	/* 1-based id of workMem to use, or else zero */
+	int			workmem_id;
 	/* target list to be computed at this node */
 	List	   *targetlist;
 	/* implicitly-ANDed qual conditions */
@@ -426,6 +438,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* 1-based id of workMem to use for hash table, or else zero */
+	int			hashWorkMemId;
 } RecursiveUnion;
 
 /* ----------------
@@ -1145,6 +1160,9 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* 1-based id of workMem to use to sort inputs, or else zero */
+	int			sortWorkMemId;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
@@ -1758,4 +1776,11 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+/* different data structures get different working-memory limits */
+typedef enum WorkMemCategory
+{
+	WORKMEM_NORMAL,				/* gets work_mem */
+	WORKMEM_HASH,				/* gets hash_mem_multiplier * work_mem */
+}			WorkMemCategory;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d0576da3e25..2698cf09304 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1109,6 +1109,9 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	/* 1-based id of workMem to use, or else zero: */
+	int			hashtab_workmem_id; /* for hash table */
+	int			hashnul_workmem_id; /* for NULLs hash table */
 } SubPlan;
 
 /*
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5a930199611..bf5e89e8415 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -46,9 +46,11 @@ extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
 									 Plan *outer_plan);
 extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 									bool tlist_parallel_safe);
-extern Plan *materialize_finished_plan(Plan *subplan);
+extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
+extern int	add_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.47.1

#25Álvaro Herrera
alvherre@kurilemu.de
In reply to: James Hunter (#24)
4 attachment(s)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

Here's a rebased version of this patch. I didn't review it or touch it
in any way, just fixed conflicts from current master.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)

Attachments:

0001-Store-working-memory-limit-per-Plan-SubPlan-rather-t.patch (text/x-diff; charset=utf-8)
From 03df3a9feb6710cf34845fecfad3086389a45635 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 25 Feb 2025 22:44:01 +0000
Subject: [PATCH 1/4] Store working memory limit per Plan/SubPlan, rather than
 in GUC

This commit moves the working-memory limit that an executor node checks, at
runtime, from the "work_mem" and "hash_mem_multiplier" GUCs, to a new
list, "workMemLimits", added to the PlannedStmt node. At runtimem an exec
node checks its limit by looking up the list element corresponding to its
plan->workmem_id field.

Indirecting the workMemLimit via a List index allows us to handle SubPlans
as well as Plans. It also allows a future extension to set limits on
individual Plans/SubPlans, without needing to re-traverse the Plan +
Expr tree.

To preserve backward compatibility, this commit also copies the "work_mem",
etc., values from the existing GUCs to the new lists. This means that this commit is
just a refactoring, and doesn't change any behavior.

This "workmem_id" field is on the Plan node, instead of the corresponding
PlanState, because the workMemLimit needs to be set before we can call
ExecInitNode().
---
 src/backend/executor/Makefile              |   1 +
 src/backend/executor/execGrouping.c        |  10 +-
 src/backend/executor/execMain.c            |   6 +
 src/backend/executor/execParallel.c        |   2 +
 src/backend/executor/execSRF.c             |   5 +-
 src/backend/executor/execWorkmem.c         |  87 ++++++++++++
 src/backend/executor/meson.build           |   1 +
 src/backend/executor/nodeAgg.c             |  64 ++++++---
 src/backend/executor/nodeBitmapIndexscan.c |   2 +-
 src/backend/executor/nodeBitmapOr.c        |   2 +-
 src/backend/executor/nodeCtescan.c         |   3 +-
 src/backend/executor/nodeFunctionscan.c    |   2 +
 src/backend/executor/nodeHash.c            |  22 +++-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeMaterial.c        |   3 +-
 src/backend/executor/nodeMemoize.c         |   2 +-
 src/backend/executor/nodeRecursiveunion.c  |  14 +-
 src/backend/executor/nodeSetOp.c           |   1 +
 src/backend/executor/nodeSort.c            |   4 +-
 src/backend/executor/nodeSubplan.c         |  16 +++
 src/backend/executor/nodeTableFuncscan.c   |   3 +-
 src/backend/executor/nodeWindowAgg.c       |   3 +-
 src/backend/optimizer/path/costsize.c      |  16 ++-
 src/backend/optimizer/plan/createplan.c    | 146 ++++++++++++++++++---
 src/backend/optimizer/plan/planner.c       |   5 +-
 src/backend/optimizer/plan/subselect.c     |   2 +-
 src/include/executor/executor.h            |   7 +
 src/include/executor/hashjoin.h            |   3 +-
 src/include/executor/nodeAgg.h             |   3 +-
 src/include/executor/nodeHash.h            |   3 +-
 src/include/nodes/execnodes.h              |  13 ++
 src/include/nodes/pathnodes.h              |  11 ++
 src/include/nodes/plannodes.h              |  27 +++-
 src/include/nodes/primnodes.h              |   3 +
 src/include/optimizer/planmain.h           |   4 +-
 35 files changed, 434 insertions(+), 66 deletions(-)
 create mode 100644 src/backend/executor/execWorkmem.c

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..8aa9580558f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -30,6 +30,7 @@ OBJS = \
 	execScan.o \
 	execTuples.o \
 	execUtils.o \
+	execWorkmem.o \
 	functions.o \
 	instrument.o \
 	nodeAgg.o \
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index b5400749353..24e8034e4ee 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -168,6 +168,7 @@ BuildTupleHashTable(PlanState *parent,
 					Oid *collations,
 					long nbuckets,
 					Size additionalsize,
+					Size hash_mem_limit,
 					MemoryContext metacxt,
 					MemoryContext tablecxt,
 					MemoryContext tempcxt,
@@ -175,7 +176,6 @@ BuildTupleHashTable(PlanState *parent,
 {
 	TupleHashTable hashtable;
 	Size		entrysize;
-	Size		hash_mem_limit;
 	MemoryContext oldcontext;
 	bool		allow_jit;
 	uint32		hash_iv = 0;
@@ -184,8 +184,12 @@ BuildTupleHashTable(PlanState *parent,
 	additionalsize = MAXALIGN(additionalsize);
 	entrysize = sizeof(TupleHashEntryData) + additionalsize;
 
-	/* Limit initial table size request to not more than hash_mem */
-	hash_mem_limit = get_hash_memory_limit() / entrysize;
+	/*
+	 * Limit initial table size request to not more than hash_mem
+	 *
+	 * XXX - we should also limit the *maximum* table size to hash_mem.
+	 */
+	hash_mem_limit = hash_mem_limit / entrysize;
 	if (nbuckets > hash_mem_limit)
 		nbuckets = hash_mem_limit;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0391798dd2c..6aa9dde5a80 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -949,6 +949,12 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/* signal that this EState is not used for EPQ */
 	estate->es_epq_active = NULL;
 
+	/*
+	 * Assign working memory to SubPlan and Plan nodes, before initializing
+	 * their states.
+	 */
+	ExecAssignWorkMem(plannedstmt);
+
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..a8cb631963e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -216,6 +216,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->utilityStmt = NULL;
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
+	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
 	return nodeToString(pstmt);
diff --git a/src/backend/executor/execSRF.c b/src/backend/executor/execSRF.c
index a03fe780a02..4b1e7e0ad1e 100644
--- a/src/backend/executor/execSRF.c
+++ b/src/backend/executor/execSRF.c
@@ -102,6 +102,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 							ExprContext *econtext,
 							MemoryContext argContext,
 							TupleDesc expectedDesc,
+							int workMem,
 							bool randomAccess)
 {
 	Tuplestorestate *tupstore = NULL;
@@ -261,7 +262,7 @@ ExecMakeTableFunctionResult(SetExprState *setexpr,
 				MemoryContext oldcontext =
 					MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-				tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+				tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 				rsinfo.setResult = tupstore;
 				if (!returnsTuple)
 				{
@@ -396,7 +397,7 @@ no_function_result:
 		MemoryContext oldcontext =
 			MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
 
-		tupstore = tuplestore_begin_heap(randomAccess, false, work_mem);
+		tupstore = tuplestore_begin_heap(randomAccess, false, workMem);
 		rsinfo.setResult = tupstore;
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
new file mode 100644
index 00000000000..d8a19a58ebe
--- /dev/null
+++ b/src/backend/executor/execWorkmem.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * execWorkmem.c
+ *	 routine to set the "workmem_limit" field(s) on Plan nodes that need
+ *   working memory.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execWorkmem.c
+ *
+ * INTERFACE ROUTINES
+ *		ExecAssignWorkMem	- assign working memory to Plan nodes
+ *
+ *	 NOTES
+ *		Historically, every PlanState node, during initialization, looked at
+ *		the "work_mem" (plus maybe "hash_mem_multiplier") GUC, to determine
+ *		its working-memory limit.
+ *
+ *		Now, to allow different PlanState nodes to be restricted to different
+ *		amounts of memory, each PlanState node reads this limit off the
+ *		PlannedStmt's workMemLimits List, at the (1-based) position indicated
+ *		by the PlanState's Plan node's "workmem_id" field.
+ *
+ *		We assign the workmem_id and expand the workMemLimits List, when
+ *		creating the Plan node; and then we set this limit by calling
+ *		ExecAssignWorkMem(), from InitPlan(), before we initialize the PlanState
+ *		nodes.
+ *
+ * 		The workMemLimit has always applied "per data structure," rather than
+ *		"per PlanState". So a single SQL operator (e.g., RecursiveUnion) can
+ *		use more than the workMemLimit, even though each of its data
+ *		structures is restricted to it.
+ *
+ *		We store the "workmem_id" field(s) on the Plan, instead of the
+ *		PlanState, even though it conceptually belongs to execution rather than
+ *		to planning, because we need it to be set before initializing the
+ *		corresponding PlanState. This is a chicken-and-egg problem. We could,
+ *		of course, make ExecInitNode() a two-phase operation, but that seems
+ *		like overkill. Instead, we store these "workmem_id" fields on the Plan,
+ *		but set the workMemLimit when we start execution, as part of
+ *		InitPlan().
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "nodes/plannodes.h"
+
+
+/* ------------------------------------------------------------------------
+ *		ExecAssignWorkMem
+ *
+ *		Assigns working memory to any Plans or SubPlans that need it.
+ *
+ *		Inputs:
+ *		  'plannedstmt' is the statement to which we assign working memory
+ *
+ * ------------------------------------------------------------------------
+ */
+void
+ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	ListCell   *lc_category;
+	ListCell   *lc_limit;
+
+	/*
+	 * No need to re-assign working memory on parallel workers, since workers
+	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
+	 *
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	forboth(lc_category, plannedstmt->workMemCategories,
+			lc_limit, plannedstmt->workMemLimits)
+	{
+		lfirst_int(lc_limit) = lfirst_int(lc_category) == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..4e65974f5f3 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'execScan.c',
   'execTuples.c',
   'execUtils.c',
+  'execWorkmem.c',
   'functions.c',
   'instrument.c',
   'nodeAgg.c',
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 377e016d732..bff143a8a8e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -258,6 +258,7 @@
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "executor/nodeHash.h"
 #include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -403,7 +404,8 @@ static void find_cols(AggState *aggstate, Bitmapset **aggregated,
 					  Bitmapset **unaggregated);
 static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
+static void build_hash_table(AggState *aggstate, int setno, long nbuckets,
+							 Size hash_mem_limit);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 										  bool nullcheck);
 static void hash_create_memory(AggState *aggstate);
@@ -412,6 +414,7 @@ static long hash_choose_num_buckets(double hashentrysize,
 static int	hash_choose_num_partitions(double input_groups,
 									   double hashentrysize,
 									   int used_bits,
+									   Size hash_mem_limit,
 									   int *log2_npartitions);
 static void initialize_hash_entry(AggState *aggstate,
 								  TupleHashTable hashtable,
@@ -434,7 +437,7 @@ static HashAggBatch *hashagg_batch_new(LogicalTape *input_tape, int setno,
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset,
 							   int used_bits, double input_groups,
-							   double hashentrysize);
+							   double hashentrysize, Size hash_mem_limit);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 								TupleTableSlot *inputslot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
@@ -522,6 +525,15 @@ initialize_phase(AggState *aggstate, int newphase)
 		Sort	   *sortnode = aggstate->phases[newphase + 1].sortnode;
 		PlanState  *outerNode = outerPlanState(aggstate);
 		TupleDesc	tupDesc = ExecGetResultType(outerNode);
+		int			workmem_limit;
+
+		/*
+		 * Read the sort-output workmem limit off the first AGG_SORTED node.
+		 * Since phase 0 is always AGG_HASHED, this will always be phase 1.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   aggstate->phases[1].aggnode->plan.workmem_id);
 
 		aggstate->sort_out = tuplesort_begin_heap(tupDesc,
 												  sortnode->numCols,
@@ -529,7 +541,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->sortOperators,
 												  sortnode->collations,
 												  sortnode->nullsFirst,
-												  work_mem,
+												  workmem_limit,
 												  NULL, TUPLESORT_NONE);
 	}
 
@@ -585,6 +597,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 	 */
 	if (pertrans->aggsortrequired)
 	{
+		int			workmem_limit;
+
 		/*
 		 * In case of rescan, maybe there could be an uncompleted sort
 		 * operation?  Clean it up if so.
@@ -592,6 +606,12 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 		if (pertrans->sortstates[aggstate->current_set])
 			tuplesort_end(pertrans->sortstates[aggstate->current_set]);
 
+		/*
+		 * Read the sort-input workmem limit off the first Agg node.
+		 */
+		workmem_limit =
+			workMemLimitFromId(aggstate,
+							   ((Agg *) aggstate->ss.ps.plan)->sortWorkMemId);
 
 		/*
 		 * We use a plain Datum sorter when there's a single input column;
@@ -607,7 +627,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, TUPLESORT_NONE);
+									  workmem_limit, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -617,7 +637,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, TUPLESORT_NONE);
+									 workmem_limit, NULL, TUPLESORT_NONE);
 	}
 
 	/*
@@ -1496,7 +1516,7 @@ build_hash_tables(AggState *aggstate)
 		}
 #endif
 
-		build_hash_table(aggstate, setno, nbuckets);
+		build_hash_table(aggstate, setno, nbuckets, memory);
 	}
 
 	aggstate->hash_ngroups_current = 0;
@@ -1506,7 +1526,8 @@ build_hash_tables(AggState *aggstate)
  * Build a single hashtable for this grouping set.
  */
 static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, int setno, long nbuckets,
+				 Size hash_mem_limit)
 {
 	AggStatePerHash perhash = &aggstate->perhash[setno];
 	MemoryContext metacxt = aggstate->hash_metacxt;
@@ -1535,6 +1556,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 											 perhash->aggnode->grpCollations,
 											 nbuckets,
 											 additionalsize,
+											 hash_mem_limit,
 											 metacxt,
 											 tablecxt,
 											 tmpcxt,
@@ -1807,12 +1829,11 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  */
 void
 hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
-					Size *mem_limit, uint64 *ngroups_limit,
+					Size hash_mem_limit, Size *mem_limit, uint64 *ngroups_limit,
 					int *num_partitions)
 {
 	int			npartitions;
 	Size		partition_mem;
-	Size		hash_mem_limit = get_hash_memory_limit();
 
 	/* if not expected to spill, use all of hash_mem */
 	if (input_groups * hashentrysize <= hash_mem_limit)
@@ -1832,6 +1853,7 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	npartitions = hash_choose_num_partitions(input_groups,
 											 hashentrysize,
 											 used_bits,
+											 hash_mem_limit,
 											 NULL);
 	if (num_partitions != NULL)
 		*num_partitions = npartitions;
@@ -1932,7 +1954,8 @@ hash_agg_enter_spill_mode(AggState *aggstate)
 
 			hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 							   perhash->aggnode->numGroups,
-							   aggstate->hashentrysize);
+							   aggstate->hashentrysize,
+							   (Size) workMemLimit(aggstate) * 1024);
 		}
 	}
 }
@@ -2081,9 +2104,9 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  */
 static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
-						   int used_bits, int *log2_npartitions)
+						   int used_bits, Size hash_mem_limit,
+						   int *log2_npartitions)
 {
-	Size		hash_mem_limit = get_hash_memory_limit();
 	double		partition_limit;
 	double		mem_wanted;
 	double		dpartitions;
@@ -2219,7 +2242,8 @@ lookup_hash_entries(AggState *aggstate)
 			if (spill->partitions == NULL)
 				hashagg_spill_init(spill, aggstate->hash_tapeset, 0,
 								   perhash->aggnode->numGroups,
-								   aggstate->hashentrysize);
+								   aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 
 			hashagg_spill_tuple(aggstate, spill, slot, hash);
 			pergroup[setno] = NULL;
@@ -2693,7 +2717,9 @@ agg_refill_hash_table(AggState *aggstate)
 	aggstate->hash_batches = list_delete_last(aggstate->hash_batches);
 
 	hash_agg_set_limits(aggstate->hashentrysize, batch->input_card,
-						batch->used_bits, &aggstate->hash_mem_limit,
+						batch->used_bits,
+						(Size) workMemLimit(aggstate) * 1024,
+						&aggstate->hash_mem_limit,
 						&aggstate->hash_ngroups_limit, NULL);
 
 	/*
@@ -2783,7 +2809,8 @@ agg_refill_hash_table(AggState *aggstate)
 				 */
 				spill_initialized = true;
 				hashagg_spill_init(&spill, tapeset, batch->used_bits,
-								   batch->input_card, aggstate->hashentrysize);
+								   batch->input_card, aggstate->hashentrysize,
+								   (Size) workMemLimit(aggstate) * 1024);
 			}
 			/* no memory for a new group, spill */
 			hashagg_spill_tuple(aggstate, &spill, spillslot, hash);
@@ -2982,13 +3009,15 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
  */
 static void
 hashagg_spill_init(HashAggSpill *spill, LogicalTapeSet *tapeset, int used_bits,
-				   double input_groups, double hashentrysize)
+				   double input_groups, double hashentrysize,
+				   Size hash_mem_limit)
 {
 	int			npartitions;
 	int			partition_bits;
 
 	npartitions = hash_choose_num_partitions(input_groups, hashentrysize,
-											 used_bits, &partition_bits);
+											 used_bits, hash_mem_limit,
+											 &partition_bits);
 
 #ifdef USE_INJECTION_POINTS
 	if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-single-partition"))
@@ -3712,6 +3741,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			totalGroups += aggstate->perhash[k].aggnode->numGroups;
 
 		hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+							(Size) workMemLimit(aggstate) * 1024,
 							&aggstate->hash_mem_limit,
 							&aggstate->hash_ngroups_limit,
 							&aggstate->hash_planned_partitions);
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index abbb033881a..8bbf1d047c4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -91,7 +91,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	else
 	{
 		/* XXX should we use less than work_mem for this? */
-		tbm = tbm_create(work_mem * (Size) 1024,
+		tbm = tbm_create(workMemLimit(node) * (Size) 1024,
 						 ((BitmapIndexScan *) node->ss.ps.plan)->isshared ?
 						 node->ss.ps.state->es_query_dsa : NULL);
 	}
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 231760ec93d..16d0a164292 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -143,7 +143,7 @@ MultiExecBitmapOr(BitmapOrState *node)
 			if (result == NULL) /* first subplan */
 			{
 				/* XXX should we use less than work_mem for this? */
-				result = tbm_create(work_mem * (Size) 1024,
+				result = tbm_create(workMemLimit(subnode) * (Size) 1024,
 									((BitmapOr *) node->ps.plan)->isshared ?
 									node->ps.state->es_query_dsa : NULL);
 			}
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index e1675f66b43..08f48f88e65 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -232,7 +232,8 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 		/* I am the leader */
 		prmdata->value = PointerGetDatum(scanstate);
 		scanstate->leader = scanstate;
-		scanstate->cte_table = tuplestore_begin_heap(true, false, work_mem);
+		scanstate->cte_table =
+			tuplestore_begin_heap(true, false, workMemLimit(scanstate));
 		tuplestore_set_eflags(scanstate->cte_table, scanstate->eflags);
 		scanstate->readptr = 0;
 	}
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 644363582d9..fda42a278b8 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -95,6 +95,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											node->funcstates[0].tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
@@ -154,6 +155,7 @@ FunctionNext(FunctionScanState *node)
 											node->ss.ps.ps_ExprContext,
 											node->argcontext,
 											fs->tupdesc,
+											workMemLimit(node),
 											node->eflags & EXEC_FLAG_BACKWARD);
 
 			/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..bb9af08dc5d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -448,6 +448,7 @@ ExecHashTableCreate(HashState *state)
 	Hash	   *node;
 	HashJoinTable hashtable;
 	Plan	   *outerNode;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;
 	int			nbuckets;
 	int			nbatch;
@@ -471,11 +472,15 @@ ExecHashTableCreate(HashState *state)
 	 */
 	rows = node->plan.parallel_aware ? node->rows_total : outerNode->plan_rows;
 
+	worker_space_allowed = (size_t) workMemLimit(state) * 1024;
+	Assert(worker_space_allowed > 0);
+
 	ExecChooseHashTableSize(rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
+							worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs);
 
@@ -599,6 +604,7 @@ ExecHashTableCreate(HashState *state)
 		{
 			pstate->nbatch = nbatch;
 			pstate->space_allowed = space_allowed;
+			pstate->worker_space_allowed = worker_space_allowed;
 			pstate->growth = PHJ_GROWTH_OK;
 
 			/* Set up the shared state for coordinating batches. */
@@ -658,7 +664,8 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t *space_allowed,
+						size_t worker_space_allowed,
+						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
 						int *num_skew_mcvs)
@@ -687,9 +694,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
-	 * Compute in-memory hashtable size limit from GUCs.
+	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = get_hash_memory_limit();
+	hash_table_bytes = worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -706,7 +713,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		hash_table_bytes = (size_t) newlimit;
 	}
 
-	*space_allowed = hash_table_bytes;
+	*total_space_allowed = hash_table_bytes;
 
 	/*
 	 * If skew optimization is possible, estimate the number of skew buckets
@@ -808,7 +815,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		{
 			ExecChooseHashTableSize(ntuples, tupwidth, useskew,
 									false, parallel_workers,
-									space_allowed,
+									worker_space_allowed,
+									total_space_allowed,
 									numbuckets,
 									numbatches,
 									num_skew_mcvs);
@@ -929,7 +937,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
-		*space_allowed = (*space_allowed) * 2;
+		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
 	Assert(nbuckets > 0);
@@ -1235,7 +1243,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 					 * to switch from one large combined memory budget to the
 					 * regular hash_mem budget.
 					 */
-					pstate->space_allowed = get_hash_memory_limit();
+					pstate->space_allowed = pstate->worker_space_allowed;
 
 					/*
 					 * The combined hash_mem of all participants wasn't
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 975b0397e7a..7a92c1eb2c0 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -312,7 +312,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.sortOperators[nPresortedCols]),
 												&(plannode->sort.collations[nPresortedCols]),
 												&(plannode->sort.nullsFirst[nPresortedCols]),
-												work_mem,
+												workMemLimit(pstate),
 												NULL,
 												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
@@ -613,7 +613,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.sortOperators,
 												  plannode->sort.collations,
 												  plannode->sort.nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  node->bounded ?
 												  TUPLESORT_ALLOWBOUNDED :
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9798bb75365..bf5e921a205 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -61,7 +61,8 @@ ExecMaterial(PlanState *pstate)
 	 */
 	if (tuplestorestate == NULL && node->eflags != 0)
 	{
-		tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
+		tuplestorestate =
+			tuplestore_begin_heap(true, false, workMemLimit(node));
 		tuplestore_set_eflags(tuplestorestate, node->eflags);
 		if (node->eflags & EXEC_FLAG_MARK)
 		{
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 609deb12afb..4e3da4aab6b 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -1036,7 +1036,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
 	mstate->mem_used = 0;
 
 	/* Limit the total memory consumed by the cache to this */
-	mstate->mem_limit = get_hash_memory_limit();
+	mstate->mem_limit = (Size) workMemLimit(mstate) * 1024;
 
 	/* A memory context dedicated for the cache */
 	mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 40f66fd0680..5ffffd327d2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -33,6 +33,8 @@ build_hash_table(RecursiveUnionState *rustate)
 {
 	RecursiveUnion *node = (RecursiveUnion *) rustate->ps.plan;
 	TupleDesc	desc = ExecGetResultType(outerPlanState(rustate));
+	int			workmem_limit = workMemLimitFromId(rustate,
+												   node->hashWorkMemId);
 
 	Assert(node->numCols > 0);
 	Assert(node->numGroups > 0);
@@ -52,6 +54,7 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->dupCollations,
 											 node->numGroups,
 											 0,
+											 (Size) workmem_limit * 1024,
 											 rustate->ps.state->es_query_cxt,
 											 rustate->tableContext,
 											 rustate->tempContext,
@@ -202,8 +205,15 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/* initialize processing state */
 	rustate->recursing = false;
 	rustate->intermediate_empty = true;
-	rustate->working_table = tuplestore_begin_heap(false, false, work_mem);
-	rustate->intermediate_table = tuplestore_begin_heap(false, false, work_mem);
+
+	/*
+	 * NOTE: each of our working tables gets the same workmem_limit, since
+	 * we're going to swap them repeatedly.
+	 */
+	rustate->working_table = tuplestore_begin_heap(false, false,
+												   workMemLimit(rustate));
+	rustate->intermediate_table = tuplestore_begin_heap(false, false,
+														workMemLimit(rustate));
 
 	/*
 	 * If hashing, we need a per-tuple memory context for comparisons, and a
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4068481a523..0e2d02aa243 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -105,6 +105,7 @@ build_hash_table(SetOpState *setopstate)
 												node->cmpCollations,
 												node->numGroups,
 												sizeof(SetOpStatePerGroupData),
+												(Size) workMemLimit(setopstate) * 1024,
 												setopstate->ps.state->es_query_cxt,
 												setopstate->tableContext,
 												econtext->ecxt_per_tuple_memory,
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f603337ecd3..8ec939e25d7 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -107,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->sortOperators[0],
 												   plannode->collations[0],
 												   plannode->nullsFirst[0],
-												   work_mem,
+												   workMemLimit(pstate),
 												   NULL,
 												   tuplesortopts);
 		else
@@ -117,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->sortOperators,
 												  plannode->collations,
 												  plannode->nullsFirst,
-												  work_mem,
+												  workMemLimit(pstate),
 												  NULL,
 												  tuplesortopts);
 		if (node->bounded)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index f7f6fc2da0b..56036e79933 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -536,6 +536,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 	if (node->hashtable)
 		ResetTupleHashTable(node->hashtable);
 	else
+	{
+		int			workmem_limit;
+
+		workmem_limit = workMemLimitFromId(planstate,
+										   subplan->hashtab_workmem_id);
+
 		node->hashtable = BuildTupleHashTable(node->parent,
 											  node->descRight,
 											  &TTSOpsVirtual,
@@ -546,10 +552,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  node->tab_collations,
 											  nbuckets,
 											  0,
+											  (Size) workmem_limit * 1024,
 											  node->planstate->state->es_query_cxt,
 											  node->hashtablecxt,
 											  node->hashtempcxt,
 											  false);
+	}
 
 	if (!subplan->unknownEqFalse)
 	{
@@ -565,6 +573,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		if (node->hashnulls)
 			ResetTupleHashTable(node->hashnulls);
 		else
+		{
+			int			workmem_limit;
+
+			workmem_limit = workMemLimitFromId(planstate,
+											   subplan->hashnul_workmem_id);
+
 			node->hashnulls = BuildTupleHashTable(node->parent,
 												  node->descRight,
 												  &TTSOpsVirtual,
@@ -575,10 +589,12 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 												  node->tab_collations,
 												  nbuckets,
 												  0,
+												  (Size) workmem_limit * 1024,
 												  node->planstate->state->es_query_cxt,
 												  node->hashtablecxt,
 												  node->hashtempcxt,
 												  false);
+		}
 	}
 	else
 		node->hashnulls = NULL;
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 83ade3f9437..f679bd67bee 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -276,7 +276,8 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	/* build tuplestore for the result */
 	oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_query_memory);
-	tstate->tupstore = tuplestore_begin_heap(false, false, work_mem);
+	tstate->tupstore = tuplestore_begin_heap(false, false,
+											 workMemLimit(tstate));
 
 	/*
 	 * Each call to fetch a new set of rows - of which there may be very many
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 9a1acce2b5d..7660aa626b6 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1092,7 +1092,8 @@ prepare_tuplestore(WindowAggState *winstate)
 	Assert(winstate->buffer == NULL);
 
 	/* Create new tuplestore */
-	winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
+	winstate->buffer = tuplestore_begin_heap(false, false,
+											 workMemLimit(winstate));
 
 	/*
 	 * Set up read pointers for the tuplestore.  The current pointer doesn't
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 344a3188317..353f51fdff2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -102,6 +102,7 @@
 #include "optimizer/paths.h"
 #include "optimizer/placeholder.h"
 #include "optimizer/plancat.h"
+#include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
 #include "utils/lsyscache.h"
@@ -2833,7 +2834,8 @@ cost_agg(Path *path, PlannerInfo *root,
 		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
 											input_width,
 											aggcosts->transitionSpace);
-		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
+		hash_agg_set_limits(hashentrysize, numGroups, 0,
+							get_hash_memory_limit(), &mem_limit,
 							&ngroups_limit, &num_partitions);
 
 		nbatches = Max((numGroups * hashentrysize) / mem_limit,
@@ -4256,6 +4258,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
+							get_hash_memory_limit(),
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
@@ -4583,6 +4586,17 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		sp_cost.startup += plan->total_cost +
 			cpu_operator_cost * plan->plan_rows;
 
+		/*
+		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 */
+		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+
+		if (!subplan->unknownEqFalse)
+		{
+			/* Also needs a hashnulls table.  */
+			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+		}
+
 		/*
 		 * The per-tuple costs include the cost of evaluating the lefthand
 		 * expressions, plus the cost of probing the hashtable.  We already
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bfefc7dbea1..22834fe37f4 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1706,6 +1706,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1761,6 +1763,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -1907,6 +1911,8 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 best_path->path.rows,
 								 0,
 								 subplan);
+
+		plan->workmem_id = add_hash_workmem(root->glob);
 	}
 	else
 	{
@@ -2253,6 +2259,8 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2279,6 +2287,8 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2390,6 +2400,12 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	if (plan->aggstrategy == AGG_HASHED)
+		plan->plan.workmem_id = add_hash_workmem(root->glob);
+
+	/* Also include working memory needed to sort the input: */
+	plan->sortWorkMemId = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2443,6 +2459,7 @@ static Plan *
 create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 {
 	Agg		   *plan;
+	Agg		   *first_sort_agg = NULL;
 	Plan	   *subplan;
 	List	   *rollups = best_path->rollups;
 	AttrNumber *grouping_map;
@@ -2508,7 +2525,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			RollupData *rollup = lfirst(lc);
 			AttrNumber *new_grpColIdx;
 			Plan	   *sort_plan = NULL;
-			Plan	   *agg_plan;
+			Agg		   *agg_plan;
 			AggStrategy strat;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
@@ -2531,19 +2548,19 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			else
 				strat = AGG_SORTED;
 
-			agg_plan = (Plan *) make_agg(NIL,
-										 NIL,
-										 strat,
-										 AGGSPLIT_SIMPLE,
-										 list_length((List *) linitial(rollup->gsets)),
-										 new_grpColIdx,
-										 extract_grouping_ops(rollup->groupClause),
-										 extract_grouping_collations(rollup->groupClause, subplan->targetlist),
-										 rollup->gsets,
-										 NIL,
-										 rollup->numGroups,
-										 best_path->transitionSpace,
-										 sort_plan);
+			agg_plan = make_agg(NIL,
+								NIL,
+								strat,
+								AGGSPLIT_SIMPLE,
+								list_length((List *) linitial(rollup->gsets)),
+								new_grpColIdx,
+								extract_grouping_ops(rollup->groupClause),
+								extract_grouping_collations(rollup->groupClause, subplan->targetlist),
+								rollup->gsets,
+								NIL,
+								rollup->numGroups,
+								best_path->transitionSpace,
+								sort_plan);
 
 			/*
 			 * Remove stuff we don't need to avoid bloating debug output.
@@ -2554,6 +2571,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				sort_plan->lefttree = NULL;
 			}
 
+			if (agg_plan->aggstrategy == AGG_SORTED && !first_sort_agg)
+			{
+				/* This might be the first Sort agg. */
+				first_sort_agg = agg_plan;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2586,6 +2609,29 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
+
+		/*
+		 * NOTE: We will place the workmem needed to sort the input (if any)
+		 * on the first agg, the Hash workmem on the first Hash agg, and the
+		 * Sort workmem (if any) on the first Sort agg.
+		 */
+		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
+		{
+			/* All Hash Grouping Sets share the same workmem limit. */
+			plan->plan.workmem_id = add_hash_workmem(root->glob);
+		}
+		else if (plan->aggstrategy == AGG_SORTED)
+		{
+			/* Every Sort Grouping Set gets its own workmem limit. */
+			first_sort_agg = plan;
+		}
+
+		/* Store the workmem limit, for all Sorts, on the first Sort. */
+		if (first_sort_agg)
+			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+
+		/* Also include working memory needed to sort the input: */
+		plan->sortWorkMemId = add_workmem(root->glob);
 	}
 
 	return (Plan *) plan;
@@ -2750,6 +2796,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2790,6 +2838,8 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -2826,6 +2876,12 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	plan->plan.workmem_id = add_workmem(root->glob);
+
+	/* Also include working memory for hash table. */
+	if (plan->numCols > 0)
+		plan->hashWorkMemId = add_hash_workmem(root->glob);
+
 	return plan;
 }
 
@@ -3532,6 +3588,9 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
+
+		plan->workmem_id = add_workmem(root->glob);
+
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
 		subindexquals = NIL;
@@ -3839,6 +3898,8 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -3882,6 +3943,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -4020,6 +4083,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
+	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+
 	return scan_plan;
 }
 
@@ -4722,6 +4787,8 @@ create_mergejoin_plan(PlannerInfo *root,
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
 
+		matplan->workmem_id = add_workmem(root->glob);
+
 		inner_plan = matplan;
 	}
 
@@ -5067,6 +5134,9 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
+	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+
 	return join_plan;
 }
 
@@ -5619,6 +5689,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.plan_width = lefttree->plan_width;
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -5650,6 +5722,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.plan_width = lefttree->plan_width;
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
+
+	plan->sort.plan.workmem_id = add_workmem(root->glob);
 }
 
 /*
@@ -6701,14 +6775,14 @@ make_material(Plan *lefttree)
 
 /*
  * materialize_finished_plan: stick a Material node atop a completed plan
  *
  * There are a couple of places where we want to attach a Material node
  * after completion of create_plan(), without any MaterialPath path.
  * Those places should probably be refactored someday to do this on the
  * Path representation, but it's not worth the trouble yet.
  */
 Plan *
-materialize_finished_plan(Plan *subplan)
+materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 {
 	Plan	   *matplan;
 	Path		matpath;		/* dummy for result of cost_material */
@@ -6747,6 +6821,8 @@ materialize_finished_plan(Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
+	matplan->workmem_id = add_workmem(glob);
+
 	return matplan;
 }
 
@@ -7512,3 +7588,41 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+static int
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+{
+	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	/* the executor will fill this in later: */
+	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
+
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemLimits));
+
+	return list_length(glob->workMemCategories);
+}
+
+/*
+ * add_workmem
+ *	  Add (non-hash) workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem.
+ */
+int
+add_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_NORMAL);
+}
+
+/*
+ * add_hash_workmem
+ *	  Add hash workmem info to the glob's lists
+ *
+ * This data structure will have its working-memory limit set to work_mem *
+ * hash_mem_multiplier.
+ */
+int
+add_hash_workmem(PlannerGlobal *glob)
+{
+	return add_workmem_internal(glob, WORKMEM_HASH);
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d59d6e4c6a0..a431808be96 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -447,7 +447,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	if (cursorOptions & CURSOR_OPT_SCROLL)
 	{
 		if (!ExecSupportsBackwardScan(top_plan))
-			top_plan = materialize_finished_plan(top_plan);
+			top_plan = materialize_finished_plan(glob, top_plan);
 	}
 
 	/*
@@ -584,6 +584,9 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_location = parse->stmt_location;
 	result->stmt_len = parse->stmt_len;
 
+	result->workMemCategories = glob->workMemCategories;
+	result->workMemLimits = glob->workMemLimits;
+
 	result->jitFlags = PGJIT_NONE;
 	if (jit_enabled && jit_above_cost >= 0 &&
 		top_plan->total_cost > jit_above_cost)
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d71ed958e31..8bc99aa8bc1 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -533,7 +533,7 @@ build_subplan(PlannerInfo *root, Plan *plan, Path *path,
 		 */
 		else if (splan->parParam == NIL && enable_material &&
 				 !ExecMaterializesOutput(nodeTag(plan)))
-			plan = materialize_finished_plan(plan);
+			plan = materialize_finished_plan(root->glob, plan);
 
 		result = (Node *) splan;
 		isInitPlan = false;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a71502efeed..6008e3bc63c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -140,6 +140,7 @@ extern TupleHashTable BuildTupleHashTable(PlanState *parent,
 										  Oid *collations,
 										  long nbuckets,
 										  Size additionalsize,
+										  Size hash_mem_limit,
 										  MemoryContext metacxt,
 										  MemoryContext tablecxt,
 										  MemoryContext tempcxt,
@@ -559,6 +560,7 @@ extern Tuplestorestate *ExecMakeTableFunctionResult(SetExprState *setexpr,
 													ExprContext *econtext,
 													MemoryContext argContext,
 													TupleDesc expectedDesc,
+													int workMem,
 													bool randomAccess);
 extern SetExprState *ExecInitFunctionResultSet(Expr *expr,
 											   ExprContext *econtext, PlanState *parent);
@@ -796,4 +798,9 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes from functions in execWorkmem.c
+ */
+extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index ecff4842fd3..9b184c47322 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -253,7 +253,8 @@ typedef struct ParallelHashJoinState
 	ParallelHashGrowth growth;	/* control batch/bucket growth */
 	dsa_pointer chunk_work_queue;	/* chunk work queue */
 	int			nparticipants;
-	size_t		space_allowed;
+	size_t		space_allowed;	/* -- might be shared with other workers */
+	size_t		worker_space_allowed;	/* -- exclusive to this worker */
 	size_t		total_tuples;	/* total number of inner tuples */
 	LWLock		lock;			/* lock protecting the above */
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..fd8ed34178f 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -329,7 +329,8 @@ extern void ExecReScanAgg(AggState *node);
 extern Size hash_agg_entry_size(int numTrans, Size tupleWidth,
 								Size transitionSpace);
 extern void hash_agg_set_limits(double hashentrysize, double input_groups,
-								int used_bits, Size *mem_limit,
+								int used_bits,
+								Size hash_mem_limit, Size *mem_limit,
 								uint64 *ngroups_limit, int *num_partitions);
 
 /* parallel instrumentation support */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 3c1a09415aa..e4e9e0d1de1 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,8 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t *space_allowed,
+									size_t worker_space_allowed,
+									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
 									int *num_skew_mcvs);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f81..d543011d92a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1276,6 +1276,19 @@ typedef struct PlanState
 			((PlanState *)(node))->instrument->nfiltered2 += (delta); \
 	} while(0)
 
+/* macros for fetching the workmem info associated with a PlanState */
+#define workMemFieldFromId(node, field, id)								\
+	(list_nth_int(((PlanState *)(node))->state->es_plannedstmt->field, \
+				  (id) - 1))
+#define workMemField(node, field)   \
+	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
+
+/* workmem limit: */
+#define workMemLimitFromId(node, id) \
+	(workMemFieldFromId(node, workMemLimits, id))
+#define workMemLimit(node) \
+	(workMemField(node, workMemLimits))
+
 /*
  * EPQState is state for executing an EvalPlanQual recheck on a candidate
  * tuples e.g. in ModifyTable or LockRows.
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index ad2726f026f..181437ac933 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -182,6 +182,17 @@ typedef struct PlannerGlobal
 
 	/* hash table for NOT NULL attnums of relations */
 	struct HTAB *rel_notnullatts_hash pg_node_attr(read_write_ignore);
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannerGlobal;
 
 /* macro for fetching the Plan associated with a SubPlan node */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 29d7732d6a0..ba8fdc2e6db 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -154,13 +154,23 @@ typedef struct PlannedStmt
 	ParseLoc	stmt_location;
 	/* length in bytes; 0 means "rest of string" */
 	ParseLoc	stmt_len;
+
+	/*
+	 * Working-memory info, for Plan and SubPlans. Any Plan or SubPlan that
+	 * needs working memory for a data structure maintains a "workmem_id"
+	 * index into the following lists (all kept in sync).
+	 */
+
+	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
+	List	   *workMemCategories;
+	/* - IntList: limit (in KB), after which data structure must spill */
+	List	   *workMemLimits;
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
 #define exec_subplan_get_plan(plannedstmt, subplan) \
 	((Plan *) list_nth((plannedstmt)->subplans, (subplan)->plan_id - 1))
 
-
 /* ----------------
  *		Plan node
  *
@@ -216,6 +226,8 @@ typedef struct Plan
 	 */
 	/* unique across entire final plan tree */
 	int			plan_node_id;
+	/* 1-based id of workMem to use, or else zero */
+	int			workmem_id;
 	/* target list to be computed at this node */
 	List	   *targetlist;
 	/* implicitly-ANDed qual conditions */
@@ -447,6 +459,9 @@ typedef struct RecursiveUnion
 
 	/* estimated number of groups in input */
 	long		numGroups;
+
+	/* 1-based id of workMem to use for hash table, or else zero */
+	int			hashWorkMemId;
 } RecursiveUnion;
 
 /* ----------------
@@ -1176,6 +1191,9 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* 1-based id of workMem to use to sort inputs, or else zero */
+	int			sortWorkMemId;
+
 	/* estimated number of groups in input */
 	long		numGroups;
 
@@ -1792,4 +1810,11 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+/* different data structures get different working-memory limits */
+typedef enum WorkMemCategory
+{
+	WORKMEM_NORMAL,				/* gets work_mem */
+	WORKMEM_HASH,				/* gets hash_mem_multiplier * work_mem */
+}			WorkMemCategory;
+
 #endif							/* PLANNODES_H */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 6dfca3cb35b..c55b3cb356e 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -1111,6 +1111,9 @@ typedef struct SubPlan
 	/* Estimated execution costs: */
 	Cost		startup_cost;	/* one-time setup cost */
 	Cost		per_call_cost;	/* cost for each subplan evaluation */
+	/* 1-based id of workMem to use, or else zero: */
+	int			hashtab_workmem_id; /* for hash table */
+	int			hashnul_workmem_id; /* for NULLs hash table */
 } SubPlan;
 
 /*
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 9d3debcab28..8436136026b 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -46,9 +46,11 @@ extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
 									 Plan *outer_plan);
 extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 									bool tlist_parallel_safe);
-extern Plan *materialize_finished_plan(Plan *subplan);
+extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
+extern int	add_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.39.5

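The bookkeeping the first patch introduces can be summarized as: the planner registers each memory-hungry data structure in two parallel IntLists (`workMemCategories`, `workMemLimits`), handing out a 1-based id; the executor later fills in the limits (by default `work_mem`, or `work_mem * hash_mem_multiplier` for hash structures); and each node reads its limit back through the `workMemLimit` macros. A minimal standalone model of that flow — plain arrays stand in for IntLists, and the function bodies here are illustrative assumptions, not the patch's code:

```c
#include <assert.h>

/*
 * Simplified model of the patch's workmem-id bookkeeping.
 * WORKMEM_NORMAL / WORKMEM_HASH mirror the patch's WorkMemCategory enum;
 * everything else (array storage, fixed capacity) is an assumption made
 * to keep the sketch self-contained.
 */
#define WORKMEM_NORMAL	0
#define WORKMEM_HASH	1
#define MAX_WORKMEM_IDS 64

static int	workMemCategories[MAX_WORKMEM_IDS];
static int	workMemLimits[MAX_WORKMEM_IDS]; /* in KB; executor fills in */
static int	nWorkMemIds = 0;

/* Planner side: register a data structure, returning its 1-based id. */
static int
add_workmem_internal(int category)
{
	workMemCategories[nWorkMemIds] = category;
	workMemLimits[nWorkMemIds] = 0; /* not yet assigned */
	return ++nWorkMemIds;
}

/*
 * Executor side: assign per-structure limits.  This models the default
 * policy; the point of the design is that a hook could instead distribute
 * a single query-wide budget across these slots.
 */
static void
assign_workmem(int work_mem_kb, double hash_mem_multiplier)
{
	for (int i = 0; i < nWorkMemIds; i++)
		workMemLimits[i] = (workMemCategories[i] == WORKMEM_HASH) ?
			(int) (work_mem_kb * hash_mem_multiplier) : work_mem_kb;
}

/* Node side: look up a limit by 1-based id, as workMemLimitFromId() does. */
static int
workmem_limit_from_id(int id)
{
	return workMemLimits[id - 1];
}
```

Keeping the limits in lists on the PlannedStmt, rather than baking them into each node, is what would let a future ExecAssignWorkMem() override redistribute memory at execution time without replanning.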
Attachment: 0002-Add-workmem-estimates-to-Path-node-and-PlannedStmt.patch (text/x-diff; charset=utf-8)
From b4af98013dddfa8f56c14b3d6a0b667b9faefece Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Tue, 4 Mar 2025 23:03:19 +0000
Subject: [PATCH 2/4] Add "workmem" estimates to Path node and PlannedStmt

To allow for future optimizers to make decisions at Path time, this commit
aggregates the Path's total working memory onto the Path's "workmem" field,
normalized to a minimum of 64 KB and rounded up to the next whole KB.

To allow future hooks to override ExecAssignWorkMem(), this commit then
breaks that total working memory into per-data structure working memory,
and stores it, next to the workMemLimit, on the PlannedStmt.
---
 src/backend/executor/execParallel.c     |   2 +
 src/backend/executor/nodeHash.c         |  32 +-
 src/backend/nodes/tidbitmap.c           |  18 ++
 src/backend/optimizer/path/costsize.c   | 406 ++++++++++++++++++++++--
 src/backend/optimizer/plan/createplan.c | 267 +++++++++++++---
 src/backend/optimizer/plan/planner.c    |   2 +
 src/backend/optimizer/prep/prepagg.c    |  12 +
 src/backend/optimizer/util/pathnode.c   |  53 +++-
 src/include/executor/nodeHash.h         |   3 +-
 src/include/nodes/execnodes.h           |  12 +
 src/include/nodes/pathnodes.h           |  10 +-
 src/include/nodes/plannodes.h           |   7 +-
 src/include/nodes/tidbitmap.h           |   1 +
 src/include/optimizer/cost.h            |  13 +-
 src/include/optimizer/planmain.h        |   3 +-
 15 files changed, 762 insertions(+), 79 deletions(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a8cb631963e..5c90a29d7d1 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -217,6 +217,8 @@ ExecSerializePlan(Plan *plan, EState *estate)
 	pstmt->stmt_location = -1;
 	pstmt->stmt_len = -1;
 	pstmt->workMemCategories = estate->es_plannedstmt->workMemCategories;
+	pstmt->workMemEstimates = estate->es_plannedstmt->workMemEstimates;
+	pstmt->workMemCounts = estate->es_plannedstmt->workMemCounts;
 	pstmt->workMemLimits = estate->es_plannedstmt->workMemLimits;
 
 	/* Return serialized copy of our dummy PlannedStmt. */
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index bb9af08dc5d..7d09ac8b5a3 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -453,6 +454,7 @@ ExecHashTableCreate(HashState *state)
 	int			nbuckets;
 	int			nbatch;
 	double		rows;
+	int			workmem;		/* ignored */
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	MemoryContext oldcxt;
@@ -482,7 +484,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state->nparticipants - 1 : 0,
 							worker_space_allowed,
 							&space_allowed,
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
 	/* nbuckets must be a power of 2 */
 	log2_nbuckets = my_log2(nbuckets);
@@ -668,7 +670,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
-						int *num_skew_mcvs)
+						int *num_skew_mcvs,
+						int *workmem)
 {
 	int			tupsize;
 	double		inner_rel_bytes;
@@ -769,6 +772,27 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		*num_skew_mcvs = 0;
 
 	/*
+	 * Set "workmem" to the amount of memory needed to hold the inner rel in a
+	 * single batch. So this calculation doesn't care about "max_pointers".
+	 */
+	dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
+	nbuckets = (int) dbuckets;
+	/* don't let nbuckets be really small, though ... */
+	nbuckets = Max(nbuckets, 1024);
+	/* ... and force it to be a power of 2. */
+	nbuckets = pg_nextpower2_32(nbuckets);
+	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/* Don't forget the 2% overhead reserved for skew buckets! */
+	*workmem = useskew ?
+		normalize_work_bytes((inner_rel_bytes + bucket_bytes) *
+							 100.0 / (100.0 - SKEW_HASH_MEM_PERCENT)) :
+		normalize_work_bytes(inner_rel_bytes + bucket_bytes);
+
+	/*
+	 * Now redo the nbuckets and bucket_bytes calculations, taking memory
+	 * limits into account.
+	 *
 	 * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
 	 * memory is filled, assuming a single batch; but limit the value so that
 	 * the pointer arrays we'll try to allocate do not exceed hash_table_bytes
@@ -799,6 +823,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * the required bucket headers, we will need multiple batches.
 	 */
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+
 	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
@@ -819,7 +844,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									total_space_allowed,
 									numbuckets,
 									numbatches,
-									num_skew_mcvs);
+									num_skew_mcvs,
+									workmem);
 			return;
 		}
 
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 41031aa8f2f..425333b0218 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -1560,6 +1560,24 @@ tbm_calculate_entries(Size maxbytes)
 	return (int) nbuckets;
 }
 
+/*
+ * tbm_calculate_bytes
+ *
+ * Estimate number of bytes needed to store maxentries hashtable entries.
+ *
+ * This function is the inverse of tbm_calculate_entries(), and is used to
+ * estimate a work_mem limit, based on cardinality.
+ */
+double
+tbm_calculate_bytes(double maxentries)
+{
+	maxentries = Min(maxentries, INT_MAX - 1);	/* safety limit */
+	maxentries = Max(maxentries, 16);	/* sanity limit */
+
+	return maxentries * (sizeof(PagetableEntry) + sizeof(Pointer) +
+						 sizeof(Pointer));
+}
+
 /*
  * Create a shared or private bitmap iterator and start iteration.
  *
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 353f51fdff2..27daa1966c2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -105,6 +105,7 @@
 #include "optimizer/planmain.h"
 #include "optimizer/restrictinfo.h"
 #include "parser/parsetree.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "utils/spccache.h"
@@ -201,9 +202,14 @@ static Cost append_nonpartial_cost(List *subpaths, int numpaths,
 								   int parallel_workers);
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static int32 get_expr_width(PlannerInfo *root, const Node *expr);
-static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
+static void compute_sort_output_sizes(double input_tuples, int input_width,
+									  double limit_tuples,
+									  double *output_tuples,
+									  double *output_bytes);
+static double compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+									 Cardinality max_ancestor_rows);
 
 
 /*
@@ -1113,6 +1119,17 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	path->disabled_nodes = enable_bitmapscan ? 0 : 1;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Set an overall working-memory estimate for the entire BitmapHeapPath --
+	 * including all of the IndexPaths and BitmapOrPaths in its bitmapqual.
+	 *
+	 * (When we convert this path into a BitmapHeapScan plan, we'll break this
+	 * overall estimate down into per-node estimates, just as we do for
+	 * AggPaths.)
+	 */
+	path->workmem = compute_bitmap_workmem(baserel, bitmapqual,
+										   0.0 /* max_ancestor_rows */ );
 }
 
 /*
@@ -1588,6 +1606,16 @@ cost_functionscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem = list_length(rte->functions) *
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1645,6 +1673,16 @@ cost_tablefuncscan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Per "XXX" comment above, this workmem estimate is likely to be wrong,
+	 * because the "rows" estimate is pretty phony. Report the estimate
+	 * anyway, for completeness. (This is at least better than saying it won't
+	 * use *any* working memory.)
+	 */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1741,6 +1779,9 @@ cost_ctescan(Path *path, PlannerInfo *root,
 	path->disabled_nodes = 0;
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(path->rows,
+												path->pathtarget->width));
 }
 
 /*
@@ -1824,7 +1865,7 @@ cost_resultscan(Path *path, PlannerInfo *root,
  * We are given Paths for the nonrecursive and recursive terms.
  */
 void
-cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
+cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -1851,12 +1892,37 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 	 */
 	total_cost += cpu_tuple_cost * total_rows;
 
-	runion->disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
-	runion->startup_cost = startup_cost;
-	runion->total_cost = total_cost;
-	runion->rows = total_rows;
-	runion->pathtarget->width = Max(nrterm->pathtarget->width,
-									rterm->pathtarget->width);
+	runion->path.disabled_nodes = nrterm->disabled_nodes + rterm->disabled_nodes;
+	runion->path.startup_cost = startup_cost;
+	runion->path.total_cost = total_cost;
+	runion->path.rows = total_rows;
+	runion->path.pathtarget->width = Max(nrterm->pathtarget->width,
+										 rterm->pathtarget->width);
+
+	/*
+	 * Include memory for working and intermediate tables. Since we'll
+	 * repeatedly swap the two tables, use 2x whichever is larger as our
+	 * estimate.
+	 */
+	runion->path.workmem =
+		normalize_work_bytes(
+							 Max(relation_byte_size(nrterm->rows,
+													nrterm->pathtarget->width),
+								 relation_byte_size(rterm->rows,
+													rterm->pathtarget->width))
+							 * 2);
+
+	if (list_length(runion->distinctList) > 0)
+	{
+		/* Also include memory for hash table. */
+		Size		hashentrysize;
+
+		hashentrysize = MAXALIGN(runion->path.pathtarget->width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		runion->path.workmem +=
+			normalize_work_bytes(runion->numGroups * hashentrysize);
+	}
 }
 
 /*
@@ -1896,7 +1962,7 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
  */
 static void
-cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+cost_tuplesort(Cost *startup_cost, Cost *run_cost, Cost *nbytes,
 			   double tuples, int width,
 			   Cost comparison_cost, int sort_mem,
 			   double limit_tuples)
@@ -1916,17 +1982,8 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	/* Include the default cost-per-comparison */
 	comparison_cost += 2.0 * cpu_operator_cost;
 
-	/* Do we have a useful LIMIT? */
-	if (limit_tuples > 0 && limit_tuples < tuples)
-	{
-		output_tuples = limit_tuples;
-		output_bytes = relation_byte_size(output_tuples, width);
-	}
-	else
-	{
-		output_tuples = tuples;
-		output_bytes = input_bytes;
-	}
+	compute_sort_output_sizes(tuples, width, limit_tuples,
+							  &output_tuples, &output_bytes);
 
 	if (output_bytes > sort_mem_bytes)
 	{
@@ -1983,6 +2040,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 	 * counting the LIMIT otherwise.
 	 */
 	*run_cost = cpu_operator_cost * tuples;
+	*nbytes = output_bytes;
 }
 
 /*
@@ -2012,6 +2070,7 @@ cost_incremental_sort(Path *path,
 				input_groups;
 	Cost		group_startup_cost,
 				group_run_cost,
+				group_nbytes,
 				group_input_run_cost;
 	List	   *presortedExprs = NIL;
 	ListCell   *l;
@@ -2086,7 +2145,7 @@ cost_incremental_sort(Path *path,
 	 * Estimate the average cost of sorting of one group where presorted keys
 	 * are equal.
 	 */
-	cost_tuplesort(&group_startup_cost, &group_run_cost,
+	cost_tuplesort(&group_startup_cost, &group_run_cost, &group_nbytes,
 				   group_tuples, width, comparison_cost, sort_mem,
 				   limit_tuples);
 
@@ -2127,6 +2186,14 @@ cost_incremental_sort(Path *path,
 
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+
+	/*
+	 * Incremental sort switches between two Tuplesortstates: one that sorts
+	 * all columns ("full"), and one that sorts only suffix columns ("prefix").
+	 * We'll assume they're both around the same size: large enough to hold
+	 * one sort group.
+	 */
+	path->workmem = normalize_work_bytes(group_nbytes * 2.0);
 }
 
 /*
@@ -2151,8 +2218,9 @@ cost_sort(Path *path, PlannerInfo *root,
 {
 	Cost		startup_cost;
 	Cost		run_cost;
+	Cost		nbytes;
 
-	cost_tuplesort(&startup_cost, &run_cost,
+	cost_tuplesort(&startup_cost, &run_cost, &nbytes,
 				   tuples, width,
 				   comparison_cost, sort_mem,
 				   limit_tuples);
@@ -2163,6 +2231,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2549,6 +2618,7 @@ cost_material(Path *path,
 	path->disabled_nodes = input_disabled_nodes + (enable_material ? 0 : 1);
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
+	path->workmem = normalize_work_bytes(nbytes);
 }
 
 /*
@@ -2622,6 +2692,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	/* Remember the ndistinct estimate for EXPLAIN */
 	mpath->est_unique_keys = ndistinct;
 
+	/* How much working memory would we need, to store every distinct tuple? */
+	mpath->path.workmem = normalize_work_bytes(ndistinct * est_entry_bytes);
+
 	/*
 	 * Since we've already estimated the maximum number of entries we can
 	 * store at once and know the estimated number of distinct values we'll be
@@ -2899,6 +2972,19 @@ cost_agg(Path *path, PlannerInfo *root,
 	path->disabled_nodes = disabled_nodes;
 	path->startup_cost = startup_cost;
 	path->total_cost = total_cost;
+
+	/* Include memory needed to produce output. */
+	path->workmem =
+		compute_agg_output_workmem(root, aggstrategy, numGroups,
+								   aggcosts->transitionSpace, input_tuples,
+								   input_width, false /* cost_sort */ );
+
+	/* Also include memory needed to sort inputs (if needed): */
+	if (aggcosts->numSortBuffers > 0)
+	{
+		path->workmem += (double) aggcosts->numSortBuffers *
+			compute_agg_input_workmem(input_tuples, input_width);
+	}
 }
 
 /*
@@ -3133,7 +3219,7 @@ cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, WindowClause *winclause,
 			   int input_disabled_nodes,
 			   Cost input_startup_cost, Cost input_total_cost,
-			   double input_tuples)
+			   double input_tuples, int width)
 {
 	Cost		startup_cost;
 	Cost		total_cost;
@@ -3215,6 +3301,10 @@
 	if (startup_tuples > 1.0)
 		path->startup_cost += (total_cost - startup_cost) / input_tuples *
 			(startup_tuples - 1.0);
+
+	/* We need to store a window of size "startup_tuples", in a Tuplestore. */
+	path->workmem =
+		normalize_work_bytes(relation_byte_size(startup_tuples, width));
 }
 
 /*
@@ -3369,6 +3460,7 @@ initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost;
 	/* Save private data for final_cost_nestloop */
 	workspace->run_cost = run_cost;
+	workspace->workmem = 0;
 }
 
 /*
@@ -3833,6 +3925,14 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->total_cost = startup_cost + run_cost + inner_run_cost;
 	/* Save private data for final_cost_mergejoin */
 	workspace->run_cost = run_cost;
+
+	/*
+	 * By itself, Merge Join requires no working memory. If it adds one or
+	 * more Sort or Material nodes, we'll track their working memory when we
+	 * create them, inside createplan.c.
+	 */
+	workspace->workmem = 0;
+
 	workspace->inner_run_cost = inner_run_cost;
 	workspace->outer_rows = outer_rows;
 	workspace->inner_rows = inner_rows;
@@ -4204,6 +4304,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	double		outer_path_rows = outer_path->rows;
 	double		inner_path_rows = inner_path->rows;
 	double		inner_path_rows_total = inner_path_rows;
+	int			workmem;
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
@@ -4262,7 +4363,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
-							&num_skew_mcvs);
+							&num_skew_mcvs,
+							&workmem);
 
 	/*
 	 * If inner relation is too big then we will need to "batch" the join,
@@ -4293,6 +4395,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	workspace->numbuckets = numbuckets;
 	workspace->numbatches = numbatches;
 	workspace->inner_rows_total = inner_path_rows_total;
+	workspace->workmem = workmem;
 }
 
 /*
@@ -4301,8 +4404,8 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
  *
  * Note: the numbatches estimate is also saved into 'path' for use later
  *
- * 'path' is already filled in except for the rows and cost fields and
- *		num_batches
+ * 'path' is already filled in except for the rows and cost fields,
+ *		num_batches, and workmem
  * 'workspace' is the result from initial_cost_hashjoin
  * 'extra' contains miscellaneous information about the join
  */
@@ -4319,6 +4422,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	List	   *hashclauses = path->path_hashclauses;
 	Cost		startup_cost = workspace->startup_cost;
 	Cost		run_cost = workspace->run_cost;
+	int			workmem = workspace->workmem;
 	int			numbuckets = workspace->numbuckets;
 	int			numbatches = workspace->numbatches;
 	Cost		cpu_per_tuple;
@@ -4555,6 +4659,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 
 	path->jpath.path.startup_cost = startup_cost;
 	path->jpath.path.total_cost = startup_cost + run_cost;
+	path->jpath.path.workmem = workmem;
 }
 
 
@@ -4577,6 +4682,9 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 	if (subplan->useHashTable)
 	{
+		long		nbuckets;
+		Size		hashentrysize;
+
 		/*
 		 * If we are using a hash table for the subquery outputs, then the
 		 * cost of evaluating the query is a one-time cost.  We charge one
@@ -4588,13 +4696,37 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 
 		/*
 		 * Working memory needed for the hashtable (and hashnulls, if needed).
+		 * The logic below MUST match the logic in buildSubPlanHash() and
+		 * ExecInitSubPlan().
 		 */
-		subplan->hashtab_workmem_id = add_hash_workmem(root->glob);
+		nbuckets = clamp_cardinality_to_long(plan->plan_rows);
+		if (nbuckets < 1)
+			nbuckets = 1;
+
+		hashentrysize = MAXALIGN(plan->plan_width) +
+			MAXALIGN(SizeofMinimalTupleHeader);
+
+		subplan->hashtab_workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes((double) nbuckets *
+												  hashentrysize));
 
 		if (!subplan->unknownEqFalse)
 		{
 			/* Also needs a hashnulls table.  */
-			subplan->hashnul_workmem_id = add_hash_workmem(root->glob);
+			if (IsA(subplan->testexpr, OpExpr))
+				nbuckets = 1;	/* there can be only one entry */
+			else
+			{
+				nbuckets /= 16;
+				if (nbuckets < 1)
+					nbuckets = 1;
+			}
+
+			subplan->hashnul_workmem_id =
+				add_hash_workmem(root->glob,
+								 normalize_work_bytes((double) nbuckets *
+													  hashentrysize));
 		}
 
 		/*
@@ -6481,7 +6613,7 @@ get_expr_width(PlannerInfo *root, const Node *expr)
  *	  Estimate the storage space in bytes for a given number of tuples
  *	  of a given width (size in bytes).
  */
-static double
+double
 relation_byte_size(double tuples, int width)
 {
 	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
@@ -6660,3 +6792,219 @@ compute_gather_rows(Path *path)
 
 	return clamp_row_est(path->rows * get_parallel_divisor(path));
 }
+
+/*
+ * compute_sort_output_sizes
+ *	  Estimate amount of memory and rows needed to hold a Sort operator's output
+ */
+static void
+compute_sort_output_sizes(double input_tuples, int input_width,
+						  double limit_tuples,
+						  double *output_tuples, double *output_bytes)
+{
+	/*
+	 * We want to be sure the cost of a sort is never estimated as zero, even
+	 * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+	 */
+	if (input_tuples < 2.0)
+		input_tuples = 2.0;
+
+	/* Do we have a useful LIMIT? */
+	if (limit_tuples > 0 && limit_tuples < input_tuples)
+		*output_tuples = limit_tuples;
+	else
+		*output_tuples = input_tuples;
+
+	*output_bytes = relation_byte_size(*output_tuples, input_width);
+}
+
+/*
+ * compute_agg_input_workmem
+ *	  Estimate memory (in KB) needed to hold a sort buffer for an aggregate's
+ *	  input
+ *
+ * Some aggregates involve DISTINCT or ORDER BY, so they need to sort their
+ * input, before they can process it. We need one sort buffer per such
+ * aggregate, and this function returns that sort buffer's (estimated) size (in
+ * KB).
+ */
+int
+compute_agg_input_workmem(double input_tuples, double input_width)
+{
+	double		output_tuples;	/* ignored */
+	double		output_bytes;
+
+	/* Account for size of one buffer needed to sort the input. */
+	compute_sort_output_sizes(input_tuples, input_width,
+							  0.0 /* limit_tuples */ ,
+							  &output_tuples, &output_bytes);
+	return normalize_work_bytes(output_bytes);
+}
+
+/*
+ * compute_agg_output_workmem
+ *	  Estimate amount of memory needed (in KB) to hold an aggregate's output
+ *
+ * In a Hash aggregate, we need space for the hash table that holds the
+ * aggregated data.
+ *
+ * Sort aggregates require output space only if they are part of a Grouping
+ * Sets chain: the first aggregate writes to its "sort_out" buffer, which the
+ * second aggregate uses as its "sort_in" buffer, and sorts.
+ *
+ * In the latter case, the "Path" code already costs the sort by calling
+ * cost_sort(), so it passes "cost_sort = false" to this function, to avoid
+ * double-counting.
+ */
+int
+compute_agg_output_workmem(PlannerInfo *root, AggStrategy aggstrategy,
+						   double numGroups, uint64 transitionSpace,
+						   double input_tuples, double input_width,
+						   bool cost_sort)
+{
+	/* Account for size of hash table to hold the output. */
+	if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
+	{
+		double		hashentrysize;
+
+		hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
+											input_width, transitionSpace);
+		return normalize_work_bytes(numGroups * hashentrysize);
+	}
+
+	/* Account for the size of the "sort_out" buffer. */
+	if (cost_sort && aggstrategy == AGG_SORTED)
+	{
+		double		output_tuples;	/* ignored */
+		double		output_bytes;
+
+		Assert(aggstrategy == AGG_SORTED);
+
+		compute_sort_output_sizes(numGroups, input_width,
+								  0.0 /* limit_tuples */ ,
+								  &output_tuples, &output_bytes);
+		return normalize_work_bytes(output_bytes);
+	}
+
+	return 0;
+}
+
+/*
+ * compute_bitmap_workmem
+ *	  Estimate total working memory (in KB) needed by bitmapqual
+ *
+ * Although we don't fill in the workmem_est or rows fields on the bitmapqual's
+ * paths, we fill them in on the owning BitmapHeapPath. This function estimates
+ * the total work_mem needed by all BitmapOrPaths and IndexPaths inside
+ * bitmapqual.
+ */
+static double
+compute_bitmap_workmem(RelOptInfo *baserel, Path *bitmapqual,
+					   Cardinality max_ancestor_rows)
+{
+	double		workmem = 0.0;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * baserel->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
+
+	if (IsA(bitmapqual, BitmapAndPath))
+	{
+		BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, apath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, BitmapOrPath))
+	{
+		BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
+		ListCell   *l;
+
+		foreach(l, opath->bitmapquals)
+		{
+			workmem +=
+				compute_bitmap_workmem(baserel, (Path *) lfirst(l),
+									   foreach_current_index(l) == 0 ?
+									   max_ancestor_rows : 0.0);
+		}
+	}
+	else if (IsA(bitmapqual, IndexPath))
+	{
+		/* Working memory needed for 1 TID bitmap. */
+		workmem +=
+			normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows));
+	}
+
+	return workmem;
+}
+
+/*
+ * normalize_work_kb
+ *	  Convert a double, "KB" working-memory estimate to an int, "KB" value
+ *
+ * Normalizes non-zero input to a minimum of 64 (KB), rounding up to the
+ * nearest whole KB.
+ */
+int
+normalize_work_kb(double nkb)
+{
+	double		workmem;
+
+	if (nkb == 0.0)
+		return 0;				/* caller apparently doesn't need any workmem */
+
+	/*
+	 * We'll assign working-memory to SQL operators in 1 KB increments, so
+	 * round up to the next whole KB.
+	 */
+	workmem = ceil(nkb);
+
+	/*
+	 * Although some components can probably work with < 64 KB of working
+	 * memory, PostgreSQL has imposed a hard minimum of 64 KB on the
+	 * "work_mem" GUC, for a long time; so, by now, some components probably
+	 * rely on this minimum, implicitly, and would fail if we tried to assign
+	 * them < 64 KB.
+	 *
+	 * Perhaps this minimum can be relaxed, in the future; but memory sizes
+	 * keep increasing, and right now the minimum of 64 KB = 1.6 percent of
+	 * the default "work_mem" of 4 MB.
+	 *
+	 * So, even with this (overly?) cautious normalization, with the default
+	 * GUC settings, we can still achieve a working-memory reduction of
+	 * 64-to-1.
+	 */
+	workmem = Max((double) 64, workmem);
+
+	/* And clamp to MAX_KILOBYTES. */
+	workmem = Min(workmem, (double) MAX_KILOBYTES);
+
+	return (int) workmem;
+}
+
+/*
+ * normalize_work_bytes
+ *	  Convert a double, "bytes" working-memory estimate to an int, "KB" value
+ *
+ * Same as above, but takes input in bytes rather than in KB.
+ */
+int
+normalize_work_bytes(double nbytes)
+{
+	return normalize_work_kb(nbytes / 1024.0);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 22834fe37f4..aba15d54fa1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -130,6 +130,7 @@ static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
 											   BitmapHeapPath *best_path,
 											   List *tlist, List *scan_clauses);
 static Plan *create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+								   Cardinality max_ancestor_rows,
 								   List **qual, List **indexqual, List **indexECs);
 static void bitmap_subplan_mark_shared(Plan *plan);
 static TidScan *create_tidscan_plan(PlannerInfo *root, TidPath *best_path,
@@ -319,6 +320,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
+static int	add_workmem(PlannerGlobal *glob, int estimate);
+static int	add_workmems(PlannerGlobal *glob, int estimate, int count);
 
 
 /*
@@ -1706,7 +1709,8 @@ create_material_plan(PlannerInfo *root, MaterialPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1763,7 +1767,9 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -1912,7 +1918,9 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
 								 0,
 								 subplan);
 
-		plan->workmem_id = add_hash_workmem(root->glob);
+		plan->workmem_id =
+			add_hash_workmem(root->glob,
+							 normalize_work_kb(best_path->path.workmem));
 	}
 	else
 	{
@@ -2259,7 +2267,9 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob,
+					normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2287,7 +2297,13 @@ create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
 
 	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * IncrementalSort creates two sort buffers, which the Path's "workmem"
+	 * estimate combined into a single value. Split it into two now.
+	 */
+	plan->sort.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_kb(best_path->spath.path.workmem / 2), 2);
 
 	return plan;
 }
@@ -2400,11 +2416,32 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
+	/*
+	 * Replace the AggPath's overall workmem estimate with finer-grained
+	 * estimates.
+	 */
 	if (plan->aggstrategy == AGG_HASHED)
-		plan->plan.workmem_id = add_hash_workmem(root->glob);
+	{
+		int			workmem =
+			compute_agg_output_workmem(root, AGG_HASHED,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   false /* cost_sort */ );
 
-	/* Also include working memory needed to sort the input: */
-	plan->sortWorkMemId = add_workmem(root->glob);
+		plan->plan.workmem_id = add_hash_workmem(root->glob, workmem);
+	}
+
+	/* Also include estimated memory needed to sort the input: */
+	if (best_path->numSortBuffers > 0)
+	{
+		int			workmem = compute_agg_input_workmem(subplan->plan_rows,
+														subplan->plan_width);
+
+		plan->sortWorkMemId =
+			add_workmems(root->glob, workmem, best_path->numSortBuffers);
+	}
 
 	return plan;
 }
@@ -2466,6 +2503,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 	int			maxref;
 	List	   *chain;
 	ListCell   *lc;
+	int			num_sort_aggs = 0;
+	int			max_sort_agg_workmem = 0;
+	double		sum_hash_agg_workmem = 0.0;
 
 	/* Shouldn't get here without grouping sets */
 	Assert(root->parse->groupingSets);
@@ -2527,6 +2567,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 			Plan	   *sort_plan = NULL;
 			Agg		   *agg_plan;
 			AggStrategy strat;
+			bool		cost_sort;
+			int			workmem;
 
 			new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2577,6 +2619,33 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 				first_sort_agg = agg_plan;
 			}
 
+			/*
+			 * If we're an AGG_SORTED, but not the last, we need to cost
+			 * working memory needed to produce our "sort_out" buffer.
+			 */
+			cost_sort = foreach_current_index(lc) < list_length(rollups) - 1;
+
+			/* Estimated memory needed to hold the output: */
+			workmem =
+				compute_agg_output_workmem(root, agg_plan->aggstrategy,
+										   agg_plan->numGroups,
+										   agg_plan->transitionSpace,
+										   subplan->plan_rows,
+										   subplan->plan_width,
+										   cost_sort);
+
+			if (agg_plan->aggstrategy == AGG_HASHED)
+			{
+				/* All Hash Grouping Sets share the same workmem limit. */
+				sum_hash_agg_workmem += workmem;
+			}
+			else if (agg_plan->aggstrategy == AGG_SORTED)
+			{
+				/* Every Sort Grouping Set gets its own workmem limit. */
+				max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+				++num_sort_aggs;
+			}
+
 			chain = lappend(chain, agg_plan);
 		}
 	}
@@ -2588,6 +2657,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		RollupData *rollup = linitial(rollups);
 		AttrNumber *top_grpColIdx;
 		int			numGroupCols;
+		bool		cost_sort;
+		int			workmem;
 
 		top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
 
@@ -2610,6 +2681,27 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		/* Copy cost data from Path to Plan */
 		copy_generic_path_info(&plan->plan, &best_path->path);
 
+		/*
+		 * If we're an AGG_SORTED, but not the last, we need to cost working
+		 * memory needed to produce our "sort_out" buffer.
+		 */
+		cost_sort = list_length(rollups) > 1;
+
+		/*
+		 * Replace the overall workmem estimate that we copied from the Path
+		 * with finer-grained estimates, computed below from the subplan's
+		 * row and width estimates.
+		 */
+
+		/* Estimated memory needed to hold the output: */
+		workmem =
+			compute_agg_output_workmem(root, plan->aggstrategy,
+									   plan->numGroups,
+									   plan->transitionSpace,
+									   subplan->plan_rows,
+									   subplan->plan_width,
+									   cost_sort);
+
 		/*
 		 * NOTE: We will place the workmem needed to sort the input (if any)
 		 * on the first agg, the Hash workmem on the first Hash agg, and the
@@ -2618,20 +2710,37 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
 		if (plan->aggstrategy == AGG_HASHED || plan->aggstrategy == AGG_MIXED)
 		{
 			/* All Hash Grouping Sets share the same workmem limit. */
-			plan->plan.workmem_id = add_hash_workmem(root->glob);
+			sum_hash_agg_workmem += workmem;
+			plan->plan.workmem_id = add_hash_workmem(root->glob,
+													 sum_hash_agg_workmem);
 		}
 		else if (plan->aggstrategy == AGG_SORTED)
 		{
 			/* Every Sort Grouping Set gets its own workmem limit. */
+			max_sort_agg_workmem = Max(max_sort_agg_workmem, workmem);
+			++num_sort_aggs;
+
 			first_sort_agg = plan;
 		}
 
 		/* Store the workmem limit, for all Sorts, on the first Sort. */
-		if (first_sort_agg)
-			first_sort_agg->plan.workmem_id = add_workmem(root->glob);
+		if (num_sort_aggs > 1)
+		{
+			first_sort_agg->plan.workmem_id =
+				add_workmems(root->glob, max_sort_agg_workmem,
+							 num_sort_aggs > 2 ? 2 : 1);
+		}
 
 		/* Also include working memory needed to sort the input: */
-		plan->sortWorkMemId = add_workmem(root->glob);
+		if (best_path->numSortBuffers > 0)
+		{
+			workmem = compute_agg_input_workmem(subplan->plan_rows,
+												subplan->plan_width);
+
+			plan->sortWorkMemId =
+				add_workmems(root->glob, workmem,
+							 best_path->numSortBuffers * list_length(rollups));
+		}
 	}
 
 	return (Plan *) plan;
@@ -2796,7 +2905,8 @@ create_windowagg_plan(PlannerInfo *root, WindowAggPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2838,7 +2948,9 @@ create_setop_plan(PlannerInfo *root, SetOpPath *best_path, int flags)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_hash_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->path.workmem));
 
 	return plan;
 }
@@ -2876,11 +2988,38 @@ create_recursiveunion_plan(PlannerInfo *root, RecursiveUnionPath *best_path)
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace our overall "workmem" estimate with estimates at finer
+	 * granularity.
+	 */
+
+	/*
+	 * Include memory for working and intermediate tables.  Since we'll
+	 * repeatedly swap the two tables, use the larger of the two as our
+	 * working-memory estimate.
+	 *
+	 * NOTE: The Path's "workmem" estimate is for the whole Path, but the
+	 * Plan's "workmem" estimates are *per data structure*. So, this value is
+	 * half of the corresponding Path's value.
+	 */
+	plan->plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(Max(relation_byte_size(leftplan->plan_rows,
+																 leftplan->plan_width),
+											  relation_byte_size(rightplan->plan_rows,
+																 rightplan->plan_width))),
+					 2);
 
 	/* Also include working memory for hash table. */
 	if (plan->numCols > 0)
-		plan->hashWorkMemId = add_hash_workmem(root->glob);
+	{
+		Size		entrysize =
+			sizeof(TupleHashEntryData) + plan->plan.plan_width;
+
+		plan->hashWorkMemId =
+			add_hash_workmem(root->glob,
+							 normalize_work_bytes(plan->numGroups * entrysize));
+	}
 
 	return plan;
 }
@@ -3322,6 +3461,7 @@ create_bitmap_scan_plan(PlannerInfo *root,
 
 	/* Process the bitmapqual tree into a Plan tree and qual lists */
 	bitmapqualplan = create_bitmap_subplan(root, best_path->bitmapqual,
+										   0.0 /* max_ancestor_rows */ ,
 										   &bitmapqualorig, &indexquals,
 										   &indexECs);
 
@@ -3433,9 +3573,24 @@ create_bitmap_scan_plan(PlannerInfo *root,
  */
 static Plan *
 create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
+					  Cardinality max_ancestor_rows,
 					  List **qual, List **indexqual, List **indexECs)
 {
 	Plan	   *plan;
+	Cost		cost;			/* not used */
+	Selectivity selec;
+	Cardinality plan_rows;
+
+	/* How many rows will this node output? */
+	cost_bitmap_tree_node(bitmapqual, &cost, &selec);
+	plan_rows = clamp_row_est(selec * bitmapqual->parent->tuples);
+
+	/*
+	 * At runtime, we'll reuse the left-most child's TID bitmap. Let that
+	 * child know to request enough working memory to hold all its
+	 * ancestors' results.
+	 */
+	max_ancestor_rows = Max(max_ancestor_rows, plan_rows);
 
 	if (IsA(bitmapqual, BitmapAndPath))
 	{
@@ -3461,6 +3616,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3472,8 +3629,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_and(subplans);
 		plan->startup_cost = apath->path.startup_cost;
 		plan->total_cost = apath->path.total_cost;
-		plan->plan_rows =
-			clamp_row_est(apath->bitmapselectivity * apath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = apath->path.parallel_safe;
@@ -3508,6 +3664,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			List	   *subindexEC;
 
 			subplan = create_bitmap_subplan(root, (Path *) lfirst(l),
+											foreach_current_index(l) == 0 ?
+											max_ancestor_rows : 0.0,
 											&subqual, &subindexqual,
 											&subindexEC);
 			subplans = lappend(subplans, subplan);
@@ -3536,8 +3694,7 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 			plan = (Plan *) make_bitmap_or(subplans);
 			plan->startup_cost = opath->path.startup_cost;
 			plan->total_cost = opath->path.total_cost;
-			plan->plan_rows =
-				clamp_row_est(opath->bitmapselectivity * opath->path.parent->tuples);
+			plan->plan_rows = plan_rows;
 			plan->plan_width = 0;	/* meaningless */
 			plan->parallel_aware = false;
 			plan->parallel_safe = opath->path.parallel_safe;
@@ -3583,13 +3740,14 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
-		plan->plan_rows =
-			clamp_row_est(ipath->indexselectivity * ipath->path.parent->tuples);
+		plan->plan_rows = plan_rows;
 		plan->plan_width = 0;	/* meaningless */
 		plan->parallel_aware = false;
 		plan->parallel_safe = ipath->path.parallel_safe;
 
-		plan->workmem_id = add_workmem(root->glob);
+		plan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(tbm_calculate_bytes(max_ancestor_rows)));
 
 		/* Extract original index clauses, actual index quals, relevant ECs */
 		subquals = NIL;
@@ -3898,7 +4056,15 @@ create_functionscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	/*
+	 * Replace the path's total working-memory estimate with a per-function
+	 * estimate.
+	 */
+	scan_plan->scan.plan.workmem_id =
+		add_workmems(root->glob,
+					 normalize_work_bytes(relation_byte_size(scan_plan->scan.plan.plan_rows,
+															 scan_plan->scan.plan.plan_width)),
+					 list_length(functions));
 
 	return scan_plan;
 }
@@ -3943,7 +4109,8 @@ create_tablefuncscan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4083,7 +4250,8 @@ create_ctescan_plan(PlannerInfo *root, Path *best_path,
 
 	copy_generic_path_info(&scan_plan->scan.plan, best_path);
 
-	scan_plan->scan.plan.workmem_id = add_workmem(root->glob);
+	scan_plan->scan.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(best_path->workmem));
 
 	return scan_plan;
 }
@@ -4786,8 +4954,10 @@ create_mergejoin_plan(PlannerInfo *root,
 		 */
 		copy_plan_costsize(matplan, inner_plan);
 		matplan->total_cost += cpu_operator_cost * matplan->plan_rows;
-
-		matplan->workmem_id = add_workmem(root->glob);
+		matplan->workmem_id =
+			add_workmem(root->glob,
+						normalize_work_bytes(relation_byte_size(matplan->plan_rows,
+																matplan->plan_width)));
 
 		inner_plan = matplan;
 	}
@@ -5135,7 +5305,9 @@ create_hashjoin_plan(PlannerInfo *root,
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
 	/* Assign workmem to the Hash subnode, not its parent HashJoin node. */
-	hash_plan->plan.workmem_id = add_hash_workmem(root->glob);
+	hash_plan->plan.workmem_id =
+		add_hash_workmem(root->glob,
+						 normalize_work_kb(best_path->jpath.path.workmem));
 
 	return join_plan;
 }
@@ -5690,7 +5862,8 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 	plan->plan.parallel_aware = false;
 	plan->plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->plan.workmem_id = add_workmem(root->glob);
+	plan->plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -5723,7 +5896,8 @@ label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
 	plan->sort.plan.parallel_aware = false;
 	plan->sort.plan.parallel_safe = lefttree->parallel_safe;
 
-	plan->sort.plan.workmem_id = add_workmem(root->glob);
+	plan->sort.plan.workmem_id =
+		add_workmem(root->glob, normalize_work_kb(sort_path.workmem));
 }
 
 /*
@@ -6821,7 +6995,8 @@ materialize_finished_plan(PlannerGlobal *glob, Plan *subplan)
 	matplan->parallel_aware = false;
 	matplan->parallel_safe = subplan->parallel_safe;
 
-	matplan->workmem_id = add_workmem(glob);
+	matplan->workmem_id =
+		add_workmem(glob, normalize_work_kb(matpath.workmem));
 
 	return matplan;
 }
@@ -7590,12 +7765,22 @@ is_projection_capable_plan(Plan *plan)
 }
 
 static int
-add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
+add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category,
+					 int estimate, int count)
 {
+	if (estimate == 0 || count == 0)
+		return 0;
+
 	glob->workMemCategories = lappend_int(glob->workMemCategories, category);
+	glob->workMemEstimates = lappend_int(glob->workMemEstimates, estimate);
+	glob->workMemCounts = lappend_int(glob->workMemCounts, count);
 	/* the executor will fill this in later: */
 	glob->workMemLimits = lappend_int(glob->workMemLimits, 0);
 
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemEstimates));
+	Assert(list_length(glob->workMemCategories) ==
+		   list_length(glob->workMemCounts));
 	Assert(list_length(glob->workMemCategories) ==
 		   list_length(glob->workMemLimits));
 
@@ -7608,10 +7793,10 @@ add_workmem_internal(PlannerGlobal *glob, WorkMemCategory category)
  *
  * This data structure will have its working-memory limit set to work_mem.
  */
-int
-add_workmem(PlannerGlobal *glob)
+static int
+add_workmem(PlannerGlobal *glob, int estimate)
 {
-	return add_workmem_internal(glob, WORKMEM_NORMAL);
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, 1);
 }
 
 /*
@@ -7622,7 +7807,13 @@ add_workmem(PlannerGlobal *glob)
  * hash_mem_multiplier.
  */
 int
-add_hash_workmem(PlannerGlobal *glob)
+add_hash_workmem(PlannerGlobal *glob, int estimate)
 {
-	return add_workmem_internal(glob, WORKMEM_HASH);
+	return add_workmem_internal(glob, WORKMEM_HASH, estimate, 1);
+}
+
+static int
+add_workmems(PlannerGlobal *glob, int estimate, int count)
+{
+	return add_workmem_internal(glob, WORKMEM_NORMAL, estimate, count);
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a431808be96..007e298565a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -585,6 +585,8 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	result->stmt_len = parse->stmt_len;
 
 	result->workMemCategories = glob->workMemCategories;
+	result->workMemEstimates = glob->workMemEstimates;
+	result->workMemCounts = glob->workMemCounts;
 	result->workMemLimits = glob->workMemLimits;
 
 	result->jitFlags = PGJIT_NONE;
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index c0a2f04a8c3..0d0fb5cf8ed 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -691,5 +691,17 @@ get_agg_clause_costs(PlannerInfo *root, AggSplit aggsplit, AggClauseCosts *costs
 			costs->finalCost.startup += argcosts.startup;
 			costs->finalCost.per_tuple += argcosts.per_tuple;
 		}
+
+		/*
+		 * How many aggrefs need to sort their input? (Each such aggref gets
+		 * its own sort buffer. The logic here MUST match the corresponding
+		 * logic in function build_pertrans_for_aggref().)
+		 */
+		if (!AGGKIND_IS_ORDERED_SET(aggref->aggkind) &&
+			!aggref->aggpresorted &&
+			(aggref->aggdistinct || aggref->aggorder))
+		{
+			++costs->numSortBuffers;
+		}
 	}
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index a4c5867cdcb..070b86563b1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1737,6 +1737,13 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
 	pathnode->path.rows = subpath->rows;
 
+	/*
+	 * For now, set workmem to the hash memory limit. Function
+	 * cost_memoize_rescan() will adjust this field, same as it does for field
+	 * "est_entries".
+	 */
+	pathnode->path.workmem = normalize_work_bytes(get_hash_memory_limit());
+
 	return pathnode;
 }
 
@@ -1965,12 +1972,14 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		pathnode->path.disabled_nodes = agg_path.disabled_nodes;
 		pathnode->path.startup_cost = agg_path.startup_cost;
 		pathnode->path.total_cost = agg_path.total_cost;
+		pathnode->path.workmem = agg_path.workmem;
 	}
 	else
 	{
 		pathnode->path.disabled_nodes = sort_path.disabled_nodes;
 		pathnode->path.startup_cost = sort_path.startup_cost;
 		pathnode->path.total_cost = sort_path.total_cost;
+		pathnode->path.workmem = sort_path.workmem;
 	}
 
 	rel->cheapest_unique_path = (Path *) pathnode;
@@ -2317,6 +2326,13 @@ create_worktablescan_path(PlannerInfo *root, RelOptInfo *rel,
 	/* Cost is the same as for a regular CTE scan */
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
+	/*
+	 * But working memory used is 0, since the worktable scan doesn't create a
+	 * tuplestore -- it just reuses a tuplestore already created by a
+	 * recursive union.
+	 */
+	pathnode->workmem = 0;
+
 	return pathnode;
 }
 
@@ -3314,6 +3330,7 @@ create_agg_path(PlannerInfo *root,
 
 	pathnode->aggstrategy = aggstrategy;
 	pathnode->aggsplit = aggsplit;
+	pathnode->numSortBuffers = aggcosts ? aggcosts->numSortBuffers : 0;
 	pathnode->numGroups = numGroups;
 	pathnode->transitionSpace = aggcosts ? aggcosts->transitionSpace : 0;
 	pathnode->groupClause = groupClause;
@@ -3364,6 +3381,8 @@ create_groupingsets_path(PlannerInfo *root,
 	ListCell   *lc;
 	bool		is_first = true;
 	bool		is_first_sort = true;
+	int			num_sort_nodes = 0;
+	double		max_sort_workmem = 0.0;
 
 	/* The topmost generated Plan node will be an Agg */
 	pathnode->path.pathtype = T_Agg;
@@ -3400,6 +3419,7 @@ create_groupingsets_path(PlannerInfo *root,
 		pathnode->path.pathkeys = NIL;
 
 	pathnode->aggstrategy = aggstrategy;
+	pathnode->numSortBuffers = agg_costs ? agg_costs->numSortBuffers : 0;
 	pathnode->rollups = rollups;
 	pathnode->qual = having_qual;
 	pathnode->transitionSpace = agg_costs ? agg_costs->transitionSpace : 0;
@@ -3463,6 +3483,8 @@ create_groupingsets_path(PlannerInfo *root,
 						 subpath->pathtarget->width);
 				if (!rollup->is_hashed)
 					is_first_sort = false;
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 			else
 			{
@@ -3475,6 +3497,12 @@ create_groupingsets_path(PlannerInfo *root,
 						  work_mem,
 						  -1.0);
 
+				/*
+				 * We costed sorting the previous "sort" rollup's "sort_out"
+				 * buffer. How much memory did it need?
+				 */
+				max_sort_workmem = Max(max_sort_workmem, sort_path.workmem);
+
 				/* Account for cost of aggregation */
 
 				cost_agg(&agg_path, root,
@@ -3488,12 +3516,17 @@ create_groupingsets_path(PlannerInfo *root,
 						 sort_path.total_cost,
 						 sort_path.rows,
 						 subpath->pathtarget->width);
+
+				pathnode->path.workmem += agg_path.workmem;
 			}
 
 			pathnode->path.disabled_nodes += agg_path.disabled_nodes;
 			pathnode->path.total_cost += agg_path.total_cost;
 			pathnode->path.rows += agg_path.rows;
 		}
+
+		if (!rollup->is_hashed)
+			++num_sort_nodes;
 	}
 
 	/* add tlist eval cost for each output row */
@@ -3501,6 +3534,17 @@ create_groupingsets_path(PlannerInfo *root,
 	pathnode->path.total_cost += target->cost.startup +
 		target->cost.per_tuple * pathnode->path.rows;
 
+	/*
+	 * Include working memory needed to sort agg output. If there's only 1
+	 * sort rollup, then we don't need any memory. If there are 2 sort
+	 * rollups, we need enough memory for 1 sort buffer. If there are >= 3
+	 * sort rollups, we need only 2 sort buffers, since we're
+	 * double-buffering.
+	 */
+	pathnode->path.workmem += num_sort_nodes > 2 ?
+		max_sort_workmem * 2.0 :
+		max_sort_workmem;
+
 	return pathnode;
 }
 
@@ -3650,7 +3694,8 @@ create_windowagg_path(PlannerInfo *root,
 				   subpath->disabled_nodes,
 				   subpath->startup_cost,
 				   subpath->total_cost,
-				   subpath->rows);
+				   subpath->rows,
+				   subpath->pathtarget->width);
 
 	/* add tlist eval cost for each output row */
 	pathnode->path.startup_cost += target->cost.startup;
@@ -3775,7 +3820,11 @@ create_setop_path(PlannerInfo *root,
 			MAXALIGN(SizeofMinimalTupleHeader);
 		if (hashentrysize * numGroups > get_hash_memory_limit())
 			pathnode->path.disabled_nodes++;
+
+		pathnode->path.workmem =
+			normalize_work_bytes(numGroups * hashentrysize);
 	}
+
 	pathnode->path.rows = outputRows;
 
 	return pathnode;
@@ -3826,7 +3875,7 @@ create_recursiveunion_path(PlannerInfo *root,
 	pathnode->wtParam = wtParam;
 	pathnode->numGroups = numGroups;
 
-	cost_recursive_union(&pathnode->path, leftpath, rightpath);
+	cost_recursive_union(pathnode, leftpath, rightpath);
 
 	return pathnode;
 }
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4e9e0d1de1..6cd9bffbee5 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -63,7 +63,8 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
-									int *num_skew_mcvs);
+									int *num_skew_mcvs,
+									int *workmem);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 extern void ExecHashEstimate(HashState *node, ParallelContext *pcxt);
 extern void ExecHashInitializeDSM(HashState *node, ParallelContext *pcxt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d543011d92a..e15c37608d1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1283,6 +1283,18 @@ typedef struct PlanState
 #define workMemField(node, field)   \
 	(workMemFieldFromId((node), field, ((PlanState *)(node))->plan->workmem_id))
 
+/* workmem estimate: */
+#define workMemEstimateFromId(node, id) \
+	(workMemFieldFromId(node, workMemEstimates, id))
+#define workMemEstimate(node) \
+	(workMemField(node, workMemEstimates))
+
+/* workmem count: */
+#define workMemCountFromId(node, id) \
+	(workMemFieldFromId(node, workMemCounts, id))
+#define workMemCount(node) \
+	(workMemField(node, workMemCounts))
+
 /* workmem limit: */
 #define workMemLimitFromId(node, id) \
 	(workMemFieldFromId(node, workMemLimits, id))
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 181437ac933..779a56ede1a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -60,6 +60,7 @@ typedef struct AggClauseCosts
 	QualCost	transCost;		/* total per-input-row execution costs */
 	QualCost	finalCost;		/* total per-aggregated-row costs */
 	Size		transitionSpace;	/* space for pass-by-ref transition data */
+	int			numSortBuffers; /* # of required input-sort buffers */
 } AggClauseCosts;
 
 /*
@@ -188,9 +189,12 @@ typedef struct PlannerGlobal
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannerGlobal;
@@ -1807,6 +1811,7 @@ typedef struct Path
 	int			disabled_nodes; /* count of disabled nodes */
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
 	Cost		total_cost;		/* total cost (assuming all tuples fetched) */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* sort ordering of path's output; a List of PathKey nodes; see above */
 	List	   *pathkeys;
@@ -2411,6 +2416,7 @@ typedef struct AggPath
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy, see nodes.h */
 	AggSplit	aggsplit;		/* agg-splitting mode, see nodes.h */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	Cardinality numGroups;		/* estimated number of groups in input */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
 	List	   *groupClause;	/* a list of SortGroupClause's */
@@ -2452,6 +2458,7 @@ typedef struct GroupingSetsPath
 	Path		path;
 	Path	   *subpath;		/* path representing input source */
 	AggStrategy aggstrategy;	/* basic strategy */
+	int			numSortBuffers; /* number of inputs that require sorting */
 	List	   *rollups;		/* list of RollupData */
 	List	   *qual;			/* quals (HAVING quals), if any */
 	uint64		transitionSpace;	/* for pass-by-ref transition data */
@@ -3495,6 +3502,7 @@ typedef struct JoinCostWorkspace
 
 	/* Fields below here should be treated as private to costsize.c */
 	Cost		run_cost;		/* non-startup cost components */
+	Cost		workmem;		/* estimated work_mem (in KB) */
 
 	/* private for cost_nestloop code */
 	Cost		inner_run_cost; /* also used by cost_mergejoin code */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ba8fdc2e6db..2134b15f95f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -160,9 +160,12 @@ typedef struct PlannedStmt
 	 * needs working memory for a data structure maintains a "workmem_id"
 	 * index into the following lists (all kept in sync).
 	 */
-
 	/* - IntList (of WorkMemCategory): is this a Hash or "normal" limit? */
 	List	   *workMemCategories;
+	/* - IntList: estimate (in KB) of memory needed to avoid spilling */
+	List	   *workMemEstimates;
+	/* - IntList: how many data structures get a copy of this info */
+	List	   *workMemCounts;
 	/* - IntList: limit (in KB), after which data structure must spill */
 	List	   *workMemLimits;
 } PlannedStmt;
@@ -1191,6 +1194,8 @@ typedef struct Agg
 	Oid		   *grpOperators pg_node_attr(array_size(numCols));
 	Oid		   *grpCollations pg_node_attr(array_size(numCols));
 
+	/* number of inputs that require sorting */
+	int			numSorts;
 	/* 1-based id of workMem to use to sort inputs, or else zero */
 	int			sortWorkMemId;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 99f795ceab5..d89a0f71a72 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -108,6 +108,7 @@ extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
 													dsa_pointer dp);
 extern int	tbm_calculate_entries(Size maxbytes);
+extern double tbm_calculate_bytes(double maxentries);
 
 extern TBMIterator tbm_begin_iterate(TIDBitmap *tbm,
 									 dsa_area *dsa, dsa_pointer dsp);
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b523bcda8f3..ef80f6f9339 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -106,7 +106,7 @@ extern void cost_namedtuplestorescan(Path *path, PlannerInfo *root,
 									 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_resultscan(Path *path, PlannerInfo *root,
 							RelOptInfo *baserel, ParamPathInfo *param_info);
-extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
+extern void cost_recursive_union(RecursiveUnionPath *runion, Path *nrterm, Path *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, int input_disabled_nodes,
 					  Cost input_cost, double tuples, int width,
@@ -139,7 +139,7 @@ extern void cost_windowagg(Path *path, PlannerInfo *root,
 						   List *windowFuncs, WindowClause *winclause,
 						   int input_disabled_nodes,
 						   Cost input_startup_cost, Cost input_total_cost,
-						   double input_tuples);
+						   double input_tuples, int width);
 extern void cost_group(Path *path, PlannerInfo *root,
 					   int numGroupCols, double numGroups,
 					   List *quals,
@@ -218,9 +218,18 @@ extern void set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *re
 extern void set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double relation_byte_size(double tuples, int width);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 								   Path *bitmapqual, double loop_count,
 								   Cost *cost_p, double *tuples_p);
 extern double compute_gather_rows(Path *path);
+extern int	compute_agg_input_workmem(double input_tuples, double input_width);
+extern int	compute_agg_output_workmem(PlannerInfo *root,
+									   AggStrategy aggstrategy,
+									   double numGroups, uint64 transitionSpace,
+									   double input_tuples, double input_width,
+									   bool cost_sort);
+extern int	normalize_work_kb(double nkb);
+extern int	normalize_work_bytes(double nbytes);
 
 #endif							/* COST_H */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 8436136026b..21894adffcc 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -49,8 +49,7 @@ extern Plan *change_plan_targetlist(Plan *subplan, List *tlist,
 extern Plan *materialize_finished_plan(PlannerGlobal *glob, Plan *subplan);
 extern bool is_projection_capable_path(Path *path);
 extern bool is_projection_capable_plan(Plan *plan);
-extern int	add_workmem(PlannerGlobal *glob);
-extern int	add_hash_workmem(PlannerGlobal *glob);
+extern int	add_hash_workmem(PlannerGlobal *glob, int estimate);
 
 /* External use of these functions is deprecated: */
 extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
-- 
2.39.5

0003-Add-EXPLAIN-work_mem-on-command-option.patch (text/x-diff)
From f4824a23a2726059148de260d15b16dc411b19f1 Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 26 Feb 2025 01:02:19 +0000
Subject: [PATCH 3/4] Add EXPLAIN (work_mem on) command option

So that users can see how much working memory a query is likely to use, as
well as how much memory it will be limited to, this commit adds an
EXPLAIN (work_mem on) command option that displays the workmem estimate
and limit, added in the previous two commits.
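For context, the output this produces can be sketched as follows. This is an
illustrative transcript, not regression-test output: the table and column names
(sales, brand, amount) are made up and the numbers are placeholders; only the
"(work_mem=N kB) (limit=N kB)" node suffix and the two summary lines follow the
format strings added to explain.c by this commit.

```sql
-- Hypothetical session against a server built with this patch series.
EXPLAIN (work_mem on, costs off)
SELECT brand, sum(amount) FROM sales GROUP BY brand ORDER BY brand;

--                       QUERY PLAN
-- ------------------------------------------------------------
--  Sort  (work_mem=1024 kB) (limit=4096 kB)
--    Sort Key: brand
--    ->  HashAggregate  (work_mem=2048 kB) (limit=8192 kB)
--          Group Key: brand
--          ->  Seq Scan on sales
--  Total Working Memory Estimate: 3072 kB
--  Total Working Memory Limit: 12288 kB
```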
---
 src/backend/commands/explain.c        | 231 +++++++++
 src/backend/commands/explain_state.c  |   2 +
 src/backend/executor/nodeHash.c       |   7 +-
 src/backend/optimizer/path/costsize.c |   4 +-
 src/include/commands/explain_state.h  |   3 +
 src/include/executor/nodeHash.h       |   2 +-
 src/test/regress/expected/workmem.out | 653 ++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   2 +-
 src/test/regress/sql/workmem.sql      | 307 ++++++++++++
 9 files changed, 1205 insertions(+), 6 deletions(-)
 create mode 100644 src/test/regress/expected/workmem.out
 create mode 100644 src/test/regress/sql/workmem.sql

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8345bc0264b..bb73ab8741e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,8 @@
 #include "commands/explain_format.h"
 #include "commands/explain_state.h"
 #include "commands/prepare.h"
+#include "executor/hashjoin.h"
+#include "executor/nodeHash.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -29,6 +31,7 @@
 #include "nodes/extensible.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteHandler.h"
@@ -165,6 +168,14 @@ static ExplainWorkersState *ExplainCreateWorkersState(int num_workers);
 static void ExplainOpenWorker(int n, ExplainState *es);
 static void ExplainCloseWorker(int n, ExplainState *es);
 static void ExplainFlushWorkersState(ExplainState *es);
+static void compute_subplan_workmem(List *plans, double *sp_estimate,
+									double *sp_limit);
+static void compute_agg_workmem(PlanState *planstate, Agg *agg,
+								double *agg_estimate, double *agg_limit);
+static void compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+								 double *hash_limit);
+static void increment_workmem(PlanState *planstate, int workmem_id,
+							  double *estimate, double *limit);
 
 
 
@@ -678,6 +689,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainPropertyFloat("Execution Time", "ms", 1000.0 * totaltime, 3,
 							 es);
 
+	if (es->work_mem)
+	{
+		ExplainPropertyFloat("Total Working Memory Estimate", "kB",
+							 es->total_workmem_estimate, 0, es);
+		ExplainPropertyFloat("Total Working Memory Limit", "kB",
+							 es->total_workmem_limit, 0, es);
+	}
+
 	ExplainCloseGroup("Query", NULL, true, es);
 }
 
@@ -1813,6 +1832,72 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
+	if (es->work_mem)
+	{
+		double		plan_estimate = 0.0;
+		double		plan_limit = 0.0;
+
+		/*
+		 * Include working memory used by this Plan's SubPlan objects, whether
+		 * they are included on the Plan's initPlan or subPlan lists.
+		 */
+		compute_subplan_workmem(planstate->initPlan, &plan_estimate,
+								&plan_limit);
+		compute_subplan_workmem(planstate->subPlan, &plan_estimate,
+								&plan_limit);
+
+		/* Include working memory used by this Plan, itself. */
+		switch (nodeTag(plan))
+		{
+			case T_Agg:
+				compute_agg_workmem(planstate, (Agg *) plan,
+									&plan_estimate, &plan_limit);
+				break;
+			case T_Hash:
+				compute_hash_workmem(planstate, &plan_estimate, &plan_limit);
+				break;
+			case T_RecursiveUnion:
+				{
+					RecursiveUnion *runion = (RecursiveUnion *) plan;
+
+					if (runion->hashWorkMemId > 0)
+						increment_workmem(planstate, runion->hashWorkMemId,
+										  &plan_estimate, &plan_limit);
+				}
+				/* FALLTHROUGH */
+			default:
+				if (plan->workmem_id > 0)
+					increment_workmem(planstate, plan->workmem_id,
+									  &plan_estimate, &plan_limit);
+				break;
+		}
+
+		/*
+		 * Every parallel worker (plus the leader) gets its own copy of
+		 * working memory.
+		 */
+		plan_estimate *= (1 + es->num_workers);
+		plan_limit *= (1 + es->num_workers);
+
+		es->total_workmem_estimate += plan_estimate;
+		es->total_workmem_limit += plan_limit;
+
+		if (plan_estimate > 0.0 || plan_limit > 0.0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfo(es->str,
+								 "  (work_mem=%.0f kB) (limit=%.0f kB)",
+								 plan_estimate, plan_limit);
+			else
+			{
+				ExplainPropertyFloat("Working Memory Estimate", "kB",
+									 plan_estimate, 0, es);
+				ExplainPropertyFloat("Working Memory Limit", "kB",
+									 plan_limit, 0, es);
+			}
+		}
+	}
+
 	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
@@ -2366,6 +2451,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (planstate->initPlan)
 		ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/*
+		 * Other than initPlans, every node below us gets the # of planned
+		 * workers we specified.
+		 */
+		Assert(es->num_workers == 0);
+
+		if (nodeTag(plan) == T_Gather)
+			es->num_workers = es->analyze ?
+				((GatherState *) planstate)->nworkers_launched :
+				((Gather *) plan)->num_workers;
+		else
+			es->num_workers = es->analyze ?
+				((GatherMergeState *) planstate)->nworkers_launched :
+				((GatherMerge *) plan)->num_workers;
+	}
+
 	/* lefttree */
 	if (outerPlanState(planstate))
 		ExplainNode(outerPlanState(planstate), ancestors,
@@ -2422,6 +2525,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainCloseGroup("Plans", "Plans", false, es);
 	}
 
+	if (nodeTag(plan) == T_Gather || nodeTag(plan) == T_GatherMerge)
+	{
+		/* End of parallel sub-tree. */
+		es->num_workers = 0;
+	}
+
 	/* in text format, undo whatever indentation we added */
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 		es->indent = save_indent;
@@ -4994,3 +5103,125 @@ ExplainFlushWorkersState(ExplainState *es)
 	pfree(wstate->worker_state_save);
 	pfree(wstate);
 }
+
+/*
+ * compute_subplan_workmem - compute total workmem for a list of SubPlans
+ *
+ * If a SubPlan object uses a hash table, then that hash table needs working
+ * memory. We display that working memory on the owning Plan. This function
+ * increments work_mem counters to include each SubPlan's working memory.
+ */
+static void
+compute_subplan_workmem(List *plans, double *sp_estimate, double *sp_limit)
+{
+	foreach_node(SubPlanState, sps, plans)
+	{
+		SubPlan    *sp = sps->subplan;
+
+		if (sp->hashtab_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashtab_workmem_id,
+							  sp_estimate, sp_limit);
+
+		if (sp->hashnul_workmem_id > 0)
+			increment_workmem(sps->planstate, sp->hashnul_workmem_id,
+							  sp_estimate, sp_limit);
+	}
+}
+
+static void
+compute_agg_workmem_node(PlanState *planstate, Agg *agg, double *agg_estimate,
+						 double *agg_limit)
+{
+	/* Record memory used for output data structures. */
+	if (agg->plan.workmem_id > 0)
+		increment_workmem(planstate, agg->plan.workmem_id, agg_estimate,
+						  agg_limit);
+
+	/* Record memory used for input sort buffers. */
+	if (agg->sortWorkMemId > 0)
+		increment_workmem(planstate, agg->sortWorkMemId, agg_estimate,
+						  agg_limit);
+}
+
+/*
+ * compute_agg_workmem - compute Agg node's total workmem estimate and limit
+ *
+ * An Agg node might point to a chain of additional Agg nodes. When we explain
+ * the plan, we display only the first, "main" Agg node.
+ */
+static void
+compute_agg_workmem(PlanState *planstate, Agg *agg, double *agg_estimate,
+					double *agg_limit)
+{
+	compute_agg_workmem_node(planstate, agg, agg_estimate, agg_limit);
+
+	/* Also include the chain of GROUPING SETS aggs. */
+	foreach_node(Agg, aggnode, agg->chain)
+		compute_agg_workmem_node(planstate, aggnode, agg_estimate, agg_limit);
+}
+
+/*
+ * compute_hash_workmem - compute total workmem for a Hash node
+ *
+ * This function is complicated, because we can currently adjust workmem limits
+ * for Hash (Joins) at runtime, and because the per-batch memory a Hash (Join)
+ * needs is not currently counted against the workmem limit.
+ *
+ * Here, we try to give a more accurate accounting than we'd get from just
+ * displaying limit * count.
+ */
+static void
+compute_hash_workmem(PlanState *planstate, double *hash_estimate,
+					 double *hash_limit)
+{
+	double		count = workMemCount(planstate);
+	double		estimate = workMemEstimate(planstate);
+	size_t		limit = workMemLimit(planstate);
+	HashState  *hstate = (HashState *) planstate;
+	Plan	   *plan = planstate->plan;
+	Hash	   *hash = (Hash *) plan;
+	Plan	   *outerNode = outerPlan(plan);
+	double		rows;
+	size_t		nbytes;
+	size_t		total_space_allowed;	/* ignored */
+	int			nbuckets;		/* ignored */
+	int			nbatch;
+	int			num_skew_mcvs;	/* ignored */
+	int			workmem_estimate;	/* ignored */
+
+	/*
+	 * For Hash Joins, we currently don't count per-batch memory against the
+	 * "workmem_limit", but we can at least estimate it for display with the
+	 * Plan.
+	 */
+	rows = plan->parallel_aware ? hash->rows_total : outerNode->plan_rows;
+	nbytes = limit * 1024;
+
+	ExecChooseHashTableSize(rows, outerNode->plan_width,
+							OidIsValid(hash->skewTable),
+							hstate->parallel_state != NULL,
+							hstate->parallel_state != NULL ?
+							hstate->parallel_state->nparticipants - 1 : 0,
+							&nbytes, &total_space_allowed,
+							&nbuckets, &nbatch, &num_skew_mcvs,
+							&workmem_estimate);
+
+	/* Include space for per-batch memory, if any: 2 blocks per batch. */
+	if (nbatch > 1)
+		nbytes += nbatch * 2 * BLCKSZ;
+
+	Assert(nbytes >= limit * 1024);
+
+	*hash_estimate += estimate * count;
+	*hash_limit += (double) normalize_work_bytes(nbytes) * count;
+}
+
+static void
+increment_workmem(PlanState *planstate, int workmem_id, double *estimate,
+				  double *limit)
+{
+	double		count = workMemCountFromId(planstate, workmem_id);
+
+	*estimate += workMemEstimateFromId(planstate, workmem_id) * count;
+	*limit += workMemLimitFromId(planstate, workmem_id) * count;
+}
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 60d98d63a62..eafa15b6795 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -115,6 +115,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 		}
 		else if (strcmp(opt->defname, "memory") == 0)
 			es->memory = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "work_mem") == 0)
+			es->work_mem = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "serialize") == 0)
 		{
 			if (opt->arg)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 7d09ac8b5a3..6ae3d649be6 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -482,7 +482,7 @@ ExecHashTableCreate(HashState *state)
 							state->parallel_state != NULL,
 							state->parallel_state != NULL ?
 							state->parallel_state->nparticipants - 1 : 0,
-							worker_space_allowed,
+							&worker_space_allowed,
 							&space_allowed,
 							&nbuckets, &nbatch, &num_skew_mcvs, &workmem);
 
@@ -666,7 +666,7 @@ void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						bool try_combined_hash_mem,
 						int parallel_workers,
-						size_t worker_space_allowed,
+						size_t *worker_space_allowed,
 						size_t *total_space_allowed,
 						int *numbuckets,
 						int *numbatches,
@@ -699,7 +699,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/*
 	 * Caller tells us our (per-worker) in-memory hashtable size limit.
 	 */
-	hash_table_bytes = worker_space_allowed;
+	hash_table_bytes = *worker_space_allowed;
 
 	/*
 	 * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -963,6 +963,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch /= 2;
 		nbuckets *= 2;
 
+		*worker_space_allowed = (*worker_space_allowed) * 2;
 		*total_space_allowed = (*total_space_allowed) * 2;
 	}
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 27daa1966c2..d7fd4b214d8 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4309,6 +4309,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			numbuckets;
 	int			numbatches;
 	int			num_skew_mcvs;
+	size_t		worker_space_allowed;
 	size_t		space_allowed;	/* unused */
 
 	/* Count up disabled nodes. */
@@ -4354,12 +4355,13 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	 * XXX at some point it might be interesting to try to account for skew
 	 * optimization in the cost estimate, but for now, we don't.
 	 */
+	worker_space_allowed = get_hash_memory_limit();
 	ExecChooseHashTableSize(inner_path_rows_total,
 							inner_path->pathtarget->width,
 							true,	/* useskew */
 							parallel_hash,	/* try_combined_hash_mem */
 							outer_path->parallel_workers,
-							get_hash_memory_limit(),
+							&worker_space_allowed,
 							&space_allowed,
 							&numbuckets,
 							&numbatches,
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 32728f5d1a1..98639a32aec 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -69,6 +69,9 @@ typedef struct ExplainState
 	bool		hide_workers;	/* set if we find an invisible Gather */
 	int			rtable_size;	/* length of rtable excluding the RTE_GROUP
 								 * entry */
+	int			num_workers;	/* # of worker processes *planned* to use */
+	double		total_workmem_estimate; /* total working memory estimate */
+	double		total_workmem_limit;	/* total working memory limit */
 	/* state related to the current plan node */
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 	/* extensions */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 6cd9bffbee5..b346a270b67 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -59,7 +59,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 									bool try_combined_hash_mem,
 									int parallel_workers,
-									size_t worker_space_allowed,
+									size_t *worker_space_allowed,
 									size_t *total_space_allowed,
 									int *numbuckets,
 									int *numbatches,
diff --git a/src/test/regress/expected/workmem.out b/src/test/regress/expected/workmem.out
new file mode 100644
index 00000000000..ca8edde6d5f
--- /dev/null
+++ b/src/test/regress/expected/workmem.out
@@ -0,0 +1,653 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+-- Unique -> hash agg
+set enable_hashagg = on;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                         workmem_filter                          
+-----------------------------------------------------------------
+ Sort  (work_mem=N kB) (limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(11 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Unique -> sort
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ Sort  (work_mem=N kB) (limit=4096 kB)
+   Sort Key: onek.unique1
+   ->  Nested Loop
+         ->  Unique
+               ->  Sort  (work_mem=N kB) (limit=4096 kB)
+                     Sort Key: "*VALUES*".column1, "*VALUES*".column2
+                     ->  Values Scan on "*VALUES*"
+         ->  Index Scan using onek_unique1 on onek
+               Index Cond: (unique1 = "*VALUES*".column1)
+               Filter: ("*VALUES*".column2 = ten)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(12 rows)
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+       1 |     214 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BAAAAA   | GIAAAA   | OOOOxx
+      20 |     306 |   0 |    0 |   0 |      0 |       0 |       20 |          20 |        20 |       20 |   0 |    1 | UAAAAA   | ULAAAA   | OOOOxx
+      99 |     101 |   1 |    3 |   9 |     19 |       9 |       99 |          99 |        99 |       99 |  18 |   19 | VDAAAA   | XDAAAA   | HHHHxx
+(3 rows)
+
+reset enable_hashagg;
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+                     workmem_filter                      
+---------------------------------------------------------
+ Limit
+   ->  Incremental Sort  (work_mem=N kB) (limit=8192 kB)
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort  (work_mem=N kB) (limit=4096 kB)
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(9 rows)
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+ unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
+    4220 |    5017 |   0 |    0 |   0 |      0 |      20 |      220 |         220 |      4220 |     4220 |  40 |   41 | IGAAAA   | ZKHAAA   | HHHHxx
+(1 row)
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+                                 workmem_filter                                 
+--------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Hash Join
+               Hash Cond: (t3.thousand = t1.unique1)
+               ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+                     Group Key: t3.thousand, t3.tenthous
+                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t3
+               ->  Hash  (work_mem=N kB) (limit=8192 kB)
+                     ->  Index Only Scan using onek_unique1 on onek t1
+                           Index Cond: (unique1 < 1)
+         ->  Index Only Scan using tenk1_hundred on tenk1 t2
+               Index Cond: (hundred = t3.tenthous)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(14 rows)
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+ count 
+-------
+   100
+(1 row)
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+                               workmem_filter                               
+----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Nested Loop Left Join
+               Filter: (t4.f1 IS NULL)
+               ->  Seq Scan on int4_tbl t2
+               ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+                     ->  Nested Loop Left Join
+                           Join Filter: (t3.f1 > 1)
+                           ->  Seq Scan on int4_tbl t3
+                                 Filter: (f1 > 0)
+                           ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+                                 ->  Seq Scan on int4_tbl t4
+         ->  Seq Scan on int4_tbl t1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(15 rows)
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+ count 
+-------
+     0
+(1 row)
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=4096 kB)
+   ->  Sort  (work_mem=N kB) (limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=4096 kB)
+   ->  Sort  (work_mem=N kB) (limit=4096 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=8192 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=4096 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 20480 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+                           workmem_filter                            
+---------------------------------------------------------------------
+ Finalize HashAggregate  (work_mem=N kB) (limit=8192 kB)
+   Group Key: (length((stringu1)::text))
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial HashAggregate  (work_mem=N kB) (limit=40960 kB)
+               Group Key: length((stringu1)::text)
+               ->  Parallel Seq Scan on tenk1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 49152 kB
+(9 rows)
+
+select length(stringu1) from tenk1 group by length(stringu1);
+ length 
+--------
+      6
+(1 row)
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+             QUERY PLAN              
+-------------------------------------
+ Aggregate
+   ->  Seq Scan on tenk1
+ Total Working Memory Estimate: 0 kB
+ Total Working Memory Limit: 0 kB
+(4 rows)
+
+select MAX(length(stringu1)) from tenk1;
+ max 
+-----
+   6
+(1 row)
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=12288 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+                              workmem_filter                              
+--------------------------------------------------------------------------
+ Nested Loop
+   ->  Seq Scan on workmem_xmldata
+   ->  Table Function Scan on "xmltable"  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(5 rows)
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+ id | _id | country_name | country_id | region_id | size | unit | premier_name 
+----+-----+--------------+------------+-----------+------+------+--------------
+(0 rows)
+
+drop table workmem_xmldata;
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ SetOp Except
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Index Only Scan using tenk1_unique2 on tenk1 tenk1_1
+         Filter: (unique2 <> 10)
+ Total Working Memory Estimate: 0 kB
+ Total Working Memory Limit: 0 kB
+(6 rows)
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+ unique1 
+---------
+      10
+(1 row)
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+                          workmem_filter                          
+------------------------------------------------------------------
+ Aggregate
+   ->  HashSetOp Intersect  (work_mem=N kB) (limit=8192 kB)
+         ->  Seq Scan on tenk1
+         ->  Index Only Scan using tenk1_unique1 on tenk1 tenk1_1
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 8192 kB
+(6 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+ count 
+-------
+  5000
+(1 row)
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+                               workmem_filter                                
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on onek o
+               Filter: (ten = 1)
+         ->  Memoize  (work_mem=N kB) (limit=8192 kB)
+               Cache Key: o.four
+               Cache Mode: binary
+               ->  CTE Scan on x  (work_mem=N kB) (limit=4096 kB)
+                     CTE x
+                       ->  Recursive Union  (work_mem=N kB) (limit=16384 kB)
+                             ->  Result
+                             ->  WorkTable Scan on x x_1
+                                   Filter: (a < 10)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 28672 kB
+(15 rows)
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+ sum  | sum  
+------+------
+ 1700 | 5350
+(1 row)
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+                           workmem_filter                           
+--------------------------------------------------------------------
+ Aggregate
+   CTE q1
+     ->  HashAggregate  (work_mem=N kB) (limit=8192 kB)
+           Group Key: tenk1.hundred
+           ->  Seq Scan on tenk1
+   InitPlan 2
+     ->  Aggregate
+           ->  CTE Scan on q1 qsub  (work_mem=N kB) (limit=4096 kB)
+   ->  CTE Scan on q1  (work_mem=N kB) (limit=4096 kB)
+         Filter: ((y)::numeric > (InitPlan 2).col1)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(12 rows)
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+ count 
+-------
+    50
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                     
+---------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=4096 kB)
+         ->  Sort  (work_mem=N kB) (limit=4096 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 12288 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+                                            workmem_filter                                             
+-------------------------------------------------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         Join Filter: (((a.unique1 = 1) AND (b.unique1 = 2)) OR ((a.unique2 = 3) AND (b.hundred = 4)))
+         ->  Bitmap Heap Scan on tenk1 b
+               Recheck Cond: ((hundred = 4) OR (unique1 = 2))
+               ->  BitmapOr
+                     ->  Bitmap Index Scan on tenk1_hundred  (work_mem=N kB) (limit=4096 kB)
+                           Index Cond: (hundred = 4)
+                     ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB) (limit=4096 kB)
+                           Index Cond: (unique1 = 2)
+         ->  Materialize  (work_mem=N kB) (limit=4096 kB)
+               ->  Bitmap Heap Scan on tenk1 a
+                     Recheck Cond: ((unique2 = 3) OR (unique1 = 1))
+                     ->  BitmapOr
+                           ->  Bitmap Index Scan on tenk1_unique2  (work_mem=N kB) (limit=4096 kB)
+                                 Index Cond: (unique2 = 3)
+                           ->  Bitmap Index Scan on tenk1_unique1  (work_mem=N kB) (limit=4096 kB)
+                                 Index Cond: (unique1 = 1)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 20480 kB
+(20 rows)
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+ count 
+-------
+   101
+(1 row)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+              workmem_filter              
+------------------------------------------
+ Result  (work_mem=N kB) (limit=16384 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 16384 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=16384 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=8192 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 24576 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fbffc67ae60..0d59d37bb35 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -123,7 +123,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression compression_lz4 memoize stats predicate numa
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression compression_lz4 memoize stats predicate numa workmem
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/workmem.sql b/src/test/regress/sql/workmem.sql
new file mode 100644
index 00000000000..2de22be0427
--- /dev/null
+++ b/src/test/regress/sql/workmem.sql
@@ -0,0 +1,307 @@
+----
+-- Tests that show "work_mem" output to EXPLAIN plans.
+----
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+-- Unique -> hash agg
+set enable_hashagg = on;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Unique -> sort
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+');
+
+select *
+from onek
+where (unique1,ten) in (values (1,1), (20,0), (99,9), (17,99))
+order by unique1;
+
+reset enable_hashagg;
+
+-- Incremental Sort
+select workmem_filter('
+explain (costs off, work_mem on)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+');
+
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- Hash Join
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+');
+
+select count(*) from (
+select t1.unique1, t2.hundred
+from onek t1, tenk1 t2
+where exists (select 1 from tenk1 t3
+              where t3.thousand = t1.unique1 and t3.tenthous = t2.hundred)
+      and t1.unique1 < 1
+) t;
+
+-- Materialize
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+');
+
+select count(*) from (
+select t1.f1
+from int4_tbl t1, int4_tbl t2
+  left join int4_tbl t3 on t3.f1 > 0
+  left join int4_tbl t4 on t3.f1 > 1
+where t4.f1 is null
+) t;
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Agg (hash, parallel)
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select length(stringu1) from tenk1 group by length(stringu1);
+');
+
+select length(stringu1) from tenk1 group by length(stringu1);
+
+reset parallel_setup_cost;
+reset parallel_tuple_cost;
+reset min_parallel_table_scan_size;
+reset max_parallel_workers_per_gather;
+
+-- Agg (simple) [no work_mem]
+explain (costs off, work_mem on)
+select MAX(length(stringu1)) from tenk1;
+
+select MAX(length(stringu1)) from tenk1;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- Table Function Scan
+CREATE TABLE workmem_xmldata(data xml);
+
+select workmem_filter('
+EXPLAIN (COSTS OFF, work_mem on)
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE(''/ROWS/ROW''
+                         PASSING data
+                         COLUMNS id int PATH ''@id'',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH ''COUNTRY_NAME'' NOT NULL,
+                                  country_id text PATH ''COUNTRY_ID'',
+                                  region_id int PATH ''REGION_ID'',
+                                  size float PATH ''SIZE'',
+                                  unit text PATH ''SIZE/@unit'',
+                                  premier_name text PATH ''PREMIER_NAME'' DEFAULT ''not specified'');
+');
+
+SELECT  xmltable.*
+   FROM (SELECT data FROM workmem_xmldata) x,
+        LATERAL XMLTABLE('/ROWS/ROW'
+                         PASSING data
+                         COLUMNS id int PATH '@id',
+                                  _id FOR ORDINALITY,
+                                  country_name text PATH 'COUNTRY_NAME' NOT NULL,
+                                  country_id text PATH 'COUNTRY_ID',
+                                  region_id int PATH 'REGION_ID',
+                                  size float PATH 'SIZE',
+                                  unit text PATH 'SIZE/@unit',
+                                  premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
+
+drop table workmem_xmldata;
+
+-- SetOp [no work_mem]
+explain (costs off, work_mem on)
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
+
+-- HashSetOp
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+');
+
+select count(*) from
+  ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
+
+-- RecursiveUnion and Memoize (also WorkTable Scan [no work_mem])
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+');
+
+select sum(o.four), sum(ss.a) from onek o
+cross join lateral (with recursive x(a) as (
+          select o.four as a union select a + 1 from x where a < 10)
+    select * from x) ss where o.ten = 1;
+
+-- CTE Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+');
+
+WITH q1(x,y) AS (
+    SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
+  )
+SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- Bitmap Heap Scan
+select workmem_filter('
+explain (costs off, work_mem on)
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+');
+
+select count(*) from (
+select * from tenk1 a join tenk1 b on
+  (a.unique1 = 1 and b.unique1 = 2) or (a.unique2 = 3 and b.hundred = 4)
+);
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
-- 
2.39.5

Attachment: 0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (text/x-diff; charset=utf-8)
From f0e5f69ac14927bb057bd765a48af4de2023b7bd Mon Sep 17 00:00:00 2001
From: James Hunter <james.hunter.pg@gmail.com>
Date: Wed, 5 Mar 2025 01:21:20 +0000
Subject: [PATCH 4/4] Add "workmem_hook" to allow extensions to override
 per-node work_mem

---
 contrib/Makefile                     |   3 +-
 contrib/meson.build                  |   1 +
 contrib/workmem/Makefile             |  20 +
 contrib/workmem/expected/workmem.out | 676 +++++++++++++++++++++++++++
 contrib/workmem/meson.build          |  28 ++
 contrib/workmem/sql/workmem.sql      | 304 ++++++++++++
 contrib/workmem/workmem.c            | 409 ++++++++++++++++
 src/backend/executor/execWorkmem.c   |  40 +-
 src/include/executor/executor.h      |   4 +
 9 files changed, 1474 insertions(+), 11 deletions(-)
 create mode 100644 contrib/workmem/Makefile
 create mode 100644 contrib/workmem/expected/workmem.out
 create mode 100644 contrib/workmem/meson.build
 create mode 100644 contrib/workmem/sql/workmem.sql
 create mode 100644 contrib/workmem/workmem.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..042f5128376 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -51,7 +51,8 @@ SUBDIRS = \
 		tsm_system_rows \
 		tsm_system_time \
 		unaccent	\
-		vacuumlo
+		vacuumlo	\
+		workmem
 
 ifeq ($(with_ssl),openssl)
 SUBDIRS += pgcrypto sslinfo
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..a1b283789b5 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -70,4 +70,5 @@ subdir('tsm_system_time')
 subdir('unaccent')
 subdir('uuid-ossp')
 subdir('vacuumlo')
+subdir('workmem')
 subdir('xml2')
diff --git a/contrib/workmem/Makefile b/contrib/workmem/Makefile
new file mode 100644
index 00000000000..f920cdb9964
--- /dev/null
+++ b/contrib/workmem/Makefile
@@ -0,0 +1,20 @@
+# contrib/workmem/Makefile
+
+MODULE_big = workmem
+OBJS = \
+	$(WIN32RES) \
+	workmem.o
+PGFILEDESC = "workmem - extension that adjusts PostgreSQL work_mem per node"
+
+REGRESS = workmem
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/workmem
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/workmem/expected/workmem.out b/contrib/workmem/expected/workmem.out
new file mode 100644
index 00000000000..f69883b0005
--- /dev/null
+++ b/contrib/workmem/expected/workmem.out
@@ -0,0 +1,676 @@
+load 'workmem';
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=25600 kB)
+   ->  Sort  (work_mem=N kB) (limit=25600 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=51200 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=20480 kB)
+   ->  Sort  (work_mem=N kB) (limit=20480 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=40960 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=20480 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                               workmem_filter                                
+-----------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=102400 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=102399 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102399 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                     workmem_filter                                     
+----------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=34134 kB)
+         ->  Sort  (work_mem=N kB) (limit=34133 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=34133 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+              workmem_filter               
+-------------------------------------------
+ Result  (work_mem=N kB) (limit=102400 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=68267 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=34133 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 102400 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=1024 kB)
+   ->  Sort  (work_mem=N kB) (limit=1024 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=2048 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=820 kB)
+   ->  Sort  (work_mem=N kB) (limit=819 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=1638 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=819 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                              workmem_filter                               
+---------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=4096 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=4095 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4095 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+                                    workmem_filter                                     
+---------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=1366 kB)
+         ->  Sort  (work_mem=N kB) (limit=1365 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=1365 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+             workmem_filter              
+-----------------------------------------
+ Result  (work_mem=N kB) (limit=4096 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=2731 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=1365 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 4096 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=20 kB)
+   ->  Sort  (work_mem=N kB) (limit=20 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  HashAggregate  (work_mem=N kB) (limit=40 kB)
+               Hash Key: "*VALUES*".column1, "*VALUES*".column2
+               Hash Key: "*VALUES*".column1
+               ->  Values Scan on "*VALUES*"
+                     Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+(4 rows)
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                            
+----------------------------------------------------------------------
+ WindowAgg  (work_mem=N kB) (limit=16 kB)
+   ->  Sort  (work_mem=N kB) (limit=16 kB)
+         Sort Key: "*VALUES*".column1, "*VALUES*".column2 NULLS FIRST
+         ->  GroupAggregate  (work_mem=N kB) (limit=32 kB)
+               Group Key: "*VALUES*".column1, "*VALUES*".column2
+               Group Key: "*VALUES*".column1
+               Sort Key: "*VALUES*".column2
+                 Group Key: "*VALUES*".column2
+               Sort Key: "*VALUES*".column3
+                 Group Key: "*VALUES*".column3
+               Sort Key: "*VALUES*".column4
+                 Group Key: "*VALUES*".column4
+               ->  Sort  (work_mem=N kB) (limit=16 kB)
+                     Sort Key: "*VALUES*".column1
+                     ->  Values Scan on "*VALUES*"
+                           Filter: (column1 = column2)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(18 rows)
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ a | b | row_number 
+---+---+------------
+ 1 |   |          1
+ 1 | 1 |          2
+ 2 |   |          3
+ 2 | 2 |          4
+   |   |          5
+   |   |          6
+   |   |          7
+   |   |          8
+   | 1 |          9
+   | 2 |         10
+(10 rows)
+
+reset enable_hashagg;
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+                             workmem_filter                              
+-------------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series a  (work_mem=N kB) (limit=80 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(4 rows)
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+ count 
+-------
+  2000
+(1 row)
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                            workmem_filter                             
+-----------------------------------------------------------------------
+ Aggregate
+   ->  Function Scan on generate_series  (work_mem=N kB) (limit=78 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 78 kB
+(4 rows)
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ count 
+-------
+    12
+(1 row)
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                                   workmem_filter                                    
+-------------------------------------------------------------------------------------
+ Limit
+   ->  WindowAgg  (work_mem=N kB) (limit=27 kB)
+         ->  Sort  (work_mem=N kB) (limit=27 kB)
+               Sort Key: ((a.n < 3))
+               ->  Function Scan on generate_series a  (work_mem=N kB) (limit=26 kB)
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+   sum   
+---------
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+ 2000997
+(5 rows)
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+            workmem_filter             
+---------------------------------------
+ Result  (work_mem=N kB) (limit=80 kB)
+   SubPlan 1
+     ->  Append
+           ->  Result
+           ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(7 rows)
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ f
+(1 row)
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+                         workmem_filter                         
+----------------------------------------------------------------
+ Result  (work_mem=N kB) (limit=54 kB)
+   SubPlan 3
+     ->  Result  (work_mem=N kB) (limit=26 kB)
+           One-Time Filter: (ANY (1 = (hashed SubPlan 2).col1))
+           InitPlan 1
+             ->  Result
+           SubPlan 2
+             ->  Result
+ Total Working Memory Estimate: N kB
+ Total Working Memory Limit: 80 kB
+(10 rows)
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+WARNING:  not enough working memory for query: increase workmem.query_work_mem
+ ?column? 
+----------
+ t
+(1 row)
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/meson.build b/contrib/workmem/meson.build
new file mode 100644
index 00000000000..fce8030ba45
--- /dev/null
+++ b/contrib/workmem/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+workmem_sources = files(
+  'workmem.c',
+)
+
+if host_system == 'windows'
+  workmem_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'workmem',
+    '--FILEDESC', 'workmem - extension that adjusts PostgreSQL work_mem per node',])
+endif
+
+workmem = shared_module('workmem',
+  workmem_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += workmem
+
+tests += {
+  'name': 'workmem',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'workmem',
+    ],
+  },
+}
diff --git a/contrib/workmem/sql/workmem.sql b/contrib/workmem/sql/workmem.sql
new file mode 100644
index 00000000000..4e1ec056b80
--- /dev/null
+++ b/contrib/workmem/sql/workmem.sql
@@ -0,0 +1,304 @@
+load 'workmem';
+
+-- Note: Function derived from file explain.sql. We can't use that other
+-- function, since we're run in parallel with explain.sql.
+create or replace function workmem_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        -- Mask out work_mem estimate, since it might be brittle
+        ln := regexp_replace(ln, '\mwork_mem=\d+\M', 'work_mem=N', 'g');
+        ln := regexp_replace(ln, '\mMemory Estimate: \d+\M', 'Memory Estimate: N', 'g');
+        return next ln;
+    end loop;
+end;
+$$;
+
+--====
+-- Test suite 1: default workmem.query_work_mem (= 100 MB)
+--====
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+--====
+-- Test suite 2: set workmem.query_work_mem to 4 MB
+--====
+set workmem.query_work_mem = 4096;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
+
+--====
+-- Test suite 3: set workmem.query_work_mem to 80 KB
+--====
+set workmem.query_work_mem = 80;
+
+----
+-- Some tests from src/test/regress/sql/workmem.sql that don't require
+-- test_setup.sql, etc., to be run first.
+----
+
+-- Grouping Sets (Hash)
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1), (2, 2)) as t (a, b) where a = b
+group by grouping sets((a, b), (a));
+
+-- Grouping Sets (Sort)
+set enable_hashagg = off;
+
+select workmem_filter('
+explain (costs off, work_mem on)
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+');
+
+select a, b, row_number() over (order by a, b nulls first)
+from (values (1, 1, 1, 1), (2, 2, 2, 2)) as t (a, b, c, d) where a = b
+group by grouping sets((a, b), (a), (b), (c), (d));
+
+reset enable_hashagg;
+
+-- Function Scan
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+');
+
+select count(*) from (
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+) t;
+
+-- Three Function Scans
+select workmem_filter('
+explain (work_mem on, costs off)
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+');
+
+select count(*)
+from rows from(generate_series(1, 5),
+               generate_series(2, 10),
+               generate_series(4, 15));
+
+-- WindowAgg
+select workmem_filter('
+explain (costs off, work_mem on)
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+');
+
+select sum(n) over(partition by m)
+from (SELECT n < 3 as m, n from generate_series(1,2000) a(n))
+limit 5;
+
+-- InitPlan with hash table ("IN SELECT")
+select workmem_filter('
+explain (costs off, work_mem on)
+select ''foo''::text in (select ''bar''::name union all select ''bar''::name);
+');
+
+select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
+
+-- SubPlan with hash table
+select workmem_filter('
+explain (costs off, work_mem on)
+select 1 = any (select (select 1) where 1 = any (select 1));
+');
+
+select 1 = any (select (select 1) where 1 = any (select 1));
+
+reset workmem.query_work_mem;
diff --git a/contrib/workmem/workmem.c b/contrib/workmem/workmem.c
new file mode 100644
index 00000000000..d78f60c7d8d
--- /dev/null
+++ b/contrib/workmem/workmem.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * workmem.c
+ *	  Distribute workmem.query_work_mem among a query's execution nodes.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/workmem/workmem.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "common/int.h"
+#include "executor/executor.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+
+PG_MODULE_MAGIC;
+
+/* Local variables */
+
+/*
+ * A Target represents a collection of data structures, belonging to an
+ * execution node, that all share the same memory limit.
+ *
+ * For example, in parallel query, every parallel worker (plus the leader)
+ * gets a copy of the execution node, and therefore a copy of all of that
+ * node's work_mem limits. In this case, we'll track a single Target, but its
+ * count will include (1 + num_workers), because this Target gets "applied"
+ * to (1 + num_workers) execution nodes.
+ */
+typedef struct Target
+{
+	/* # of data structures to which target applies: */
+	int			count;
+	/* workmem estimate for each of these data structures: */
+	int			workmem;
+	/* (original) workmem limit for each of these data structures: */
+	int			limit;
+	/* workmem estimate, but capped at (original) workmem limit: */
+	int			priority;
+	/* ratio of (priority / limit); measures the Target's "greediness": */
+	double		ratio;
+	/* link to target's actual limit, so we can set it: */
+	int		   *target_limit;
+}			Target;
+
+typedef struct WorkMemStats
+{
+	/* total # of data structures that get working memory: */
+	uint64		count;
+	/* total working memory estimated for this query: */
+	uint64		workmem;
+	/* total working memory (currently) reserved for this query: */
+	uint64		limit;
+	/* total "capped" working memory estimate: */
+	uint64		priority;
+	/* list of Targets, used to update actual workmem limits: */
+	List	   *targets;
+}			WorkMemStats;
+
+/* GUC variables */
+static int	workmem_query_work_mem = 100 * 1024;	/* kB */
+
+/* internal functions */
+static void workmem_fn(PlannedStmt *plannedstmt);
+
+static int	clamp_priority(int workmem, int limit);
+static Target * make_target(int workmem, int *target_limit, int count);
+static void add_target(WorkMemStats * workmem_stats, Target * target);
+
+/* Sort comparators: sort by ratio, ascending or descending. */
+static int	target_compare_asc(const ListCell *a, const ListCell *b);
+static int	target_compare_desc(const ListCell *a, const ListCell *b);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Define custom GUC variable. */
+	DefineCustomIntVariable("workmem.query_work_mem",
+							"Amount of working memory (in kB) to provide to "
+							"each query.",
+							NULL,
+							&workmem_query_work_mem,
+							100 * 1024, /* default to 100 MB */
+							-1, /* -1 disables the adjustment */
+							INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_KB,
+							NULL,
+							NULL,
+							NULL);
+
+	MarkGUCPrefixReserved("workmem");
+
+	/* Install hooks. */
+	ExecAssignWorkMem_hook = workmem_fn;
+}
+
+static void
+workmem_analyze(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			idx;
+
+	for (idx = 0; idx < list_length(plannedstmt->workMemCategories); ++idx)
+	{
+		WorkMemCategory category;
+		int			count;
+		int			estimate;
+		ListCell   *limit_cell;
+		int			limit;
+		Target	   *target;
+
+		category =
+			(WorkMemCategory) list_nth_int(plannedstmt->workMemCategories, idx);
+		count = list_nth_int(plannedstmt->workMemCounts, idx);
+		estimate = list_nth_int(plannedstmt->workMemEstimates, idx);
+
+		limit = category == WORKMEM_HASH ?
+			get_hash_memory_limit() / 1024 : work_mem;
+		limit_cell = list_nth_cell(plannedstmt->workMemLimits, idx);
+		lfirst_int(limit_cell) = limit;
+
+		target = make_target(estimate, &lfirst_int(limit_cell), count);
+		add_target(workmem_stats, target);
+	}
+}
+
+static void
+workmem_set(PlannedStmt *plannedstmt, WorkMemStats * workmem_stats)
+{
+	int			remaining = workmem_query_work_mem;
+
+	if (workmem_stats->limit <= remaining)
+	{
+		/*
+		 * "High memory" case: we have more than enough query_work_mem; now
+		 * hand out the excess.
+		 */
+
+		/* This is memory that exceeds workmem limits. */
+		remaining -= workmem_stats->limit;
+
+		/*
+		 * Sort targets from highest ratio to lowest. When we assign memory to
+		 * a Target, we'll truncate fractional KB; so by going through the
+		 * list from highest to lowest ratio, we ensure that the lowest ratios
+		 * get the leftover fractional KBs.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/* NOTE: This is extra workmem *per data structure*. */
+			extra_workmem = (int) (fraction * remaining);
+
+			*target->target_limit += extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else if (workmem_stats->priority <= remaining)
+	{
+		/*
+		 * "Medium memory" case: we don't have enough query_work_mem to give
+		 * every target its full allotment, but we do have enough to give it
+		 * as much as (we estimate) it needs.
+		 *
+		 * So, just take some memory away from nodes that (we estimate) won't
+		 * need it.
+		 */
+
+		/* This is memory that exceeds workmem estimates. */
+		remaining -= workmem_stats->priority;
+
+		/*
+		 * Sort targets from highest ratio to lowest. We'll skip any Target
+		 * with ratio > 1.0, because (we estimate) they already need their
+		 * full allotment. Also, once a target reaches its workmem limit,
+		 * we'll stop giving it more workmem, leaving the surplus memory to be
+		 * assigned to targets with smaller ratios.
+		 */
+		list_sort(workmem_stats->targets, target_compare_desc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		fraction;
+			int			extra_workmem;
+
+			/* How much extra work mem should we assign to this target? */
+			fraction = (double) target->workmem / workmem_stats->workmem;
+
+			/*
+			 * Don't give the target more than its (original) limit.
+			 *
+			 * NOTE: This is extra workmem *per data structure*.
+			 */
+			extra_workmem = Min((int) (fraction * remaining),
+								target->limit - target->priority);
+
+			*target->target_limit = target->priority + extra_workmem;
+
+			/* OK, we've handled this target. */
+			workmem_stats->workmem -= (target->workmem * target->count);
+			remaining -= (extra_workmem * target->count);
+		}
+	}
+	else
+	{
+		uint64		limit = workmem_stats->limit;
+
+		/*
+		 * "Low memory" case: we are severely memory constrained, and need to
+		 * take "priority" memory away from targets that (we estimate)
+		 * actually need it. We'll do this by (effectively) reducing the
+		 * global "work_mem" limit, uniformly, for all targets, until we're
+		 * under the query_work_mem limit.
+		 */
+		elog(WARNING,
+			 "not enough working memory for query: increase "
+			 "workmem.query_work_mem");
+
+		/*
+		 * Sort targets from lowest ratio to highest. For any target whose
+		 * ratio is < the target_ratio, we'll just assign it its priority (=
+		 * workmem) as limit, and return the excess workmem to our "limit",
+		 * for use by subsequent, greedier, targets.
+		 */
+		list_sort(workmem_stats->targets, target_compare_asc);
+
+		foreach_ptr(Target, target, workmem_stats->targets)
+		{
+			double		target_ratio;
+			int			target_limit;
+
+			/*
+			 * If we restrict our targets to this ratio, we'll stay within the
+			 * query_work_mem limit.
+			 */
+			target_ratio = (double) remaining / limit;
+
+			/*
+			 * Don't give this target more than its priority request (but we
+			 * might give it less).
+			 */
+			target_limit = Min(target->priority,
+							   target_ratio * target->limit);
+			*target->target_limit = target_limit;
+
+			/* "Remaining" decreases by memory we actually assigned. */
+			remaining -= (target_limit * target->count);
+
+			/*
+			 * "Limit" decreases by target's original memory limit.
+			 *
+			 * If target_limit < target->priority, so we restricted this
+			 * target to less memory than (we estimate) it needs, then the
+			 * target_ratio will stay the same, since, letting A = remaining,
+			 * B = limit, and R = ratio, we'll have:
+			 *
+			 * R=A/B <=> A=R*B <=> A-R*X = R*B - R*X <=> A-R*X = R * (B-X) <=>
+			 * R = (A-R*X) / (B-X)
+			 *
+			 * -- which is what we wanted to prove.
+			 *
+			 * And if target_limit > target->priority, so we didn't need to
+			 * restrict this target beyond its priority estimate, then the
+			 * target_ratio will increase. This means more memory for the
+			 * remaining, greedier, targets.
+			 */
+			limit -= (target->limit * target->count);
+
+			target_ratio = (double) remaining / limit;
+		}
+	}
+}
+
+/*
+ * workmem_fn: updates the query plan's work_mem based on query_work_mem
+ */
+static void
+workmem_fn(PlannedStmt *plannedstmt)
+{
+	WorkMemStats workmem_stats;
+	MemoryContext context,
+				oldcontext;
+
+	/*
+	 * We already assigned working-memory limits on the leader, and those
+	 * limits were sent to the workers inside the serialized Plan.
+	 *
+	 * We could re-assign working-memory limits on the parallel worker, to
+	 * only those Plan nodes that got sent to the worker, but for now we don't
+	 * bother.
+	 */
+	if (IsParallelWorker())
+		return;
+
+	if (workmem_query_work_mem == -1)
+		return;					/* disabled */
+
+	memset(&workmem_stats, 0, sizeof(workmem_stats));
+
+	/*
+	 * Set up our own memory context, so we can drop the metadata we generate,
+	 * all at once.
+	 */
+	context = AllocSetContextCreate(CurrentMemoryContext,
+									"workmem_fn context",
+									ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+
+	/* Figure out how much total working memory this query wants/needs. */
+	workmem_analyze(plannedstmt, &workmem_stats);
+
+	/* Now restrict the query to workmem.query_work_mem. */
+	workmem_set(plannedstmt, &workmem_stats);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Drop all our metadata. */
+	MemoryContextDelete(context);
+}
+
+static int
+clamp_priority(int workmem, int limit)
+{
+	return Min(workmem, limit);
+}
+
+static Target *
+make_target(int workmem, int *target_limit, int count)
+{
+	Target	   *result = palloc_object(Target);
+
+	result->count = count;
+	result->workmem = workmem;
+	result->limit = *target_limit;
+	result->priority = clamp_priority(result->workmem, result->limit);
+	result->ratio = (double) result->priority / result->limit;
+	result->target_limit = target_limit;
+
+	return result;
+}
+
+static void
+add_target(WorkMemStats * workmem_stats, Target * target)
+{
+	workmem_stats->count += target->count;
+	workmem_stats->workmem += target->count * target->workmem;
+	workmem_stats->limit += target->count * target->limit;
+	workmem_stats->priority += target->count * target->priority;
+	workmem_stats->targets = lappend(workmem_stats->targets, target);
+}
+
+/* This "ascending" comparator sorts least-greedy Targets first. */
+static int
+target_compare_asc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in ascending order: smallest ratio first, then (if ratios equal)
+	 * smallest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return pg_cmp_s32(((Target *) a->ptr_value)->workmem,
+						  ((Target *) b->ptr_value)->workmem);
+	}
+	else
+		return a_val > b_val ? 1 : -1;
+}
+
+/* This "descending" comparator sorts most-greedy Targets first. */
+static int
+target_compare_desc(const ListCell *a, const ListCell *b)
+{
+	double		a_val = ((Target *) a->ptr_value)->ratio;
+	double		b_val = ((Target *) b->ptr_value)->ratio;
+
+	/*
+	 * Sort in descending order: largest ratio first, then (if ratios equal)
+	 * largest workmem.
+	 */
+	if (a_val == b_val)
+	{
+		return pg_cmp_s32(((Target *) b->ptr_value)->workmem,
+						  ((Target *) a->ptr_value)->workmem);
+	}
+	else
+		return b_val > a_val ? 1 : -1;
+}
diff --git a/src/backend/executor/execWorkmem.c b/src/backend/executor/execWorkmem.c
index d8a19a58ebe..37420666065 100644
--- a/src/backend/executor/execWorkmem.c
+++ b/src/backend/executor/execWorkmem.c
@@ -52,6 +52,10 @@
 #include "nodes/plannodes.h"
 
 
+/* Hook for plugins to get control in ExecAssignWorkMem */
+ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook = NULL;
+
+
 /* ------------------------------------------------------------------------
  *		ExecAssignWorkMem
  *
@@ -64,20 +68,36 @@
  */
 void
 ExecAssignWorkMem(PlannedStmt *plannedstmt)
+{
+	if (ExecAssignWorkMem_hook)
+		(*ExecAssignWorkMem_hook) (plannedstmt);
+	else
+	{
+		/*
+		 * No need to re-assign working memory on parallel workers, since
+		 * workers have the same work_mem and hash_mem_multiplier GUCs as the
+		 * leader.
+		 *
+		 * We already assigned working-memory limits on the leader, and those
+		 * limits were sent to the workers inside the serialized Plan.
+		 *
+		 * We check IsParallelWorker() here, rather than inside
+		 * standard_ExecAssignWorkMem(), in case the hook wants to
+		 * re-assign memory on parallel workers, and perhaps to call
+		 * standard_ExecAssignWorkMem() first as well.
+		 */
+		if (IsParallelWorker())
+			return;
+
+		standard_ExecAssignWorkMem(plannedstmt);
+	}
+}
+
+void
+standard_ExecAssignWorkMem(PlannedStmt *plannedstmt)
 {
 	ListCell   *lc_category;
 	ListCell   *lc_limit;
 
-	/*
-	 * No need to re-assign working memory on parallel workers, since workers
-	 * have the same work_mem and hash_mem_multiplier GUCs as the leader.
-	 *
-	 * We already assigned working-memory limits on the leader, and those
-	 * limits were sent to the workers inside the serialized Plan.
-	 */
-	if (IsParallelWorker())
-		return;
-
 	forboth(lc_category, plannedstmt->workMemCategories,
 			lc_limit, plannedstmt->workMemLimits)
 	{
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6008e3bc63c..7a34fa47489 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -96,6 +96,9 @@ typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
 											  bool ereport_on_violation);
 extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 
+/* Hook for plugins to get control in ExecAssignWorkMem() */
+typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *plannedstmt);
+extern PGDLLIMPORT ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;
 
 /*
  * prototypes from functions in execAmi.c
@@ -802,5 +805,6 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
  * prototypes from functions in execWorkmem.c
  */
 extern void ExecAssignWorkMem(PlannedStmt *plannedstmt);
+extern void standard_ExecAssignWorkMem(PlannedStmt *plannedstmt);
 
 #endif							/* EXECUTOR_H  */
-- 
2.39.5

#26Jeff Davis
pgsql@j-davis.com
In reply to: Álvaro Herrera (#25)
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

On Tue, 2025-08-05 at 14:15 +0200, Álvaro Herrera wrote:

> Here's a rebased version of this patch.  I didn't review it or touch
> it in any way, just fixed conflicts from current master.

James,

Patch 0001 is doing too much.

For a first step, I think it would be useful to create a new field for
each Plan node, and use that to enforce the execution-time memory
limit. There's a related discussion here about prepared statements:

/messages/by-id/83fbc36b66077e6ed0ad3a1c18fff3a7d2b22d36.camel@j-davis.com

so we may need to force replans if work_mem changes.

That first step wouldn't affect how memory usage is enforced (aside
from the prepared statement issue), because that field would just be a
copy of work_mem anyway. But once it's in place, extensions could
experiment by tweaking the work memory of individual plan nodes with a
planner_hook.

There's still a lot of work to do, but that work could be broken into
the following mostly-independent efforts:

* You point out that it's hard for an extension to handle subplans
without walking the expression tree. I haven't looked at this problem
in detail, but we can look at that as a separate change.

* Save the estimates, as well, which enables an extension to be smarter
about setting limits.

* We can consider starting from the paths first before copying to the
plan nodes. That would enable extensions to use set_rel_pathlist_hook
to affect the structure of the plan before it's generated rather than
just the per-node enforced limits after planning is done. For this to
be useful, we'd also need to have some additional infrastructure to
keep higher-cost lower-memory paths around. We'd need to find some
practical ways to prevent path explosion here, though.

* We should be more consistent about tracking memory usage by using
MemoryContextMemAllocated() where it makes sense.

* Consider enforcing the limit across all significant data structures
used within a node -- rather than the current behavior, where a single
node can use work_mem several times over by using multiple data
structures.

* We should free the memory from a node when execution is complete, not
wait until ExecutorEnd(). What really matters is the maximum
*concurrent* memory usage.

Regards,
Jeff Davis