Add mention of execution time memory for enable_partitionwise_* GUCs
Over on [1], there's a complaint about a query OOMing because the use
of enable_partitionwise_aggregate caused a plan with 1000 Hash
Aggregate nodes.
The only mention in the docs is the additional memory requirements and
CPU for query planning when that GUC is enabled. There's no mention
that execution could use work_mem * nparts more memory. I
think that's bad and we should fix it.
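For illustration, here's a minimal sketch of the kind of setup that can
produce such a plan (table, column and partition names are hypothetical,
and the exact plan chosen depends on the data and settings):

CREATE TABLE measurements (
    sensor_id int,
    reading   numeric
) PARTITION BY HASH (sensor_id);

-- create 1000 hash partitions
SELECT format(
  'CREATE TABLE measurements_p%s PARTITION OF measurements
   FOR VALUES WITH (MODULUS 1000, REMAINDER %s)', i, i)
FROM generate_series(0, 999) AS i \gexec

SET enable_partitionwise_aggregate = on;

-- Grouping by the partition key lets the planner aggregate each partition
-- separately, so the plan can contain one aggregate node per partition,
-- each entitled to its own work_mem.
EXPLAIN SELECT sensor_id, avg(reading)
FROM measurements
GROUP BY sensor_id;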
I've attached my proposal to fix that.
David
[1]: /messages/by-id/3603c380-d094-136e-e333-610914fb3e80@gmx.net
Attachments:
enable_partitionwise_doc_change.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b14c5d81a1..46b978ce08 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5561,9 +5561,11 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
joining the matching partitions. Partitionwise join currently applies
only when the join conditions include all the partition keys, which
must be of the same data type and have one-to-one matching sets of
- child partitions. Because partitionwise join planning can use
- significantly more CPU time and memory during planning, the default is
- <literal>off</literal>.
+ child partitions. Enabling this can result in query plans containing
+ a large number of joins, which can result in a large increase in
+ execution time memory requirements. Additionally, planning becomes
+ more expensive in terms of memory and CPU. For these reasons, the
+ default is <literal>off</literal>.
</para>
</listitem>
</varlistentry>
@@ -5581,9 +5583,11 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
tables to be performed separately for each partition. If the
<literal>GROUP BY</literal> clause does not include the partition
keys, only partial aggregation can be performed on a per-partition
- basis, and finalization must be performed later. Because
- partitionwise grouping or aggregation can use significantly more CPU
- time and memory during planning, the default is
+ basis, and finalization must be performed later. Enabling this can
+ result in query plans containing a large number of sorts or aggregate
+ nodes, which can result in a large increase in execution time memory
+ requirements. Additionally, planning becomes more expensive in terms
+ of memory and CPU. For these reasons, the default is
<literal>off</literal>.
</para>
</listitem>
On Thu, Jul 18, 2024 at 4:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
Over on [1], there's a complaint about a query OOMing because the use
of enable_partitionwise_aggregate caused a plan with 1000 Hash
Aggregate nodes.

The only mention in the docs is the additional memory requirements and
CPU for query planning when that GUC is enabled. There's no mention
that execution could use work_mem * nparts more memory. I
think that's bad and we should fix it.

I've attached my proposal to fix that.
If those GUCs are enabled, the planner consumes a large amount of memory
and also takes longer, irrespective of whether a partitionwise plan is
used or not. That's why the default is false. If the majority of those
joins use nested loops, or use index scans instead of sorting,
memory consumption won't be as large. Saying that it "can" result in a
large increase in execution memory is not accurate. But I agree that
we need to mention the effect of work_mem on partitionwise
join/aggregation.
I had an offlist email exchange with Dimitrios where I suggested that
we should mention this in the work_mem description. I.e. in work_mem
description change "Note that a complex query might perform several
sort and hash operations"
to "Note that a complex query or a query using partitionwise
aggregates or joins might perform several sort and hash operations".
And in the description of enable_partitionwise_* GUCs mention that
"Each of the partitionwise join or aggregation which performs
sorting/hashing may consume work_mem worth of memory increasing the
total memory consumed during query execution.
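As a rough worked example of the scale involved: with the default
work_mem of 4MB and 1000 partitions, a plan in which every partition
gets its own sort or hash node could in the worst case need on the
order of 1000 * 4MB = ~4GB for that one step, versus roughly a single
work_mem's worth for the equivalent non-partitionwise plan (and more
still for hash-based nodes when hash_mem_multiplier > 1).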
--
Best Wishes,
Ashutosh Bapat
On Thu, 18 Jul 2024 at 21:24, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
If those GUCs are enabled, the planner consumes a large amount of memory
and also takes longer, irrespective of whether a partitionwise plan is
used or not. That's why the default is false. If the majority of those
joins use nested loops, or use index scans instead of sorting,
memory consumption won't be as large. Saying that it "can" result in a
large increase in execution memory is not accurate. But I agree that
we need to mention the effect of work_mem on partitionwise
join/aggregation.
hmm? please tell me what word other than "can" best describes
something that is possible to happen but does not always happen under
all circumstances.
David
On Thu, Jul 18, 2024 at 3:33 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 18 Jul 2024 at 21:24, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

If those GUCs are enabled, the planner consumes a large amount of memory
and also takes longer, irrespective of whether a partitionwise plan is
used or not. That's why the default is false. If the majority of those
joins use nested loops, or use index scans instead of sorting,
memory consumption won't be as large. Saying that it "can" result in a
large increase in execution memory is not accurate. But I agree that
we need to mention the effect of work_mem on partitionwise
join/aggregation.

hmm? please tell me what word other than "can" best describes
something that is possible to happen but does not always happen under
all circumstances.
May I suggest "may"? :) [1]https://www.thesaurus.com/e/grammar/can-vs-may/, [2]https://www.britannica.com/dictionary/eb/qa/modal-verbs-may-might-can-could-and-ought, [3]https://www.merriam-webster.com/grammar/when-to-use-can-and-may.
My point is, we need to highlight the role of work_mem. So modify both
the descriptions.
[1]: https://www.thesaurus.com/e/grammar/can-vs-may/
[2]: https://www.britannica.com/dictionary/eb/qa/modal-verbs-may-might-can-could-and-ought
[3]: https://www.merriam-webster.com/grammar/when-to-use-can-and-may
--
Best Wishes,
Ashutosh Bapat
On Thu, 18 Jul 2024 at 22:28, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
On Thu, Jul 18, 2024 at 3:33 PM David Rowley <dgrowleyml@gmail.com> wrote:
hmm? please tell me what word other than "can" best describes
something that is possible to happen but does not always happen under
all circumstances.

May I suggest "may"? :) [1], [2], [3].
Is this a wind-up?
If it's not, I disagree that "may" is a better choice. The
possibility example in your first link says "It may rain tomorrow.
(possibility)", but that's something someone would only say if there
was some evidence to support that, e.g. ominous clouds on the horizon
at dusk, or >0% chance of precipitation on the weather forecast.
Nobody is going to say that unless there's some supporting evidence.
For the executor using work_mem * nparts, we've no evidence either.
It's just a >0% possibility with no supporting evidence.
My point is, we need to highlight the role of work_mem. So modify both
the descriptions.
I considered writing about work_mem, but felt I wanted to keep it as
brief as possible and just have some words that might make someone
think twice. The details in the work_mem documentation should inform
the reader that work_mem is per executor node. It likely wouldn't
hurt to have more documentation around which executor node types can
use a work_mem, which use work_mem * hash_mem_multiplier and which use
neither. We tend to not write too much about executor nodes in the
documents, so I'm not proposing that for this patch.
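To make the per-node nature of those limits concrete, here's a sketch
with example values only (not recommendations):

SET work_mem = '64MB';            -- cap per sort/hash node, not per query
SET hash_mem_multiplier = 2.0;    -- hash-based nodes may use up to
                                  -- work_mem * hash_mem_multiplier

-- A plan containing N such nodes can therefore use up to roughly
-- N * work_mem during execution, or N * work_mem * hash_mem_multiplier
-- for the hash-based ones.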
David
On Thu, Jul 18, 2024 at 4:24 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 18 Jul 2024 at 22:28, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Jul 18, 2024 at 3:33 PM David Rowley <dgrowleyml@gmail.com> wrote:
hmm? please tell me what word other than "can" best describes
something that is possible to happen but does not always happen under
all circumstances.

May I suggest "may"? :) [1], [2], [3].
Is this a wind-up?
If it's not, I disagree that "may" is a better choice. The
possibility example in your first link says "It may rain tomorrow.
(possibility)", but that's something someone would only say if there
was some evidence to support that, e.g. ominous clouds on the horizon
at dusk, or >0% chance of precipitation on the weather forecast.
Nobody is going to say that unless there's some supporting evidence.
For the executor using work_mem * nparts, we've no evidence either.
It's just a >0% possibility with no supporting evidence.
I am not a native English speaker and might have made a mistake when
interpreting the definitions. Will leave that aside.
My point is, we need to highlight the role of work_mem. So modify both
the descriptions.

I considered writing about work_mem, but felt I wanted to keep it as
brief as possible and just have some words that might make someone
think twice. The details in the work_mem documentation should inform
the reader that work_mem is per executor node. It likely wouldn't
hurt to have more documentation around which executor node types can
use a work_mem, which use work_mem * hash_mem_multiplier and which use
neither. We tend to not write too much about executor nodes in the
documents, so I'm not proposing that for this patch.
Something I didn't write in my first reply but wanted to discuss was
the intention of adding those GUCs. Sorry for missing it in my first
email. According to [1] these GUCs were added because of increased
memory consumption during planning and time taken to plan the query.
The execution time memory consumption issue was known even back then,
but the GUCs' default was not set to off because of that. But your patch
proposes to change the narrative. In the same thread [1], you will
find the discussion about turning the default to ON once we have fixed
the planner's memory and time consumption. We have patches addressing
those issues [2] [3]. With your proposed changes we will almost never
have a chance to turn those GUCs ON by default. That seems a rather
sad prospect.
I am fine if we want to mention that the executor may consume a large
amount of memory when these GUCs are turned ON. Users may decide to
turn those OFF if they can not afford to spend that much memory during
execution. But I don't like tying execution time consumption with
default OFF.
[1]: /messages/by-id/CA+TgmoYggDp6k-HXNAgrykZh79w6nv2FevpYR_jeMbrORDaQrA@mail.gmail.com
[2]: /messages/by-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com
[3]: /messages/by-id/CAJ2pMkZNCgoUKSE+_5LthD+KbXKvq6h2hQN8Esxpxd+cxmgomg@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
According to [1] these GUCs were added because of increased
memory consumption during planning and time taken to plan the query.
The execution time memory consumption issue was known even back then,
but the GUCs' default was not set to off because of that. But your patch
proposes to change the narrative. In the same thread [1], you will
find the discussion about turning the default to ON once we have fixed
the planner's memory and time consumption. We have patches addressing
those issues [2] [3]. With your proposed changes we will almost never
have a chance to turn those GUCs ON by default. That seems a rather
sad prospect.
Sad prospect for who? If the day comes and we're considering enabling
these GUCs by default, I think we'll likely also be considering if
it'll be sad for users who get an increased likelihood of OOM kills
because the chosen plan uses $nparts times more memory to execute than
the old plan.
Can you honestly say that you have no concern about increased executor
memory consumption if we switched on these GUCs by default? I was
certainly concerned about it in [5], so I dropped that patch after
realising what could happen.
I am fine if we want to mention that the executor may consume a large
amount of memory when these GUCs are turned ON. Users may decide to
turn those OFF if they can not afford to spend that much memory during
execution. But I don't like tying execution time consumption with
default OFF.
If you were to fix the planner issues then this text would need to be
revisited anyway. However, I seriously question your judgment on
fixing those alone being enough to allow us to switch these on by
default. It seems unlikely that the planner would use anything near a
realistic work_mem setting's worth of extra memory per partition, but
that's what the executor would do, given enough data per partition.
I'd say any analysis that found planner memory and time to be the
only reason to turn these GUCs off by default was incomplete.
Maybe it was a case of stopping the search for further reasons once
enough had been found to make the choice. If not, then they found the
molehill and failed to notice the mountain.
David
[4]: /messages/by-id/3603c380-d094-136e-e333-610914fb3e80@gmx.net
[5]: /messages/by-id/CAApHDvojKdBR3MR59JXmaCYbyHB6Q_5qPRU+dy93En8wm+XiDA@mail.gmail.com
Thank you for the patch improving the docs, I think it's a clear
improvement from before.
On Thu, 18 Jul 2024, David Rowley wrote:
I considered writing about work_mem, but felt I wanted to keep it as
brief as possible and just have some words that might make someone
think twice. The details in the work_mem documentation should inform
the reader that work_mem is per executor node. It likely wouldn't
hurt to have more documentation around which executor node types can
use a work_mem, which use work_mem * hash_mem_multiplier and which use
neither. We tend to not write too much about executor nodes in the
documents, so I'm not proposing that for this patch.
This is the only part I think is missing, since we now know (measurements
in [1], reproducible scenario in [2]) that the number of partitions plays
an important role in sizing the RAM of the server. It's just too big an
effect not to mention that the worst case will be n_partitions * work_mem.
[1]: /messages/by-id/d26e67d3-74bc-60aa-bf24-2a8fb83efe9c@gmx.net
[2]: /messages/by-id/af6ed790-a5fe-19aa-1141-927595604c01@gmx.net
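One way to observe this in practice, sketched with a hypothetical
partitioned table parted_tab and assuming the planner picks a
partitionwise aggregate:

SET enable_partitionwise_aggregate = on;
EXPLAIN (ANALYZE, COSTS OFF)
SELECT key_col, count(*) FROM parted_tab GROUP BY key_col;

-- Each per-partition sort/hash node in the output reports its own memory
-- use (the exact wording varies by node type and version), and those
-- per-node peaks are what can add up towards n_partitions * work_mem.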
I would also like to add an entry about this issue with links to the above
pages, to the TODO page at [3], as this is the only bugtracker I'm aware
of. Am I doing it right bringing it up for approval on this list?
[3]: https://wiki.postgresql.org/wiki/Todo
Thanks,
Dimitris
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
I am fine if we want to mention that the executor may consume a large
amount of memory when these GUCs are turned ON. Users may decide to
turn those OFF if they can not afford to spend that much memory during
execution. But I don't like tying execution time consumption with
default OFF.
Would the attached address your concern about the reasons for defaulting to off?
David
Attachments:
partitionwise_docs.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 57cd7bb972..8c7a8b8645 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5567,9 +5567,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
joining the matching partitions. Partitionwise join currently applies
only when the join conditions include all the partition keys, which
must be of the same data type and have one-to-one matching sets of
- child partitions. Because partitionwise join planning can use
- significantly more CPU time and memory during planning, the default is
- <literal>off</literal>.
+ child partitions. With this setting enabled, the number of executor
+ nodes whose memory usage is restricted by <varname>work_mem</varname>
+ appearing in the final plan can increase linearly according to the
+ number of partitions being scanned. This can result in a large
+ increase in overall memory consumption during the execution of the
+ query. Query planning also becomes more expensive in terms of memory
+ and CPU. The default value is <literal>off</literal>.
</para>
</listitem>
</varlistentry>
@@ -5587,9 +5591,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
tables to be performed separately for each partition. If the
<literal>GROUP BY</literal> clause does not include the partition
keys, only partial aggregation can be performed on a per-partition
- basis, and finalization must be performed later. Because
- partitionwise grouping or aggregation can use significantly more CPU
- time and memory during planning, the default is
+ basis, and finalization must be performed later. With this setting
+ enabled, the number of executor nodes whose memory usage is restricted
+ by <varname>work_mem</varname> appearing in the final plan can
+ increase linearly according to the number of partitions being scanned.
+ This can result in a large increase in overall memory consumption
+ during the execution of the query. Query planning also becomes more
+ expensive in terms of memory and CPU. The default value is
<literal>off</literal>.
</para>
</listitem>
On Mon, Jul 29, 2024 at 10:37 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I am fine if we want to mention that the executor may consume a large
amount of memory when these GUCs are turned ON. Users may decide to
turn those OFF if they can not afford to spend that much memory during
execution. But I don't like tying execution time consumption with
default OFF.Would the attached address your concern about the reasons for defaulting to off?
Thanks. This looks better. Nitpick
+ child partitions. With this setting enabled, the number of executor
+ nodes whose memory usage is restricted by <varname>work_mem</varname>
This sentence appears to say that the memory usage of "all" nodes is
restricted by work_mem. I think what you want to convey is - nodes,
whose memory usage is subjected to <varname>work_mem</varname>
setting, ....
Or break it into two sentences
With this setting enabled, the number of executor nodes appearing in
the final plan can increase linearly proportional to the number of
partitions being scanned. Each of those nodes may use up to
<varname>work_mem</varname> memory. This can ...
I note that the work_mem documentation does not talk about executor
nodes, instead it uses the term "query operations".
--
Best Wishes,
Ashutosh Bapat
On Tue, 30 Jul 2024 at 21:12, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
Thanks. This looks better. Nitpick
+ child partitions. With this setting enabled, the number of executor
+ nodes whose memory usage is restricted by <varname>work_mem</varname>

This sentence appears to say that the memory usage of "all" nodes is
restricted by work_mem. I think what you want to convey is - nodes,
whose memory usage is subjected to <varname>work_mem</varname>
setting, ....
I'm open to improving the sentence but I don't quite follow why
"subjected to" is better than "restricted by". It seems to remove
meaning without saving any words. With "restricted by" we can deduce
that the memory cannot go over work_mem, whereas "subjected to" only
gives us some indication that the memory usage somehow relates to the
work_mem setting, and doesn't rule out the memory used being >=
work_mem.
Or break it into two sentences
With this setting enabled, the number of executor nodes appearing in
the final plan can increase linearly proportional to the number of
partitions being scanned. Each of those nodes may use up to
<varname>work_mem</varname> memory. This can ...
... but that's incorrect. This means that all additional nodes that
appear in the plan as a result of enabling the setting can use up to
work_mem. That's not the case as some of the new nodes might be
Nested Loops or Merge Joins and they're not going to use up to
work_mem.
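To sketch that with hypothetical tables (orders and order_lines, both
partitioned identically on order_id; the actual plan depends on indexes
and statistics):

SET enable_partitionwise_join = on;

-- With suitable indexes on order_id, the per-partition joins may come out
-- as parameterized Nested Loops, which don't allocate work_mem; without
-- them, the planner may choose one Hash Join per partition, each entitled
-- to up to work_mem (times hash_mem_multiplier).
EXPLAIN SELECT *
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id;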
As a compromise, I've dropped the word "executor" and it now just says
"the number of nodes whose memory usage is restricted by work_mem
appearing in the final plan". I'd have used "plan nodes" but it seems
strange to use "plan nodes" then later "in the final plan".
I note that the work_mem documentation does not talk about executor
nodes, instead it uses the term "query operations".
I much prefer "nodes" to "operations". If someone started asking me
about "query operations", I'd have to confirm what they meant. An I/O
is an operation that can occur during the execution of a query. Is
that a "query operation"? IMO, the wording you're proposing is less
concise. I'm not a fan of the terminology when talking about plan
nodes. I'd rather see the work_mem docs modified than copy what it
says.
Please have a look at: git grep -B 1 -i -F "plan node" -- doc/*
that seems like a popular term.
David
Attachments:
partitionwise_docs2.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 57cd7bb972..de22fb5090 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5567,9 +5567,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
joining the matching partitions. Partitionwise join currently applies
only when the join conditions include all the partition keys, which
must be of the same data type and have one-to-one matching sets of
- child partitions. Because partitionwise join planning can use
- significantly more CPU time and memory during planning, the default is
- <literal>off</literal>.
+ child partitions. With this setting enabled, the number of nodes
+ whose memory usage is restricted by <varname>work_mem</varname>
+ appearing in the final plan can increase linearly according to the
+ number of partitions being scanned. This can result in a large
+ increase in overall memory consumption during the execution of the
+ query. Query planning also becomes more expensive in terms of memory
+ and CPU. The default value is <literal>off</literal>.
</para>
</listitem>
</varlistentry>
@@ -5587,9 +5591,13 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
tables to be performed separately for each partition. If the
<literal>GROUP BY</literal> clause does not include the partition
keys, only partial aggregation can be performed on a per-partition
- basis, and finalization must be performed later. Because
- partitionwise grouping or aggregation can use significantly more CPU
- time and memory during planning, the default is
+ basis, and finalization must be performed later. With this setting
+ enabled, the number of nodes whose memory usage is restricted by
+ <varname>work_mem</varname> appearing in the final plan can increase
+ linearly according to the number of partitions being scanned. This
+ can result in a large increase in overall memory consumption during
+ the execution of the query. Query planning also becomes more
+ expensive in terms of memory and CPU. The default value is
<literal>off</literal>.
</para>
</listitem>
On Tue, Jul 30, 2024 at 5:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 30 Jul 2024 at 21:12, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Thanks. This looks better. Nitpick

+ child partitions. With this setting enabled, the number of executor
+ nodes whose memory usage is restricted by <varname>work_mem</varname>

This sentence appears to say that the memory usage of "all" nodes is
restricted by work_mem. I think what you want to convey is - nodes,
whose memory usage is subjected to <varname>work_mem</varname>
setting, ....

I'm open to improving the sentence but I don't quite follow why
"subjected to" is better than "restricted by". It seems to remove
meaning without saving any words. With "restricted by" we can deduce
that the memory cannot go over work_mem, whereas "subjected to" only
gives us some indication that the memory usage somehow relates to the
work_mem setting, and doesn't rule out the memory used being >=
work_mem.

Or break it into two sentences
With this setting enabled, the number of executor nodes appearing in
the final plan can increase linearly proportional to the number of
partitions being scanned. Each of those nodes may use up to
<varname>work_mem</varname> memory. This can ...

... but that's incorrect. This means that all additional nodes that
appear in the plan as a result of enabling the setting can use up to
work_mem. That's not the case as some of the new nodes might be
Nested Loops or Merge Joins and they're not going to use up to
work_mem.
Any wording which indicates that "some" of those nodes "may" use up to
"work_mem" memory "each" is fine with me. If you think that your
current wording conveys that meaning, I am ok.
I much prefer "nodes" to "operations". If someone started asking me
about "query operations", I'd have to confirm what they meant. An I/O
is an operation that can occur during the execution of a query. Is
that a "query operation"? IMO, the wording you're proposing is less
concise. I'm not a fan of the terminology when talking about plan
nodes. I'd rather see the work_mem docs modified than copy what it
says.
We need to use consistent terms in both places. Somebody reading the
new text and then referring to the work_mem description will wonder
whether "query operations" are the same thing as "executor nodes" or
not. But I get your point that a query operation doesn't necessarily
mean an operation performed by each executor node, especially when those
operations are not specified directly in the query. We can commit your
version and see if users find it confusing. Maybe they already know
that operations and nodes are interchangeable in this context.
I noticed a previous discussion about work_mem documentation [1]. But
that didn't change the first sentence in the description.
[1]: /messages/by-id/66590882-F48C-4A25-83E3-73792CF8C51F@amazon.com
--
Best Wishes,
Ashutosh Bapat