Make MemoryContextMemAllocated() more precise

Started by Jeff Davis about 6 years ago. 22 messages. pgsql-hackers.
#1 Jeff Davis
pgsql@j-davis.com

AllocSet allocates memory for itself in blocks, which double in size up
to maxBlockSize. So, the current block (the last one malloc'd) may
represent half of the total memory allocated for the context itself.

The free space at the end of that block hasn't been touched at all, and
doesn't represent fragmentation or overhead. That means that the
"allocated" memory can be 2X the memory ever touched in the worst case.

Although that's technically correct, the purpose of
MemoryContextMemAllocated() is to give a "real" usage number so we know
when we're out of work_mem and need to spill (in particular, the disk-
based HashAgg work, but ideally other operators as well). This "real"
number should include fragmentation, freed-and-not-reused chunks, and
other overhead. But it should not include significant amounts of
allocated-but-never-touched memory, which says more about economizing
calls to malloc than it does about the algorithm's memory usage.

Attached is a patch that makes mem_allocated a method (rather than a
field) of MemoryContext, and allows each memory context type to track
the memory its own way. They all do the same thing as before
(increment/decrement a field), but AllocSet also subtracts out the free
space in the current block. For Slab and Generation, we could do
something similar, but it's not as much of a problem because there's no
doubling of the allocation size.

Although I think this still matches the word "allocation" in spirit,
it's not technically correct, so feel free to suggest a new name for
MemoryContextMemAllocated().

Regards,
Jeff Davis

Attachments:

allocated.patch (text/x-patch; charset=UTF-8): +77 -28
#2 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Make MemoryContextMemAllocated() more precise

On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:

Attached is a patch that makes mem_allocated a method (rather than a
field) of MemoryContext, and allows each memory context type to track
the memory its own way. They all do the same thing as before
(increment/decrement a field), but AllocSet also subtracts out the free
space in the current block. For Slab and Generation, we could do
something similar, but it's not as much of a problem because there's no
doubling of the allocation size.

Committed.

In an off-list discussion, Andres suggested that MemoryContextStats
could be refactored to achieve this purpose, perhaps with flags to
avoid walking through the blocks and freelists when those are not
needed.

We discussed a few other names, such as "space", "active memory", and
"touched". We didn't settle on any great name, but "touched" seemed to
be the most descriptive.

This refactoring/renaming can be done later; right now I committed this
to unblock disk-based Hash Aggregation, which is ready.

Regards,
Jeff Davis

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#1)
Re: Make MemoryContextMemAllocated() more precise

On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:

Attached is a patch that makes mem_allocated a method (rather than a
field) of MemoryContext, and allows each memory context type to track
the memory its own way. They all do the same thing as before
(increment/decrement a field), but AllocSet also subtracts out the free
space in the current block. For Slab and Generation, we could do
something similar, but it's not as much of a problem because there's no
doubling of the allocation size.

Although I think this still matches the word "allocation" in spirit,
it's not technically correct, so feel free to suggest a new name for
MemoryContextMemAllocated().

Procedurally, I think that it is highly inappropriate to submit a
patch two weeks after the start of the final CommitFest and then
commit it just over 48 hours later without a single endorsement of the
change from anyone.

Substantively, I think that whether or not this is an improvement depends
considerably on how your OS handles overcommit. I do not have enough
knowledge to know whether it will be better in general, but would
welcome opinions from others.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#3)
Re: Make MemoryContextMemAllocated() more precise

On Thu, Mar 19, 2020 at 11:44:05AM -0400, Robert Haas wrote:

On Mon, Mar 16, 2020 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:

Attached is a patch that makes mem_allocated a method (rather than a
field) of MemoryContext, and allows each memory context type to track
the memory its own way. They all do the same thing as before
(increment/decrement a field), but AllocSet also subtracts out the free
space in the current block. For Slab and Generation, we could do
something similar, but it's not as much of a problem because there's no
doubling of the allocation size.

Although I think this still matches the word "allocation" in spirit,
it's not technically correct, so feel free to suggest a new name for
MemoryContextMemAllocated().

Procedurally, I think that it is highly inappropriate to submit a
patch two weeks after the start of the final CommitFest and then
commit it just over 48 hours later without a single endorsement of the
change from anyone.

True.

Substantively, I think that whether or not this is an improvement depends
considerably on how your OS handles overcommit. I do not have enough
knowledge to know whether it will be better in general, but would
welcome opinions from others.

Not sure overcommit is a major factor, and if it is then maybe it's the
strategy of doubling block size that's causing problems.

AFAICS the 2x allocation is the worst case, because it only happens
right after allocating a new block (of twice the size), when the
"utilization" drops from 100% to 50%. But in practice the utilization
will be somewhere in between, with an average of 75%. And we're not
doubling the block size indefinitely - there's an upper limit, so over
time the utilization drops less and less. So as the contexts grow, the
discrepancy disappears. And I'd argue the smaller the context, the less
of an issue the overcommit behavior is.

My understanding is that this is really just an accounting issue, where
allocating a block would get us over the limit, which I suppose might be
an issue with low work_mem values.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#4)
Re: Make MemoryContextMemAllocated() more precise

On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:

AFAICS the 2x allocation is the worst case, because it only happens
right after allocating a new block (of twice the size), when the
"utilization" drops from 100% to 50%. But in practice the utilization
will be somewhere in between, with an average of 75%.

Sort of. Hash Agg is constantly watching the memory, so it will
typically spill right at the point where the accounting for that memory
context is off by 2X.

That's mitigated because the hash table itself (the array of
TupleHashEntryData) ends up allocated as its own block, so does not
have any waste. The total (table mem + out of line) might be close to
right if the hash table array itself is a large fraction of the data,
but I don't think that's what we want.

And we're not doubling the block size indefinitely - there's an upper
limit, so over time the utilization drops less and less. So as the
contexts grow, the discrepancy disappears. And I'd argue the smaller
the context, the less of an issue the overcommit behavior is.

The problem is that the default work_mem is 4MB, and the doubling
behavior goes to 8MB, so it's a problem with default settings.

Regards,
Jeff Davis

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#4)
Re: Make MemoryContextMemAllocated() more precise

On Thu, Mar 19, 2020 at 2:11 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

My understanding is that this is really just an accounting issue, where
allocating a block would get us over the limit, which I suppose might be
an issue with low work_mem values.

Well, the issue is, if I understand correctly, that this means that
MemoryContextStats() might now report a smaller amount of memory than
what we actually allocated from the operating system. That seems like
it might lead someone trying to figure out where a backend is leaking
memory to erroneous conclusions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#3)
Re: Make MemoryContextMemAllocated() more precise

On Thu, 2020-03-19 at 11:44 -0400, Robert Haas wrote:

Procedurally, I think that it is highly inappropriate to submit a
patch two weeks after the start of the final CommitFest and then
commit it just over 48 hours later without a single endorsement of
the
change from anyone.

Reverted.

Sorry, I misjudged this as a "supporting fix for a specific problem",
but it seems others feel it requires discussion.

Substantively, I think that whether or not this is an improvement
depends considerably on how your OS handles overcommit. I do not have
enough knowledge to know whether it will be better in general, but
would welcome opinions from others.

I think omitting the tail of the current block is an unqualified
improvement for the purpose of obeying work_mem, regardless of the OS.
The block sizes keep doubling up to 8MB, and it doesn't make a lot of
sense to count that last large mostly-empty block against work_mem.

Regards,
Jeff Davis

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#7)
Re: Make MemoryContextMemAllocated() more precise

On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:

I think omitting the tail of the current block is an unqualified
improvement for the purpose of obeying work_mem, regardless of the OS.
The block sizes keep doubling up to 8MB, and it doesn't make a lot of
sense to count that last large mostly-empty block against work_mem.

Well, again, my point is that whether or not it counts depends on your
system's overcommit behavior. Depending on how you have that
configured, or what your OS likes to do, it may in reality count or
not count. Or so I believe. Am I wrong?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#5)
Re: Make MemoryContextMemAllocated() more precise

On Thu, Mar 19, 2020 at 12:25:16PM -0700, Jeff Davis wrote:

On Thu, 2020-03-19 at 19:11 +0100, Tomas Vondra wrote:

AFAICS the 2x allocation is the worst case, because it only happens
right after allocating a new block (of twice the size), when the
"utilization" drops from 100% to 50%. But in practice the utilization
will be somewhere in between, with an average of 75%.

Sort of. Hash Agg is constantly watching the memory, so it will
typically spill right at the point where the accounting for that memory
context is off by 2X.

That's mitigated because the hash table itself (the array of
TupleHashEntryData) ends up allocated as its own block, so does not
have any waste. The total (table mem + out of line) might be close to
right if the hash table array itself is a large fraction of the data,
but I don't think that's what we want.

And we're not doubling the block size indefinitely - there's an upper
limit, so over time the utilization drops less and less. So as the
contexts grow, the discrepancy disappears. And I'd argue the smaller
the context, the less of an issue the overcommit behavior is.

The problem is that the default work_mem is 4MB, and the doubling
behavior goes to 8MB, so it's a problem with default settings.

Yes, it's an issue for the accuracy of our accounting. What Robert was
talking about is overcommit behavior at the OS level, which I'm arguing
is unlikely to be an issue, because for low work_mem values the absolute
difference is small, and on large work_mem values it's limited by the
block size limit.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#6)
Re: Make MemoryContextMemAllocated() more precise

On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:

Well, the issue is, if I understand correctly, that this means that
MemoryContextStats() might now report a smaller amount of memory than
what we actually allocated from the operating system. That seems like
it might lead someone trying to figure out where a backend is leaking
memory to erroneous conclusions.

MemoryContextStats() was unaffected by my now-reverted change.

Regards,
Jeff Davis

#11 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Make MemoryContextMemAllocated() more precise

On Mon, 2020-03-16 at 11:45 -0700, Jeff Davis wrote:

Although that's technically correct, the purpose of
MemoryContextMemAllocated() is to give a "real" usage number so we know
when we're out of work_mem and need to spill (in particular, the disk-
based HashAgg work, but ideally other operators as well). This "real"
number should include fragmentation, freed-and-not-reused chunks, and
other overhead. But it should not include significant amounts of
allocated-but-never-touched memory, which says more about economizing
calls to malloc than it does about the algorithm's memory usage.

To expand on this point:

work_mem is to keep executor algorithms somewhat constrained in the
memory that they use. With that in mind, it should reflect things that
the algorithm has some control over, and can be measured cheaply.

Therefore, we shouldn't include huge nearly-empty blocks of memory that
the system decided to allocate in response to a request for a small
chunk (the algorithm has little control). Nor should we try to walk a
list of blocks or free chunks (expensive).

We should include used memory, reasonable overhead (chunk headers,
alignment, etc.), and probably free chunks (which represent
fragmentation).

For AllocSet, the notion of "all touched memory", which is everything
except the current block's tail, seems to fit the requirements well.

Regards,
Jeff Davis

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#10)
Re: Make MemoryContextMemAllocated() more precise

On Thu, Mar 19, 2020 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2020-03-19 at 15:26 -0400, Robert Haas wrote:

Well, the issue is, if I understand correctly, that this means that
MemoryContextStats() might now report a smaller amount of memory than
what we actually allocated from the operating system. That seems like
it might lead someone trying to figure out where a backend is leaking
memory to erroneous conclusions.

MemoryContextStats() was unaffected by my now-reverted change.

Oh. Well, that addresses my concern, then. If this only affects the
accounting for memory-bounded hash aggregation and nothing else is
going to be touched, including MemoryContextStats(), then it's not an
issue for me.

Other people may have different concerns, but that was the only thing
that was bothering me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#8)
Re: Make MemoryContextMemAllocated() more precise

On Thu, 2020-03-19 at 15:33 -0400, Robert Haas wrote:

On Thu, Mar 19, 2020 at 3:27 PM Jeff Davis <pgsql@j-davis.com> wrote:

I think omitting the tail of the current block is an unqualified
improvement for the purpose of obeying work_mem, regardless of the OS.
The block sizes keep doubling up to 8MB, and it doesn't make a lot of
sense to count that last large mostly-empty block against work_mem.

Well, again, my point is that whether or not it counts depends on your
system's overcommit behavior. Depending on how you have that
configured, or what your OS likes to do, it may in reality count or
not count. Or so I believe. Am I wrong?

I don't believe it should be counted for the purposes of work_mem.

Let's say that the OS eagerly allocates it, then what is the algorithm
supposed to do in response? It can either:

1. just accept that all of the space is used, even though it's
potentially as low as 50% used, and spill earlier than may be
necessary; or

2. find a way to measure the free space, and somehow predict whether
that space will be reused the next time a group is added to the hash
table

It just doesn't seem reasonable to me for the algorithm to change its
behavior based on these large block allocations. It may be valuable
information for other purposes (like tuning your OS, or tracking down
memory issues), though.

Regards,
Jeff Davis

#14 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#12)
Re: Make MemoryContextMemAllocated() more precise

On Thu, 2020-03-19 at 16:04 -0400, Robert Haas wrote:

Other people may have different concerns, but that was the only thing
that was bothering me.

OK, thank you for raising it.

Perhaps we can re-fix the issue for HashAgg if necessary, or I can
tweak some accounting things within HashAgg itself, or both.

Regards,
Jeff Davis

#15 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#2)
Re: Make MemoryContextMemAllocated() more precise

On Wed, 2020-03-18 at 15:41 -0700, Jeff Davis wrote:

In an off-list discussion, Andres suggested that MemoryContextStats
could be refactored to achieve this purpose, perhaps with flags to
avoid walking through the blocks and freelists when those are not
needed.

Attached refactoring patch. There's enough in here that warrants
discussion that I don't think this makes sense for v13 and I'm adding
it to the July commitfest.

I still think we should do something for v13, such as the originally-
proposed patch[1]. It's not critical, but it simply reports a better
number for memory consumption. Currently, the memory usage appears to
jump, often right past work_mem (by a reasonable but noticeable
amount), which could be confusing.

Regarding the attached patch (target v14):

* there's a new MemoryContextCount() that simply calculates the
statistics without printing anything, and returns a struct
- it supports flags to indicate which stats should be
calculated, so that some callers can avoid walking through
blocks/freelists
* it adds a new statistic for "new space" (i.e. untouched)
* it eliminates specialization of the memory context printing
- the only specialization was for generation.c to output the
number of chunks, which can be done easily enough for the
other types, too

Regards,
Jeff Davis

[1]: /messages/by-id/ec63d70b668818255486a83ffadc3aec492c1f57.camel@j-davis.com

Attachments:

mcxt.patch (text/x-patch; charset=UTF-8): +264 -223
#16 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#15)
Re: Make MemoryContextMemAllocated() more precise

Hi,

On 2020-03-27 17:21:10 -0700, Jeff Davis wrote:

Attached refactoring patch. There's enough in here that warrants
discussion that I don't think this makes sense for v13 and I'm adding
it to the July commitfest.

IDK, adding a commit to v13 that we know we should do architecturally
differently in v14, when the difference in complexity between the two
patches isn't actually *that* big...

I'd like to see others jump in here...

I still think we should do something for v13, such as the originally-
proposed patch[1]. It's not critical, but it simply reports a better
number for memory consumption. Currently, the memory usage appears to
jump, often right past work_mem (by a reasonable but noticeable amount),
which could be confusing.

Is that really a significant issue for most work_mem sizes? Shouldn't
the way we increase sizes lead to the max difference between the
measurements being somewhat limited?

* there's a new MemoryContextCount() that simply calculates the
statistics without printing anything, and returns a struct
- it supports flags to indicate which stats should be
calculated, so that some callers can avoid walking through
blocks/freelists
* it adds a new statistic for "new space" (i.e. untouched)
* it eliminates specialization of the memory context printing
- the only specialization was for generation.c to output the
number of chunks, which can be done easily enough for the
other types, too

That sounds like a good direction.

+	if (flags & MCXT_STAT_NBLOCKS)
+		counters.nblocks = nblocks;
+	if (flags & MCXT_STAT_NCHUNKS)
+		counters.nchunks = set->nChunks;
+	if (flags & MCXT_STAT_FREECHUNKS)
+		counters.freechunks = freechunks;
+	if (flags & MCXT_STAT_TOTALSPACE)
+		counters.totalspace = set->memAllocated;
+	if (flags & MCXT_STAT_FREESPACE)
+		counters.freespace = freespace;
+	if (flags & MCXT_STAT_NEWSPACE)
+		counters.newspace = set->blocks->endptr - set->blocks->freeptr;

I'd spec it so that context implementations are allowed to
unconditionally fill fields, even when the flag isn't specified. The
branches quoted don't buy us anything...

diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index c9f2bbcb367..cc545852968 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -29,11 +29,21 @@
typedef struct MemoryContextCounters
{
Size		nblocks;		/* Total number of malloc blocks */
+	Size		nchunks;		/* Total number of chunks (used+free) */
Size		freechunks;		/* Total number of free chunks */
Size		totalspace;		/* Total bytes requested from malloc */
Size		freespace;		/* The unused portion of totalspace */
+	Size		newspace;		/* Allocated but never held any chunks */

I'd add some reasoning as to why this is useful.

} MemoryContextCounters;

+#define MCXT_STAT_NBLOCKS		(1 << 0)
+#define MCXT_STAT_NCHUNKS		(1 << 1)
+#define MCXT_STAT_FREECHUNKS	(1 << 2)
+#define MCXT_STAT_TOTALSPACE	(1 << 3)
+#define MCXT_STAT_FREESPACE		(1 << 4)
+#define MCXT_STAT_NEWSPACE		(1 << 5)

s/MCXT_STAT/MCXT_STAT_NEED/?

+#define MCXT_STAT_ALL ((1 << 6) - 1)

Hm, why not go for ~0 or such?

Greetings,

Andres Freund

#17 David Rowley
dgrowleyml@gmail.com
In reply to: Jeff Davis (#15)
Re: Make MemoryContextMemAllocated() more precise

On Sat, 28 Mar 2020 at 13:21, Jeff Davis <pgsql@j-davis.com> wrote:

Attached refactoring patch. There's enough in here that warrants
discussion that I don't think this makes sense for v13 and I'm adding
it to the July commitfest.

I had a read over this too. I noted down the following during my pass of it.

1. The comment mentions "passthru", but you've removed that parameter.

  * For now, the passthru pointer just points to "int level"; later we might
  * make that more complicated.
  */
 static void
-MemoryContextStatsPrint(MemoryContext context, void *passthru,
+MemoryContextStatsPrint(MemoryContext context, int level,
  const char *stats_string)

2. I don't think MemoryContextCount is the best name for this
function. When I saw:

counters = MemoryContextCount(aggstate->hash_metacxt, flags, true);

I assumed it was returning some integer number, that is, until I looked
at the "counters" datatype.

Maybe it would be better to name the function
MemoryContextGetTelemetry and the struct MemoryContextTelemetry rather
than MemoryContextCounters? Or maybe MemoryContextTally and call the
function either MemoryContextGetTelemetry() or
MemoryContextGetTally(). Or perhaps MemoryContextGetAccounting() and
the struct MemoryContextAccounting.

3. I feel like it would be nicer if you didn't change the "count"
methods to return a MemoryContextCounters. It means you may need to
zero a struct for each level, assign the values, then add them to the
total. If you were just to zero the struct in MemoryContextCount()
then pass it into the count function, then you could just have it do
all the += work. It would reduce the code in MemoryContextCount() too.

4. Do you think it would be better to have two separate functions for
MemoryContextCount(), a recursive version and a non-recursive version.
I feel that the function should be so simple that it does not make a
great deal of sense to load it up to handle both cases. Looking
around mcxt.c, I see MemoryContextResetOnly() and
MemoryContextResetChildren(), while that's not a perfect example, it
does seem like a better lead to follow.

5. For performance testing, I tried using the following table with 1MB
work_mem then again with 1GB work_mem. I wondered if making the
accounting more complex would cause a slowdown in nodeAgg.c, as I see
we're calling this function each time we get a tuple that belongs in a
new group. The 1MB test is likely not such a great way to measure this
since we do spill to disk in that case and the changing in accounting
means we do that at a slightly different time, but the 1GB test does
not spill and it's a bit slower.

create table t (a int);
insert into t select x from generate_series(1,10000000) x;
analyze t;

-- Unpatched

set work_mem = '1MB';
explain analyze select a from t group by a; -- Ran 3 times.

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=481750.10..659875.95 rows=10000048 width=4)
(actual time=3350.190..8193.400 rows=10000000 loops=1)
Group Key: a
Planned Partitions: 32
Peak Memory Usage: 1177 kB
Disk Usage: 234920 kB
HashAgg Batches: 1188
-> Seq Scan on t (cost=0.00..144248.48 rows=10000048 width=4)
(actual time=0.013..1013.755 rows=10000000 loops=1)
Planning Time: 0.131 ms
Execution Time: 8586.420 ms
Execution Time: 8446.961 ms
Execution Time: 8449.492 ms
(9 rows)

-- Patched

set work_mem = '1MB';
explain analyze select a from t group by a;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=481748.00..659873.00 rows=10000000 width=4)
(actual time=3470.107..8598.836 rows=10000000 loops=1)
Group Key: a
Planned Partitions: 32
Peak Memory Usage: 1033 kB
Disk Usage: 234816 kB
HashAgg Batches: 1056
-> Seq Scan on t (cost=0.00..144248.00 rows=10000000 width=4)
(actual time=0.017..1091.820 rows=10000000 loops=1)
Planning Time: 0.285 ms
Execution Time: 8996.824 ms
Execution Time: 8781.624 ms
Execution Time: 8900.324 ms
(9 rows)

-- Unpatched

set work_mem = '1GB';
explain analyze select a from t group by a;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=169248.00..269248.00 rows=10000000 width=4)
(actual time=4537.779..7033.318 rows=10000000 loops=1)
Group Key: a
Peak Memory Usage: 868369 kB
-> Seq Scan on t (cost=0.00..144248.00 rows=10000000 width=4)
(actual time=0.018..820.136 rows=10000000 loops=1)
Planning Time: 0.054 ms
Execution Time: 7561.063 ms
Execution Time: 7573.985 ms
Execution Time: 7572.058 ms
(6 rows)

-- Patched

set work_mem = '1GB';
explain analyze select a from t group by a;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=169248.00..269248.00 rows=10000000 width=4)
(actual time=4840.045..7359.970 rows=10000000 loops=1)
Group Key: a
Peak Memory Usage: 861975 kB
-> Seq Scan on t (cost=0.00..144248.00 rows=10000000 width=4)
(actual time=0.018..789.975 rows=10000000 loops=1)
Planning Time: 0.055 ms
Execution Time: 7904.069 ms
Execution Time: 7913.692 ms
Execution Time: 7927.061 ms
(6 rows)

Perhaps the slowdown is unrelated. If it is, then maybe the reduction
in branching mentioned by Andres might help a bit, plus maybe a bit
from what I mentioned above about passing in the counter struct
instead of returning it at each level.

David

#18 Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#17)
Re: Make MemoryContextMemAllocated() more precise

On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:

1. The comment mentions "passthru", but you've removed that
parameter.

Fixed, thank you.

2. I don't think MemoryContextCount is the best name for this
function. When I saw:

I've gone back and forth on naming a bit. The right name, in my
opinion, is MemoryContextStats(), but that's taken by something that
should be called MemoryContextReport(). But I didn't want to change
that as it would probably annoy a lot of people who are used to calling
MemoryContextStats() from gdb.

I changed the new function to be called MemoryContextGetCounters(),
which is more directly what it's doing. "Telemetry" makes me think more
of a stream of information rather than a particular point in time.

3. I feel like it would be nicer if you didn't change the "count"
methods to return a MemoryContextCounters. It means you may need to
zero a struct for each level, assign the values, then add them to the
total. If you were just to zero the struct in MemoryContextCount()
then pass it into the count function, then you could just have it do
all the += work. It would reduce the code in MemoryContextCount()
too.

I changed it to use a pointer out parameter, but I don't think an
in/out parameter is quite right there. MemoryContextStats() ends up
using both the per-context counters as well as the totals, so it's not
helpful to return just the totals.

4. Do you think it would be better to have two separate functions for
MemoryContextCount(), a recursive version and a non-recursive version.

I could, but right now the only caller passes recurse=true, so I'm not
really eliminating any code in that path by specializing the functions.
Are you thinking about performance or you just think it would be better
to have two entry points?

5. For performance testing, I tried using the following table with 1MB
work_mem then again with 1GB work_mem. I wondered if making the
accounting more complex would cause a slowdown in nodeAgg.c, as I see
we're calling this function each time we get a tuple that belongs in a
new group. The 1MB test is likely not such a great way to measure this
since we do spill to disk in that case and the change in accounting
means we do that at a slightly different time, but the 1GB test does
not spill and it's a bit slower.

I think this was because I was previously returning a struct. By just
passing a pointer as an out param, it seems to have mitigated it, but
not completely eliminated the cost (< 2% in my tests). I was using an
OFFSET 100000000 instead of EXPLAIN ANALYZE in my test measurements
because it was less noisy, and I focused on the 1GB test for the reason
you mention.

I also addressed Andres's comments:

* changed the name of the flags from MCXT_STAT to MCXT_COUNT
* changed ((1<<6)-1) to ~0
* removed unnecessary branches from the GetCounters method
* expanded some comments

Regards,
Jeff Davis

Attachments:

mcxt-20200406.patch (text/x-patch; charset=UTF-8): +264 -226
#19 Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#16)
Re: Make MemoryContextMemAllocated() more precise

On Sun, 2020-04-05 at 16:48 -0700, Andres Freund wrote:

I still think we should do something for v13, such as the originally-
proposed patch[1]. It's not critical, but it simply reports a better
number for memory consumption. Currently, the memory usage appears to
jump, often right past work_mem (by a reasonable but noticeable
amount), which could be confusing.

Is that really a significant issue for most work_mem sizes? Shouldn't
the way we increase sizes lead to the max difference between the
measurements being somewhat limited?

For work_mem less than 16MB, it's essentially spilling when the table
context is about 75% of what it could be (as bad as 50% and as good as
100% depending on a number of factors). That's not terrible, but it is
significant.

It also means that the reported memory peak jumps up rather than going
up gradually, so it ends up surpassing work_mem (e.g. 4MB of work_mem
might show a memory peak of 5MB). So it's a weird combination of
under-utilizing and over-reporting.

I'd spec it so that context implementations are allowed to
unconditionally fill fields, even when the flag isn't specified. The
branches quoted don't buy us anything...

Done.

I'd add some reasoning as to why this is useful.

Done.

s/MCXT_STAT/MCXT_STAT_NEED/?

I changed to MCXT_COUNT_. MCXT_STAT_NEED seemed slightly verbose to
me.

+#define MCXT_STAT_ALL ((1 << 6) - 1)

Hm, why not go for ~0 or such?

Done.

Regards,
Jeff Davis

#20 David Rowley
dgrowleyml@gmail.com
In reply to: Jeff Davis (#18)
Re: Make MemoryContextMemAllocated() more precise

On Tue, 7 Apr 2020 at 14:26, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2020-04-06 at 23:39 +1200, David Rowley wrote:

4. Do you think it would be better to have two separate functions for
MemoryContextCount(), a recursive version and a non-recursive version.

I could, but right now the only caller passes recurse=true, so I'm not
really eliminating any code in that path by specializing the functions.
Are you thinking about performance or you just think it would be better
to have two entry points?

I was thinking partly of performance, by eliminating a branch, but
mostly it was down to ease of code reading.

I thought it was easier to read
MemoryContextGetCountersRecurse(&counters); than
MemoryContextGetCounters(&counters, true); since I might need to go
see what "true" is for.

The non-recursive version, if we decide we need one, would likely just
be a one-line function calling the implementation's getcounter method.

David

#21 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
#22 Daniel Gustafsson
daniel@yesql.se
In reply to: Jeff Davis (#21)