9.5: Better memory accounting, towards memory-bounded HashAgg

Started by Jeff Davis over 11 years ago, 54 messages, pgsql-hackers
#1 Jeff Davis
pgsql@j-davis.com

Attached is a patch that explicitly tracks allocated memory (the blocks,
not the chunks) for each memory context, as well as its children.

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
the hash table, which is a good estimate for Hashjoin. But I don't think
it's as easy for Hashagg, for which we need to track transition values,
etc. (also, for HashAgg, I expect that the overhead will be more
significant than for Hashjoin). If we track the space used by the memory
contexts directly, it's easier and more accurate.
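The approach described above, in which each context counts the blocks it allocates and propagates the delta up through its parents so a subtree total is available in O(1), can be sketched roughly as follows. This is an illustration with invented names, not the patch itself:

```c
#include <stddef.h>

/* Toy stand-in for MemoryContextData; field names are illustrative. */
typedef struct MemCtx
{
    struct MemCtx *parent;
    size_t         self_allocated;   /* bytes in blocks owned directly */
    size_t         total_allocated;  /* self plus all descendants */
} MemCtx;

/* Called when a block is malloc'd (delta > 0) or freed (delta < 0);
 * the update walks upward so every ancestor's subtree total stays
 * current and can be read without traversing the tree. */
static void
update_allocation(MemCtx *ctx, long delta)
{
    ctx->self_allocated += delta;
    for (MemCtx *c = ctx; c != NULL; c = c->parent)
        c->total_allocated += delta;
}
```

The point of tracking blocks rather than chunks is that the accounting only has to run on the rare malloc/free path, not on every palloc.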

I did some simple pgbench select-only tests, and I didn't see any TPS
difference.

Regards,
Jeff Davis

Attachments:

memory-accounting.patch (text/x-patch, +58 -0)
#2 Peter Geoghegan
In reply to: Jeff Davis (#1)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Sat, Aug 2, 2014 at 1:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF.

FWIW, I think that's a good project. A large number of these TPC-H
queries used HashAggs when I checked, on a moderate sized sample TPC-H
database:

http://examples.citusdata.com/tpch_queries.html

(I found these queries at random from Googling, but happened to have a
~2GB TPC-H database on my laptop). I attach EXPLAIN ANALYZE output for
each, as shown on master. From this admittedly unscientific random
sampling, 5 out of 8 query plans have a hash aggregate node. TPC-H is
a benchmark that Postgres does not tend to do too well on [1], and I
suspect that this has something to do with it; lower work_mem settings
will spook the optimizer into using a group aggregate within
choose_hashed_grouping(). Of course, in order to get the benefit of
your patch, that will need to be adjusted. I think that part is
surprisingly straightforward, though.

[1]: https://wiki.postgresql.org/wiki/TPC-H
--
Peter Geoghegan

Attachments:

tpc-h-plans.txt (text/plain)
#3 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#1)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Sat, Aug 2, 2014 at 4:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:

Attached is a patch that explicitly tracks allocated memory (the blocks,
not the chunks) for each memory context, as well as its children.

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
the hash table, which is a good estimate for Hashjoin. But I don't think
it's as easy for Hashagg, for which we need to track transition values,
etc. (also, for HashAgg, I expect that the overhead will be more
significant than for Hashjoin). If we track the space used by the memory
contexts directly, it's easier and more accurate.

I did some simple pgbench select-only tests, and I didn't see any TPS
difference.

I was curious whether a performance difference would show up when
sorting, so I tried it out. I set up a test with pgbench -i 300. I
then repeatedly restarted the database, and after each restart, did
this:

time psql -c 'set trace_sort=on; reindex index pgbench_accounts_pkey;'

I alternated runs between master and master with this patch, and got
the following results:

master:
LOG: internal sort ended, 1723933 KB used: CPU 2.58s/11.54u sec
elapsed 16.88 sec
LOG: internal sort ended, 1723933 KB used: CPU 2.50s/12.37u sec
elapsed 17.60 sec
LOG: internal sort ended, 1723933 KB used: CPU 2.14s/11.28u sec
elapsed 16.11 sec

memory-accounting:
LOG: internal sort ended, 1723933 KB used: CPU 2.57s/11.97u sec
elapsed 17.39 sec
LOG: internal sort ended, 1723933 KB used: CPU 2.30s/12.57u sec
elapsed 17.68 sec
LOG: internal sort ended, 1723933 KB used: CPU 2.54s/11.99u sec
elapsed 17.25 sec

Comparing the median times, that's about a 3% regression. For this
particular case, we might be able to recapture that by replacing the
bespoke memory-tracking logic in tuplesort.c with use of this new
facility. I'm not sure whether there are other cases that we might
also want to test; I think stuff that runs all on the server side is
likely to show up problems more clearly than pgbench. Maybe a
PL/pgsql loop that does something allocation-intensive on each
iteration, for example, like parsing a big JSON document.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#1)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 2.8.2014 22:40, Jeff Davis wrote:

Attached is a patch that explicitly tracks allocated memory (the blocks,
not the chunks) for each memory context, as well as its children.

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF.

Anyway, I'm really looking forward to the memory-bounded hashagg, and
I'm willing to spend some time testing it.

Hashjoin tracks the tuple sizes that it adds to
the hash table, which is a good estimate for Hashjoin. But I don't think

Actually, even for HashJoin the estimate is pretty bad, and varies a lot
depending on the tuple width. With "narrow" tuples (e.g. ~40B of data),
which is actually quite a common case, you easily get ~100% palloc overhead.

I managed to address that by using a custom allocator. See this:

https://commitfest.postgresql.org/action/patch_view?id=1503

I wonder whether something like that would be possible for the hashagg?

That would make the memory accounting accurate with 0% overhead (because
it's not messing with the memory context at all), but it only works for
the one node (but maybe that's OK?).

it's as easy for Hashagg, for which we need to track transition values,
etc. (also, for HashAgg, I expect that the overhead will be more
significant than for Hashjoin). If we track the space used by the memory
contexts directly, it's easier and more accurate.

I don't think that's comparable - I can easily think about cases leading
to extreme palloc overhead with HashAgg (think of an aggregate
implementing COUNT DISTINCT - that effectively needs to store all the
values, and with short values the palloc overhead will be terrible).

Actually, I was looking at HashAgg (it's a somewhat natural direction
after messing with Hash Join), and my plan was to use a similar dense
allocation approach. The trickiest part would probably be making this
available to the custom aggregates.

regards
Tomas

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#3)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote:

Comparing the median times, that's about a 3% regression. For this
particular case, we might be able to recapture that by replacing the
bespoke memory-tracking logic in tuplesort.c with use of this new
facility. I'm not sure whether there are other cases that we might
also want to test; I think stuff that runs all on the server side is
likely to show up problems more clearly than pgbench. Maybe a
PL/pgsql loop that does something allocation-intensive on each
iteration, for example, like parsing a big JSON document.

I wasn't able to reproduce your results on my machine. At -s 300, with
maintenance_work_mem set high enough to do internal sort, it took about
40s and I heard some disk activity, so I didn't think it was a valid
result. I went down to -s 150, and it took around 5.3s on both master
and memory-accounting.

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.

Regards,
Jeff Davis

Attachments:

memory-accounting-v2.patch (text/x-patch, +106 -9)
#6 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#5)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.

There was a problem with the previous patch: the parent was unlinked
before the delete_context method was called, which meant that the
parent's memory was not being decremented.

Attached is a fix. It would be simpler to just reorder the operations in
MemoryContextDelete, but there is a comment warning against that. So, I
pass the old parent as an argument to the delete_context method.
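The shape of that fix, sketched with toy types standing in for the real ones (the actual patch touches MemoryContextDelete and the AllocSet delete method):

```c
#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *parent;
    size_t      total_allocated;
} Ctx;

/* By the time the delete method runs, the context has already been
 * unlinked and ctx->parent is no longer usable, so the old parent is
 * passed in explicitly and the subtree's bytes are subtracted from
 * each ancestor's running total. */
static void
delete_context(Ctx *ctx, Ctx *old_parent)
{
    for (Ctx *p = old_parent; p != NULL; p = p->parent)
        p->total_allocated -= ctx->total_allocated;
    ctx->total_allocated = 0;
}
```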

Regards,
Jeff Davis

Attachments:

memory-accounting-v3.patch (text/x-patch, +127 -19)
#7 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#5)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Fri, Aug 8, 2014 at 4:16 AM, Jeff Davis <pgsql@j-davis.com> wrote:

I wasn't able to reproduce your results on my machine. At -s 300, with
maintenance_work_mem set high enough to do internal sort, it took about
40s and I heard some disk activity, so I didn't think it was a valid
result. I went down to -s 150, and it took around 5.3s on both master
and memory-accounting.

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.

I repeated my tests with your v3 patch. Here are the new results:

master, as of commit a4287a689d10bd4863e3dfbf9c4f232feeca0cdd:
2014-08-14 16:45:24 UTC [2940] LOG: internal sort ended, 1723933 KB
used: CPU 2.44s/11.52u sec elapsed 16.68 sec
2014-08-14 16:46:34 UTC [2960] LOG: internal sort ended, 1723933 KB
used: CPU 2.63s/11.65u sec elapsed 16.94 sec
2014-08-14 16:47:26 UTC [2979] LOG: internal sort ended, 1723933 KB
used: CPU 2.63s/11.48u sec elapsed 16.85 sec

memory-accounting-v3, on top of the aforementioned master commit:
2014-08-14 16:46:05 UTC [2950] LOG: internal sort ended, 1723933 KB
used: CPU 2.52s/12.16u sec elapsed 17.36 sec
2014-08-14 16:47:00 UTC [2969] LOG: internal sort ended, 1723933 KB
used: CPU 2.52s/11.90u sec elapsed 17.11 sec
2014-08-14 16:47:51 UTC [2988] LOG: internal sort ended, 1723933 KB
used: CPU 2.52s/11.98u sec elapsed 17.31 sec

It appears to me that the performance characteristics for this version
are not significantly different from version 1. I have not looked at
the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#6)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 10.8.2014 22:50, Jeff Davis wrote:

On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:

Either way, it's better to be conservative. Attached is a version
of the patch with opt-in memory usage tracking. Child contexts
inherit the setting from their parent.

There was a problem with the previous patch: the parent was unlinked
before the delete_context method was called, which meant that the
parent's memory was not being decremented.

Attached is a fix. It would be simpler to just reorder the operations
in MemoryContextDelete, but there is a comment warning against that.
So, I pass the old parent as an argument to the delete_context
method.

Hi,

I did a quick check of the patch, mostly because I wasn't sure whether
it allows accounting for a selected subtree only (e.g. aggcontext and
its children). And yes, it does exactly that, which is great.

Also, the code seems fine to me, except maybe for this piece in
AllocSetDelete:

    /*
     * Parent is already unlinked from context, so can't use
     * update_allocation().
     */
    while (parent != NULL)
    {
        parent->total_allocated -= context->total_allocated;
        parent = parent->parent;
    }

I believe this should check parent->track_mem, just like
update_allocation, because this way it walks all the parent contexts up
to the TopMemoryContext.

It does not really break anything (because parents without enabled
tracking don't report allocated space), but for plans
creating/destroying a lot of contexts (say, GroupAggregate with a lot of
groups) this might unnecessarily add overhead.
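In other words, the loop could stop as soon as it reaches a parent with tracking disabled, mirroring what update_allocation does. A minimal illustration with toy types (not the patch code):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *parent;
    bool        track_mem;
    size_t      total_allocated;
} Ctx;

/* Subtract a deleted context's bytes from its ancestors, but stop at
 * the first non-tracking parent instead of always climbing all the way
 * up toward TopMemoryContext. */
static void
subtract_from_ancestors(Ctx *parent, size_t bytes)
{
    while (parent != NULL && parent->track_mem)
    {
        parent->total_allocated -= bytes;
        parent = parent->parent;
    }
}
```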

regards
Tomas

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#8)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

Tomas Vondra <tv@fuzzy.cz> writes:

I believe this should check parent->track_mem, just like
update_allocation, because this way it walks all the parent contexts up
to the TopMemoryContext.

TBH, I don't think this "opt-in tracking" business has been thought
through even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared
to track? And it certainly has not been proven that the extra
complexity actually saves cycles rather than the reverse.

regards, tom lane

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 16.8.2014 20:00, Tom Lane wrote:

Tomas Vondra <tv@fuzzy.cz> writes:

I believe this should check parent->track_mem, just like
update_allocation, because this way it walks all the parent contexts up
to the TopMemoryContext.

TBH, I don't think this "opt-in tracking" business has been thought
through even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared to
track? And it certainly has not been proven that the extra complexity
actually saves cycles rather than the reverse.

AFAIK there's no other built-in accounting mechanism at the moment.
That's one of the reasons why some of the nodes had to invent their own
custom accounting - hashjoin is an example of that.

We do have MemoryContextStats, which essentially walks the memory
context hierarchy and collects all the important information. While
working on the hashjoin patch, I was thinking that maybe we could create
a MemoryContextStats clone, computing the accounting info on request.

So I did an experiment, writing this new MemoryContextStats clone and
rewriting nodeHash.c to use this instead of custom tracking. The
impact was consistently ~2-3%, for the hashjoin queries I measured (e.g.
9100 ms instead of 8900 ms etc.).

I just tried the same with the hashjoin patch applied. With the built-in
accounting the queries now run ~2.5x faster - around 3500ms. With the
MemoryContextStats clone, it takes a whopping 18000 ms.

I don't know what causes such an increase (I assume I use the accounting
info within a loop or something). The point is that with Jeff's patch,
it goes down to 3500 ms, with pretty much zero overhead.

I'll check why the patched hashjoin is so much slower, but even if I can
optimize it somehow I don't think I'll get it to 0 overhead. So I'd say
this accounting actually saves cycles.

Of course, that does not prove it's a well thought-out design,
although it seems sensible to me. It's the least complex accounting
implementation I can think of.

For example, I don't think being able to create nonsensical hierarchies
is a major flaw. We certainly should not do that in our code, and we
should take reasonable precautions to prevent this in user code (as
for example inheriting the tracking flag in MemoryContextCreate, as Jeff
does). Can this be seen as a foot gun? Maybe, but we're handing
ammunition to users on a daily basis - for example they can switch to
TopMemoryContext, but that does not make memory contexts bad.

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect this. Because my observation is that most of the
time is actually spent in walking through blocks and freelists.

Then the issues with nonsensical hierarchies would disappear, and also
the possible overhead with updating all parent hierarchies would go away
(not sure if it's even measurable, though).
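That alternative, purely local counters summed on demand by walking the context tree rather than the blocks and freelists, might look roughly like this (invented names, just to make the idea concrete):

```c
#include <stddef.h>

typedef struct Ctx
{
    struct Ctx *firstchild;     /* head of child list */
    struct Ctx *nextchild;      /* sibling link */
    size_t      self_allocated; /* maintained locally, never propagated */
} Ctx;

/* Sum a subtree on request; the cost is proportional to the number of
 * contexts, which is typically tiny compared to the number of blocks
 * and freelist entries that MemoryContextStats has to walk. */
static size_t
ctx_get_allocated(const Ctx *ctx)
{
    size_t total = ctx->self_allocated;
    for (const Ctx *c = ctx->firstchild; c != NULL; c = c->nextchild)
        total += ctx_get_allocated(c);
    return total;
}
```

The trade-off discussed later in the thread: this makes allocation cheap but makes the query expensive when a node has very many child contexts (e.g. array_agg with one context per group).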

regards
Tomas

#11 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#7)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote:

It appears to me that the performance characteristics for this version
are not significantly different from version 1. I have not looked at
the code.

While trying to reproduce your results, I noticed what might be around a
1% regression just from adding the 3 fields to MemoryContextData. If I
cut it down to adding just one field, the regression disappears.

The results are fairly noisy, so I could be chasing the wrong thing. But
one reason to believe it is that I pushed the size of MemoryContextData
above 64 bytes, which sounds like it might be an important threshold.

Regards,
Jeff Davis

#12 Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#10)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect this. Because my observation is that most of the
time is actually spent in walking through blocks and freelists.

That makes a lot of sense to me.

Another approach is to pass a flag to hash_create that tells it not to
create a subcontext. Then we don't need to traverse anything; we already
know which context we want to look at. Perhaps I was being too clever
with the idea of tracking space for an entire hierarchy.

Also, as I pointed out in my reply to Robert, adding too many fields to
MemoryContextData may be the cause of the regression. Your idea requires
only one field, which doesn't show the same regression in my tests.

Regards,
Jeff Davis

#13 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#12)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 19 August 2014, 10:26, Jeff Davis wrote:

On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect this. Because my observation is that most of the
time is actually spent in walking through blocks and freelists.

That makes a lot of sense to me.

Another approach is to pass a flag to hash_create that tells it not to
create a subcontext. Then we don't need to traverse anything; we already
know which context we want to look at. Perhaps I was being too clever
with the idea of tracking space for an entire hierarchy.

Also, as I pointed out in my reply to Robert, adding too many fields to
MemoryContextData may be the cause of the regression. Your idea requires
only one field, which doesn't show the same regression in my tests.

Yeah, keeping the structure size below 64B seems like a good idea.

The use-case for this is tracking a chosen subtree of contexts - e.g.
aggcontext and below, so I'd expect the tracked subtrees to be relatively
shallow. Am I right?

My fear is that by removing the inheritance bit, we'll hurt cases with a
lot of child contexts. For example, array_agg currently creates a separate
context for each group - what happens if you have 100k groups and do
MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Wouldn't the solution with inheritance and propagating the accounting info
to the parent actually be better? Or maybe allowing both, having two flags
when creating a context instead of one?

AllocSetCreateTracked( ...., bool track, bool propagate_immediately)

By squashing both flags into a single mask you wouldn't increase the size.
Also, do we really need to track allocated bytes - couldn't we track
kilobytes or something and use smaller data types to get below the 64B?

regards
Tomas

#14 Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#13)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:

The use-case for this is tracking a chosen subtree of contexts - e.g.
aggcontext and below, so I'd expect the tracked subtrees to be relatively
shallow. Am I right?

Right.

My fear is that by removing the inheritance bit, we'll hurt cases with a
lot of child contexts. For example, array_agg currently creates a separate
context for each group - what happens if you have 100k groups and do
MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Good point. We don't want to make checking the allocated space into an
expensive operation, because then we will be forced to call it less
frequently, which implies that we'd be sloppier about actually fitting
in work_mem.

Wouldn't the solution with inheritance and propagating the accounting info
to the parent actually be better? Or maybe allowing both, having two flags
when creating a context instead of one?

That seems overly complicated. We only have one driving use case, so two
orthogonal options sound excessive. Perhaps one option if we can't
solve the performance problem and we need to isolate the changes to
hashagg.

Also, do we really need to track allocated bytes - couldn't we track
kilobytes or something and use smaller data types to get below the 64B?

Good idea.

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

Regards,
Jeff Davis

Attachments:

memory-accounting-v4.patch (text/x-patch, +83 -10)
#15 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#14)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Wed, Aug 20, 2014 at 2:11 AM, Jeff Davis <pgsql@j-davis.com> wrote:

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.

Still doesn't look good here. On the same PPC64 machine I've been
using for testing:

master:
2014-08-22 16:26:42 UTC [13153] LOG: internal sort ended, 1723933 KB
used: CPU 2.19s/11.41u sec elapsed 16.40 sec
2014-08-22 16:28:18 UTC [13173] LOG: internal sort ended, 1723933 KB
used: CPU 2.56s/11.37u sec elapsed 16.64 sec
2014-08-22 16:29:37 UTC [13194] LOG: internal sort ended, 1723933 KB
used: CPU 2.57s/11.48u sec elapsed 16.75 sec

memory-accounting-v4:
2014-08-22 16:27:35 UTC [13163] LOG: internal sort ended, 1723933 KB
used: CPU 2.72s/11.91u sec elapsed 17.43 sec
2014-08-22 16:29:10 UTC [13185] LOG: internal sort ended, 1723933 KB
used: CPU 2.67s/12.03u sec elapsed 17.40 sec
2014-08-22 16:30:01 UTC [13203] LOG: internal sort ended, 1723933 KB
used: CPU 2.65s/11.97u sec elapsed 17.45 sec

That's a 4.7% regression on the median results, worse than before.

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

If you send me an SSH key by private email I can set you an account up
on this machine, if that's helpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#14)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 20.8.2014 08:11, Jeff Davis wrote:

On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:

The use-case for this is tracking a chosen subtree of contexts - e.g.
aggcontext and below, so I'd expect the tracked subtrees to be relatively
shallow. Am I right?

Right.

My fear is that by removing the inheritance bit, we'll hurt cases with a
lot of child contexts. For example, array_agg currently creates a separate
context for each group - what happens if you have 100k groups and do
MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Good point. We don't want to make checking the allocated space into an
expensive operation, because then we will be forced to call it less
frequently, which implies that we'd be sloppier about actually fitting
in work_mem.

Wouldn't the solution with inheritance and propagating the accounting info
to the parent actually be better? Or maybe allowing both, having two flags
when creating a context instead of one?

That seems overly complicated. We only have one driving use case, so two
orthogonal options sound excessive. Perhaps one option if we can't
solve the performance problem and we need to isolate the changes to
hashagg.

Also, do we really need to track allocated bytes - couldn't we track
kilobytes or something and use smaller data types to get below the 64B?

Good idea.

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.

I don't think we really need to abandon the 'tracked' flag (or that we
should). I think it was useful, and removing it might be one of the
reasons why Robert now sees worse impact than before.

I assume you did that to get below 64B, right? What about changing
'isReset' in MemoryContextData to a 'flags' field instead? It's a bool
now, i.e. 1B -> 8 bits.

I don't think it makes sense to abandon this only to get <= 64B, if it
forces you to touch multiple 64B structures unnecessarily.
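Packing both booleans into the existing one-byte slot could look like this; the flag names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit flags replacing the bool 'isReset' field, so that
 * 'track_mem' costs no extra bytes in MemoryContextData. */
#define CTXFLAG_IS_RESET   ((uint8_t) 0x01)
#define CTXFLAG_TRACK_MEM  ((uint8_t) 0x02)

typedef struct
{
    uint8_t flags;              /* occupies the old isReset byte */
} CtxHeader;

static inline bool
ctx_is_reset(const CtxHeader *h)
{
    return (h->flags & CTXFLAG_IS_RESET) != 0;
}

static inline bool
ctx_tracks_mem(const CtxHeader *h)
{
    return (h->flags & CTXFLAG_TRACK_MEM) != 0;
}

static inline void
ctx_set_reset(CtxHeader *h, bool on)
{
    if (on)
        h->flags |= CTXFLAG_IS_RESET;
    else
        h->flags &= (uint8_t) ~CTXFLAG_IS_RESET;
}
```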

Also, you're doing this for every block in AllocSetReset:

update_allocation(context, -(block->endptr - ((char*) block)));

So if there are 1000 blocks, you'll call that 1000x (and it will
propagate up to TopMemoryContext). Instead keep a local sum and only do
the call once at the end. Again, this might be one of the reasons why
Robert sees ~4% overhead.
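The suggested restructuring, sketched with stand-in types (the real code lives in AllocSetReset and update_allocation):

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct Block
{
    struct Block *next;
    char         *endptr;           /* one past the end of the block */
} Block;

static long accounting_calls = 0;   /* instrumentation for this sketch */
static long bytes_tracked = 0;

/* Stand-in for the real update_allocation, which also walks up the
 * context tree on every call - hence the motivation to call it once. */
static void
update_allocation(long delta)
{
    accounting_calls++;
    bytes_tracked += delta;
}

/* Free every block, but sum the sizes locally and report the total in
 * a single accounting call instead of one call (and one upward walk)
 * per block. */
static void
reset_blocks(Block *blocks)
{
    size_t freed = 0;

    while (blocks != NULL)
    {
        Block *next = blocks->next;

        freed += (size_t) (blocks->endptr - (char *) blocks);
        free(blocks);
        blocks = next;
    }
    update_allocation(-(long) freed);
}
```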

BTW what happens when AllocSetContext is created with initBlockSize that
is not a multiple of 1kB? I see it's enforced to be at least 1024, but I
don't see anything forcing it to be a multiple of 1024?

What if I create a context with initBlockSize=1500? ISTM it'll break the
accounting, right? (Not that it would make sense to create such
contexts, but it's possible.)

Tomas

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#16)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Fri, Aug 22, 2014 at 2:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

I don't think we really need to abandon the 'tracked' flag (or that we
should). I think it was useful, and removing it might be one of the
reasons why Robert now sees worse impact than before.

The version that introduced that flag had the same overhead as the
previous version that didn't, at least in my testing, so I'm not sure
it's at all useful. Part of the problem here is that the number of
instructions that you can afford to take to complete a memory
allocation is just really small, and so every one hurts. Memory
latency may be an issue as well, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#14)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On 20.8.2014 08:11, Jeff Davis wrote:

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

I created a small extension for this purpose, it's available here:

https://github.com/tvondra/palloc_bench

In short, it creates a chain of memory contexts, and then repeats
palloc/pfree a given number of times (with a chosen request size).

It either calls AllocSetContextCreate or AllocSetContextCreateTracked,
depending on whether there's a

#define TRACKING_FLAG

so either leave it there or comment it out. The time is printed as a
warning.

I did a number of tests to get an idea of the overhead, using this call

select palloc_bench(10,100000000,32768);

which means 10 memory contexts, 100000000 palloc/free cycles with 32kB
requests.

And I got these results:

master
-------
3246.03 ms
3125.24 ms
3247.86 ms
3251.85 ms
3113.03 ms
3140.35 ms

v3 patch (AllocSetContextCreate => no tracking)
3303.64 ms
3278.57 ms
3295.11 ms
3325.63 ms
3329.84 ms
3439.27 ms

-- v3 patch (AllocSetContextCreateTracked => tracking)
6296.43 ms
5228.83 ms
5271.43 ms
5158.60 ms
5404.57 ms
5240.40 ms

-- v4 (tracking all the time)
6728.84 ms
6478.88 ms
6478.82 ms
6473.57 ms
6554.96 ms
6528.66 ms

I think this makes the overhead clearly visible. It's also worth mentioning
that this does nothing else except palloc/free, which is not really
what a real workload does. And 100000000 palloc/free of 32kB blocks
means ~3TB of RAM, unless my math is broken.

So looking at the numbers and saying "7 seconds >> 3 seconds, all is
lost" is not really appropriate IMHO.

Anyway, ISTM that v4 is actually a bit more expensive than v3 for some
reason. I'm not entirely sure why, but I suspect it's because of
updating the few additional memory contexts up to TopMemoryContext.
That's something v3 didn't do.

I tried to hack a bit on the idea of using a single byte for the flags
(isReset and track_mem) - patch attached. That got me pretty much to v3
performance (or maybe slightly better):

-- v4 + flag (tracking all the time)
5222.38 ms
4958.37 ms
5072.21 ms
5100.43 ms
5059.65 ms
4995.52 ms

But nothing that'd magically save the day ... and with disabled tracking
we get pretty much to v3 numbers (with track_mem=false). So this
gymnastics gave us pretty much nothing ...

But I have realized that maybe the problem is that we're handling memory
contexts and accounting as 1:1. But that's not really the case - most of
the contexts are not really interested in this. They don't use accounting
now, and that won't change. So only a small fraction of memory contexts
will ask for accounting. Yet all the contexts are forced to pass
accounting info from their children to their parents, if there happens
to be a context with track_mem=true somewhere above them.

And that's exactly the problem, because most of the time is spent in
update_accounting, in the loop over parents.

So my proposal is to separate those two things into two hierarchies,
that are somehow parallel, but not exactly.

That means:

(1) creating a structure with the accounting info

typedef struct MemoryAccountingData
{
    Size        total_allocated;
    Size        self_allocated;

    struct MemoryAccountingData *parent;
} MemoryAccountingData;

(2) adding a pointer to MemoryAccountingData to MemoryContextData, and
a 'track_mem' flag for contexts that requested tracking

typedef struct MemoryContextData
{
    ...
    MemoryAccounting accounting;
    ...
} MemoryContextData;

(3) when a context does not request accounting, it just uses the
accounting pointer from the parent context, and track_mem=false

(4) when the context requests accounting, it allocates its own
    accounting structure, sets accounting->parent to the accounting
    of the parent, and sets track_mem=true

Now all the contexts have a direct pointer to the accounting of the
nearest parent context that explicitly requested accounting, and don't
need to walk through all the parents.

Contexts that did not request tracking have track_mem=false, and their
accounting points to the parent with explicit accounting, or is NULL if
there's no such parent. For these contexts, GetAllocated always returns
0, but that's OK because they haven't requested accounting anyway.

Contexts that requested tracking have track_mem=true, and have their
own accounting instance. The accounting->parent pointer plays pretty
much the same role as 'accounting' with 'track_mem=false' (see the
previous paragraph). These contexts report GetAllocated properly.

Now, I did a quick test with palloc_bench - 1 context with tracking
enabled, and a chain of 10 contexts without tracking (but updating the
accounting for the first context).

And I see this - with tracking enabled:

3235.57 ms
3240.09 ms
3225.47 ms
3306.95 ms
3249.14 ms
3225.56 ms

and with tracking disabled:

3193.43 ms
3169.57 ms
3156.48 ms
3147.12 ms
3142.25 ms
3161.91 ms
3149.97 ms

Which is quite good, IMHO. Disabled is pretty much exactly as master
(i.e. no accounting at all), enabled is about equal to v3 with tracking
disabled.

But maybe I did something stupid in those patches, it's 3AM here ...

Patch attached, consider it an early alpha version.

regards
Tomas

Attachments:

memory-accounting-tomas.patch (text/x-diff) +155 -7
#19Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#15)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote:

Still doesn't look good here. On the same PPC64 machine I've been
using for testing:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.

I've made some progress I think (thanks to both Robert and Tomas), but I
have a bit more microoptimization and testing to do. I'll mark it
"returned with feedback" for now, though if I find the time I'll do more
testing to see if the performance concerns are fully addressed.

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#19)
Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

Jeff Davis <pgsql@j-davis.com> writes:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.

Cute, but it's impossible to believe that a function call isn't going
to be *even more* overhead than what you've got now. This could only
measure out as a win when the feature is going unused.

What about removing the callback per se and just keeping the argument,
as it were. That is, a MemoryContext has a field of type "size_t *"
that is normally NULL, but if set to non-NULL, then we increment the
pointed-to value for pallocs and decrement for pfrees.

One thought in either case is that we don't have to touch the API for
MemoryContextCreate: rather, the feature can be enabled in a separate
call after making the context.

regards, tom lane


#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#20)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#20)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#22)
#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#22)
#25Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#24)
#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#25)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#25)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Simon Riggs (#24)
#29Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#25)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#25)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#29)
#32Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#31)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#32)
#34Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#29)
#35Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#28)
#36Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#33)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#36)
#38Peter Geoghegan
In reply to: Jeff Davis (#34)
#39Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#38)
#40Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#39)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#42)
#44Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#42)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#44)
#46Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#44)
#47Peter Geoghegan
In reply to: Jeff Davis (#46)
#48Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#46)
#49Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Peter Geoghegan (#47)
#50Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#47)
#51Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#44)
#52Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#51)
#53Jeff Davis
pgsql@j-davis.com
In reply to: Tomas Vondra (#51)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#53)