Improving the memory allocator

Started by Andres Freund · over 14 years ago · 12 messages
#1 Andres Freund
andres@anarazel.de

Hi,

In the thread http://archives.postgresql.org/message-id/4DA69D60.4000108@2ndquadrant.com /
"Single client performance on trivial SELECTs" I wondered whether it might be
possible to increase the performance of the current allocator by using some
slab-allocator-like concept.

While avoiding my tax return and sitting through a long, delayed train ride,
I started to play around.

The profile I used in this case was:

pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20

I hacked together a patch which records every allocation to a file. That patch
obviously slows everything down quite a bit, so I only get about a third of
the normal tps.

The current profile for that looks like the following (I hope it's
recognizable). I guess the only unclear column is "compile_time": that's the
result of __builtin_constant_p(size) for the size argument of the various
allocation operations, which is true when the size is known at compile time.

action | file | func | line | size | compile_time | count
--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
chunk_alloc | list.c | new_list | 68 | 16 | t | 940753
chunk_alloc | list.c | new_list | 72 | 24 | t | 940753
chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 202908
chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122
chunk_alloc_zero_aligned | value.c | makeString | 55 | 16 | t | 129122
chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676
chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230
chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 73799
chunk_alloc | bitmapset.c | bms_copy | 119 | 8 | f | 73784
chunk_alloc | lock.c | LockAcquireExtended | 559 | 128 | t | 55441
chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338
chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338
chunk_alloc | nbtsearch.c | _bt_search | 121 | 24 | t | 36908
chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893
chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893
chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893
chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892
chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892
chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892
chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892
chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892
chunk_alloc_zero_aligned | execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892
chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892
chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18529
chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18489
chunk_alloc | nbtree.c | btbeginscan | 385 | 6608 | t | 18489
chunk_alloc | tupdesc.c | CreateTemplateTupleDesc | 63 | 160 | f | 18479
chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18471
chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18471
chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448
chunk_alloc | trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447
chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447
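
To illustrate what drives the compile_time column, here is a tiny standalone
example (not part of the patch; GCC-specific):

#include <stdio.h>

struct cell { void *data; struct cell *next; };

int
main(int argc, char **argv)
{
    (void) argv;
    /* sizeof yields a compile-time constant, as at most palloc() call sites */
    printf("%d\n", __builtin_constant_p(sizeof(struct cell)));  /* prints 1 */
    /* a size derived from a runtime value does not */
    printf("%d\n", __builtin_constant_p(argc * 8));             /* prints 0 */
    return 0;
}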

split by context:

self | action | file | func | line | size | compile_time | count
-----------------------------------------+--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
MessageContext | chunk_alloc | list.c | new_list | 68 | 16 | t | 848520
MessageContext | chunk_alloc | list.c | new_list | 72 | 24 | t | 848520
MessageContext | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 166016
MessageContext | chunk_alloc_zero | bitmapset.c | bms_make_singleton | 189 | 8 | f | 129122
MessageContext | chunk_alloc_zero_aligned | value.c | makeString | 55 | 16 | t | 129122
MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeVar | 73 | 40 | t | 110676
ExecutorState | chunk_alloc | list.c | new_list | 68 | 16 | t | 92230
ExecutorState | chunk_alloc | list.c | new_list | 72 | 24 | t | 92230
MessageContext | chunk_alloc_zero_aligned | makefuncs.c | makeTargetEntry | 216 | 48 | t | 92230
MessageContext | chunk_alloc | bitmapset.c | bms_copy | 119 | 8 | f | 73784
TopMemoryContext | chunk_alloc | lock.c | LockAcquireExtended | 559 | 128 | t | 55441
MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2036 | 40 | t | 55338
TopTransactionContext | chunk_alloc | snapmgr.c | PushActiveSnapshot | 282 | 24 | t | 55338
MessageContext | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36894
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeCatCacheRefs | 612 | 128 | t | 36893
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeRelationRefs | 754 | 128 | t | 36893
TopMemoryContext | chunk_alloc_zero | resowner.c | ResourceOwnerCreate | 135 | 168 | t | 36893
ExecutorState | chunk_alloc | list.c | new_tail_cell | 112 | 16 | t | 36892
ExecutorState | chunk_alloc | nbtsearch.c | _bt_search | 121 | 24 | t | 36892
ExecutorState | chunk_alloc | stringinfo.c | initStringInfo | 50 | 1024 | t | 36892
ExecutorState | chunk_alloc_zero_aligned | execQual.c | ExecInitExpr | 4231 | 24 | t | 36892
ExecutorState | chunk_alloc_zero_aligned | execTuples.c | MakeTupleTableSlot | 118 | 96 | t | 36892
MessageContext | chunk_alloc | nodeFuncs.c | expression_tree_mutator | 2027 | 40 | t | 36892
MessageContext | chunk_alloc_zero_aligned | copyfuncs.c | _copyVar | 1041 | 40 | t | 36892
MessageContext | chunk_alloc_zero_aligned | equivclass.c | add_eq_member | 452 | 32 | t | 36892
MessageContext | chunk_alloc_zero_aligned | gram.y | makeColumnRef | 12286 | 24 | t | 36892
TopTransactionContext | chunk_alloc | snapmgr.c | CopySnapshot | 217 | 72 | f | 36892
TopMemoryContext | chunk_alloc | resowner.c | ResourceOwnerEnlargeBuffers | 524 | 64 | t | 18448
ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 77 | 104 | t | 18447
ExecutorState | chunk_alloc | genam.c | RelationGetIndexScan | 89 | 72 | f | 18447
ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 385 | 6608 | t | 18447
ExecutorState | chunk_alloc | nbtree.c | btbeginscan | 388 | 72 | f | 18447
MessageContext | chunk_alloc | list.c | new_head_cell | 93 | 16 | t | 18447
TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3607 | 112 | t | 18447
TopTransactionContext | chunk_alloc | trigger.c | AfterTriggerBeginXact | 3619 | 192 | t | 18447

1:
For that allocation profile I found that a significant part of the cost is
AllocSetFreeIndex (which is inlined in optimized builds, but still shows up).

If we moved more of this into the headers, the very common cases with
compile-time-known sizes could have the AllocSetFreeIndex computation
eliminated at run time, since it depends only on the size.
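
For illustration, here is a rough sketch of the shape of that computation
(mirroring aset.c's AllocSetFreeIndex with ALLOC_MINBITS = 3, not the exact
source), which shows why a constant size lets the compiler fold the whole
loop away:

#include <stddef.h>

#define ALLOC_MINBITS 3         /* smallest chunk size is 1 << 3 = 8 bytes */

static inline int
alloc_free_index(size_t size)
{
    int         idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;
        while (size != 0)
        {
            idx++;
            size >>= 1;
        }
    }
    return idx;
}

With this visible in a header, a call like alloc_free_index(24) constant-folds
to 2 at -O2, so the common compile-time-size cases pay nothing at run time.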

The problem with that is obviously that it would violate the whole mctx.c
abstraction, as it would have to be known at compile time which memory manager
is used.

Are we willing to reduce that abstraction?

2:
aset.c fragments horribly for long-lived contexts. I haven't generated
sensibly accessible data on that, as I currently just store everything in an
SQL database, and SQL (at least with my queries :P) isn't really efficient for
figuring out when and how much was allocated, given the following schema:
Table "public.allocations"
Column | Type | Modifiers
--------------+---------+----------------------------------------------------------
id | integer | not null default nextval('allocations_id_seq'::regclass)
action | text |
parent | text |
self | text |
size | bigint |
addr | bigint |
file | text |
func | text |
line | integer |
compile_time | boolean |

at least once the data gets sensibly large. A sensible benchmark easily
produces hundreds of millions of rows...

I guess some custom implementation for evaluating that data would be much
better at this.

That fragmentation causes two main problems: 1. space wastage and 2. (more
importantly, imo) cache misses.

3:
Given the frequency of very small allocations, the current space overhead of
at least 16 bytes on 64-bit platforms seems quite high.

I don't think it's too hard to devise a scheme (actually, I have started on
one) where the overhead is only sizeof(void *). But again, that would reduce
the abstraction, because it would make it harder to implement things like
pfree/repalloc/GetMemoryChunkContext, which need to figure out the context
from a pointer.
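
For concreteness, a schematic comparison (field names are made up, not the
exact aset.c definitions):

#include <stddef.h>

/* roughly what currently precedes every chunk on a 64-bit platform */
typedef struct CurrentChunkHeader
{
    void       *context;        /* backlink for pfree/repalloc */
    size_t      size;           /* needed for GetMemoryChunkSpace etc. */
    /* user data follows */
} CurrentChunkHeader;           /* 16 bytes of overhead per allocation */

/* the reduced scheme: a single pointer, everything else derived from it */
typedef struct ReducedChunkHeader
{
    void       *backlink;       /* context, or block for a slab design */
    /* user data follows; the size must be recoverable via the backlink */
} ReducedChunkHeader;           /* sizeof(void *) of overhead per allocation */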

If we had a separate API for small allocations we could even devise a scheme
where a single chunk would have no memory overhead at all, but that seems too
complicated for now.

So after all this, my question basically is: how important do we think the
mctx.c abstraction is?
I personally found the speed of AllocSetAlloc to be the most limiting factor
in several workloads, so I am willing to sacrifice some abstraction to gain
speed here, especially as I hope it's possible to write a single allocator
that is "good enough" for everything.

I have currently started working on two pieces:
* A new memory manager implementation. For that, the equivalent of
AllocSetFreeIndex seems to be the biggest limit so far. It's not really ready
for anything yet, though.
* Recording and replaying MemoryContext* calls to get a somewhat reasonable
and reproducible micro-benchmark. That unfortunately doesn't model the
effects of cache misses very well, but I have no better idea so far.

Any opinions, input, ideas?

Thanks for reading,

Andres

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Improving the memory allocator

On Mon, Apr 25, 2011 at 5:45 PM, Andres Freund <andres@anarazel.de> wrote:

[ lots of awesome test results ]

Very interesting work. I think it's interesting that there is so much
allocation happening inside MessageContext; I barely knew that
existed, let alone that it was a hotspot for memory allocation. I
think it would be interesting to break out the calls to new_list() by
caller. It's been vaguely bothering me for a long time that we store
so many things using List structures. That is not a particularly
efficient representation, because a List of N nodes requires 2N+1
memory allocations - N nodes, N ListCell objects, and the List object
itself. It might be worth experimenting with some kind of array/list
hybrid, like this:

typedef struct BunchOfThings
{
    NodeTag     type;
    int         total_length;
    int         this_node_length;
    struct BunchOfThings *next;
    Thing       data[1];        /* really variable-length */
} BunchOfThings;

This is probably a bit more of a pain to deal with than a plain old
List, and for many purposes it might not be suitable, but there might
be some hotspots where the performance gain is enough to justify the
trouble. In some cases you might be able to get even simpler - just
use a fixed array and, if it fills up, either reallocate-and-copy or
error out.

The problem with that is obviously that it would violate the whole mctx.c
abstraction, as it would have to be known at compile time which memory manager
is used.

Are we willing to reduce that abstraction?

I think it's definitely worth considering, if there is a significant
performance benefit.

Given the frequency of very small allocations, the current space overhead of
at least 16 bytes on 64-bit platforms seems quite high.

Yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: Improving the memory allocator

Andres Freund <andres@anarazel.de> writes:

So after all this, my question basically is: how important do we think the
mctx.c abstraction is?

I think it's pretty important. As a specific example, a new context
type in which pfree() is a no-op would be fine with me. A new context
type in which pfree() dumps core will be useless, or close enough to
useless. That means you can't get rid of the per-chunk back-link to the
context, but you might be able to get rid of the other overhead such as
per-chunk size data. (It might be practical to not support
GetMemoryChunkSize for such contexts, or if it's a slab allocator then
you could possibly know that all the chunks in a given block have size X.)

Another point worth making is that it's a nonstarter to propose running
any large part of the system in memory context types that are incapable
of supporting all the debugging options we rely on (freed-memory-reuse
detection and write-past-end-of-chunk in particular). It's okay if a
production build hasn't got that support, not okay for debug builds.
Perhaps you'll propose using completely different context
implementations in the two cases, but I'd be suspicious of that because
it'd mean the "fast" context code doesn't get any developer testing.

Especially as I hope it's possible to write a single allocator
that is "good enough" for everything.

I'll lay a side bet that that approach is a dead end. If one size fits
all were good enough, we'd not be having this conversation at all. The
point of the mctx interface layer from the beginning was to support
multiple allocator policies, and that's the direction I think we want to
go to improve this.

BTW, what your numbers actually suggest to me is not that we need a
better allocator, but that we need a better implementation of List.
We speculated back when Neil redid List the first time about aggregating
list cells to reduce palloc traffic, but it was left out to keep the
patch complexity down. Now that the bugs have been shaken out it might
be time to have another go at that. In particular, teaching List to
allocate the list head and first cell together would alone remove a
third of your runtime ...
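
A minimal standalone sketch of that idea (the combined struct and its
constructor are hypothetical, and it glosses over what deleting the
co-allocated first cell would mean):

#include <stdlib.h>

typedef struct ListCell
{
    void       *ptr_value;      /* simplified; the real cell holds a union */
    struct ListCell *next;
} ListCell;

typedef struct List
{
    int         type;           /* stands in for the real NodeTag */
    int         length;
    ListCell   *head;
    ListCell   *tail;
} List;

/* hypothetical: list header and first cell in one block */
typedef struct ListWithFirstCell
{
    List        list;
    ListCell    cell;
} ListWithFirstCell;

static List *
new_list_one_alloc(int type, void *datum)
{
    /* one allocation instead of two (palloc in the real thing) */
    ListWithFirstCell *l = malloc(sizeof(ListWithFirstCell));

    l->list.type = type;
    l->list.length = 1;
    l->list.head = &l->cell;
    l->list.tail = &l->cell;
    l->cell.ptr_value = datum;
    l->cell.next = NULL;
    return &l->list;
}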

regards, tom lane

#4 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: Improving the memory allocator

On Tuesday, April 26, 2011 01:39:37 AM Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

So after all this, my question basically is: how important do we think the
mctx.c abstraction is?

I think it's pretty important. As a specific example, a new context
type in which pfree() is a no-op would be fine with me. A new context
type in which pfree() dumps core will be useless, or close enough to
useless.

Well, what I suggested for that would be to use a different API for such
small, static-size allocations.
But as I said, I would prefer to exhaust other possibilities first.

That means you can't get rid of the per-chunk back-link to the
context, but you might be able to get rid of the other overhead such as
per-chunk size data. (It might be practical to not support
GetMemoryChunkSize for such contexts, or if it's a slab allocator then
you could possibly know that all the chunks in a given block have size X.)

For the slab allocator design I have in mind I would need a back pointer to
the block, not the context... That's one other reason why I started thinking
about removing the abstraction.
So far I couldn't envision a clean design where you can intermix two
implementations with such different interpretations. One could have the blocks
and contexts carry an 'allocator' node tag in their first element and switch
over that, but I don't really like that.
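
To make that concrete, a rough sketch of the layout I mean (all names made
up): each chunk keeps a single pointer to its block, and the block carries
the context and the chunk size common to all its chunks:

#include <stddef.h>

typedef struct MemoryContextData *MemoryContext;   /* opaque here */

typedef struct SlabBlock
{
    MemoryContext context;      /* owning context, shared by all chunks */
    size_t      chunk_size;     /* every chunk in this block has this size */
    int         nfree;          /* free chunks remaining in the block */
    /* chunk storage follows */
} SlabBlock;

typedef struct SlabChunk
{
    SlabBlock  *block;          /* the only per-chunk overhead */
    /* user data follows */
} SlabChunk;

/* a pfree()-style lookup becomes pointer -> chunk -> block -> context */
static MemoryContext
slab_get_chunk_context(void *pointer)
{
    SlabChunk  *chunk = (SlabChunk *) ((char *) pointer - sizeof(SlabChunk));

    return chunk->block->context;
}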

And with that abstraction I don't see a way to let the compiler do expensive
work like the freelist index determination at compile time instead of at run
time. And I think pulling such computations out of the runtime path is quite
an important part of the improvements in that area.

Another point worth making is that it's a nonstarter to propose running
any large part of the system in memory context types that are incapable
of supporting all the debugging options we rely on (freed-memory-reuse
detection and write-past-end-of-chunk in particular). It's okay if a
production build hasn't got that support, not okay for debug builds.

Totally with you. I have absolutely no problem with enlarging the chunk
header for debug builds, and I can't envision a design where that would be a
major problem.

Perhaps you'll propose using completely different context
implementations in the two cases, but I'd be suspicious of that because
it'd mean the "fast" context code doesn't get any developer testing.

I don't like that option either. It's *way* too easy to screw up slightly in
that area.

Especially as I hope it's possible to write a single allocator
that is "good enough" for everything.

I'll lay a side bet that that approach is a dead end. If one size fits
all were good enough, we'd not be having this conversation at all. The
point of the mctx interface layer from the beginning was to support
multiple allocator policies, and that's the direction I think we want to
go to improve this.

I don't think I am with you here. I haven't been around that long, so I might
be missing something, but I haven't found much evidence of anybody trying to
improve the allocator as a whole, rather than just the currently problematic
pieces, in the last 10+ years. So I don't see enough evidence that an
allocator that's good enough for all needs is impossible.

I very much fear having to figure out which allocator to use where. I don't
see that working very well.

BTW, what your numbers actually suggest to me is not that we need a
better allocator, but that we need a better implementation of List.
We speculated back when Neil redid List the first time about aggregating
list cells to reduce palloc traffic, but it was left out to keep the
patch complexity down. Now that the bugs have been shaken out it might
be time to have another go at that. In particular, teaching List to
allocate the list head and first cell together would alone remove a
third of your runtime ...

That's certainly true for that workload. The hotspot is somewhere else
entirely, though, once you run even mildly more complex statements than
pgbench's default read-only ones.
It's actually not easy to find any workload that isn't totally IO-bound where
memory allocation is not in the top 5 of a profile...

But sure, improving that point is a good idea independent of the allocator.
One less allocation won't hurt any allocator ;)

Greetings,

Andres

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: Improving the memory allocator

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 25, 2011 at 5:45 PM, Andres Freund <andres@anarazel.de> wrote:

[ lots of awesome test results ]

Very interesting work. I think it's interesting that there is so much
allocation happening inside MessageContext; I barely knew that
existed, let alone that it was a hotspot for memory allocation.

The reason it shows as a hotspot is that parsing and planning happen in
that context for the simple-Query code path. You'd get different
answers if you were using extended-query protocol, or running prepared
statements, or doing the bulk of your work in plpgsql, etc etc.

It would be interesting if Andres could dig down another level or so in
the call stack to see where the list creations etc are called from, but
I'll bet quite a bit that what we're looking at is mostly parse/plan
activity. (The high showing of bitmapset activity is a dead tipoff for
planner, in particular, since that datatype is hardly used at all
elsewhere.)

regards, tom lane

#6 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: Improving the memory allocator

On Tuesday, April 26, 2011 01:24:20 AM Robert Haas wrote:

On Mon, Apr 25, 2011 at 5:45 PM, Andres Freund <andres@anarazel.de> wrote:

[ lots of awesome test results ]

Very interesting work. I think it's interesting that there is so much
allocation happening inside MessageContext; I barely knew that
existed, let alone that it was a hotspot for memory allocation.

Yes, I was surprised as well. I haven't checked yet what the call sites for
that are.
As I just wrote to Tom, the hotspots are wildly different if you use slightly
more complex statements. Which doesn't mean it's not worthy of improvement ;)

I think it would be interesting to break out the calls to new_list() by
caller. It's been vaguely bothering me for a long time that we store
so many things using List structures. That is not a particularly
efficient representation, because a List of N nodes requires 2N+1
memory allocations - N nodes, N ListCell objects, and the List object
itself. It might be worth experimenting with some kind of array/list
hybrid, like this:
...

While I agree that that might be beneficial, I think in this case nearly all
of the lists are one element long, because the callers mostly look like:
{{{
List *
lappend(List *list, void *datum)
{
    Assert(IsPointerList(list));

    if (list == NIL)
        list = new_list(T_List);
    else
        new_tail_cell(list);

    lfirst(list->tail) = datum;
    check_list_invariants(list);
    return list;
}
}}}

Hm. Seems worthy of some further investigation...

Thanks,

Andres

#7 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Improving the memory allocator

On Tuesday, April 26, 2011 02:22:34 AM Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 25, 2011 at 5:45 PM, Andres Freund <andres@anarazel.de> wrote:

[ lots of awesome test results ]

Very interesting work. I think it's interesting that there is so much
allocation happening inside MessageContext; I barely knew that
existed, let alone that it was a hotspot for memory allocation.

The reason it shows as a hotspot is that parsing and planning happen in
that context for the simple-Query code path. You'd get different
answers if you were using extended-query protocol, or running prepared
statements, or doing the bulk of your work in plpgsql, etc etc.

It would be interesting if Andres could dig down another level or so in
the call stack to see where the list creations etc are called from, but
I'll bet quite a bit that what we're looking at is mostly parse/plan
activity. (The high showing of bitmapset activity is a dead tipoff for
planner, in particular, since that datatype is hardly used at all
elsewhere.)

Currently that's just output via

#define MemoryContextAlloc(context, size) \
    TraceMemoryContextAlloc(context, size, __FILE__, __FUNCTION__, __LINE__, \
                            __builtin_constant_p(size))

which in turn calls the ordinary MemoryContextAlloc (yes, ugly, I know).
But I guess it shouldn't be a problem to get more information...

Andres

#8 Radosław Smogura
rsmogura@softperience.eu
In reply to: Andres Freund (#7)
Re: Improving the memory allocator

I haven't followed this topic carefully, so sorry if I write something that
has already been covered, but I thought about the following approach, at
least for message sending etc.:

1. When initializing a MemoryContext (pool), take one parameter giving the
stack size. The stack is an addition to normal operations.
2. Operations on the stack are allocation-only, so alloc is just a pointer
bump.
3. When allocating memory for messages, internally make the size 1/3 greater
(or 2/3, I don't remember well) for realloc; this may help when an encoding
change occurs.
4. free is a no-op (as was pointed out); the stack is released, or the
pointer reset, when the context is released.

From one point of view this may waste some memory (no free), but for messages
that should not happen, and even giving 100kB of memory for the stack may be
quite enough, depending of course on how much data you send.
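
A minimal standalone sketch of what I mean (all names made up; the real thing
would carve the stack out of the context rather than using malloc):

#include <stdlib.h>

typedef struct BumpStack
{
    char       *base;           /* preallocated area, e.g. 100kB */
    char       *free_ptr;       /* next allocation starts here */
    char       *end;            /* base + stack size */
} BumpStack;

static BumpStack *
bump_create(size_t stack_size)
{
    BumpStack  *s = malloc(sizeof(BumpStack));

    s->base = malloc(stack_size);
    s->free_ptr = s->base;
    s->end = s->base + stack_size;
    return s;
}

/* alloc is just a pointer bump */
static void *
bump_alloc(BumpStack *s, size_t size)
{
    void       *result;

    size = (size + 7) & ~(size_t) 7;    /* keep MAXALIGN-style alignment */
    if (s->free_ptr + size > s->end)
        return NULL;            /* the real thing would fall back to palloc */
    result = s->free_ptr;
    s->free_ptr += size;
    return result;
}

/* free is a no-op; everything is reclaimed when the context is reset */
static void
bump_reset(BumpStack *s)
{
    s->free_ptr = s->base;
}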

Regards,
Radek

#9 Andres Freund
andres@anarazel.de
In reply to: Radosław Smogura (#8)
Re: Improving the memory allocator

On Tuesday, April 26, 2011 01:59:45 PM Radosław Smogura wrote:

I haven't followed this topic carefully, so sorry if I write something that
has already been covered, but I thought about the following approach, at
least for message sending etc.:

1. When initializing a MemoryContext (pool), take one parameter giving the
stack size. The stack is an addition to normal operations.

You can't allocate memory on the stack and return that.

Andres

#10 Radosław Smogura
rsmogura@softperience.eu
In reply to: Andres Freund (#9)
Re: Improving the memory allocator

On Tue, 26 Apr 2011 14:25:10 +0200, Andres Freund wrote:

On Tuesday, April 26, 2011 01:59:45 PM Radosław Smogura wrote:

I haven't followed this topic carefully, so sorry if I write something that
has already been covered, but I thought about the following approach, at
least for message sending etc.:

1. When initializing a MemoryContext (pool), take one parameter giving the
stack size. The stack is an addition to normal operations.

You can't allocate memory on the stack and return that.

Andres

I didn't write "allocate on the stack". I thought about preallocating an
n-byte chunk used as stack-like memory, but without popping.

Regards,
Radek

#11 Andres Freund
andres@anarazel.de
In reply to: Radosław Smogura (#10)
Re: Improving the memory allocator

On Tuesday, April 26, 2011 03:21:05 PM Radosław Smogura wrote:

On Tue, 26 Apr 2011 14:25:10 +0200, Andres Freund wrote:

On Tuesday, April 26, 2011 01:59:45 PM Radosław Smogura wrote:

I haven't followed this topic carefully, so sorry if I write something that
has already been covered, but I thought about the following approach, at
least for message sending etc.:

1. When initializing a MemoryContext (pool), take one parameter giving the
stack size. The stack is an addition to normal operations.

You can't allocate memory on the stack and return that.

Andres

I didn't write "allocate on the stack". I thought about preallocating an
n-byte chunk used as stack-like memory, but without popping.

The implementation already does something like that.

#12 Greg Smith
greg@2ndQuadrant.com
In reply to: Andres Freund (#1)
Re: Improving the memory allocator

On 04/25/2011 05:45 PM, Andres Freund wrote:

The profile I used in this case was:

pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20

I'd suggest collecting data from running this with "-M prepared" at some
point too, so that you can get a basic idea which of these allocations
are avoided when using prepared statements.
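
For instance, something like:

pgbench -h /tmp/ -p5433 -M prepared -s 30 pgbench -S -T 20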

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD