Memory Accounting v11

Started by Jeff Davis over 10 years ago · 26 messages
#1 Jeff Davis
pgsql@j-davis.com
1 attachment(s)

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.
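
For illustration, the intended call pattern is roughly the following
sketch (the HashAgg-side names are hypothetical; only
MemoryContextMemAllocated() is part of this patch):

/* hypothetical check inside HashAgg; the limit and spill routine are
 * illustrative only */
if (MemoryContextMemAllocated(aggstate->hashcontext, true) >
    aggstate->mem_limit)
    hashagg_spill_to_disk(aggstate);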

Previous discussion here:

/messages/by-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.
* Previous versions of the patch updated the parent contexts'
allocations as well, which avoided the need to recurse when querying the
total allocation. That seemed to penalize cases that didn't need to
track the allocation. We discussed trying to "opt-in" to this behavior,
but it seemed more awkward than helpful. Now, the patch only updates the
allocation of the current context, and querying means recursing through
the child contexts.
* There was a concern that, if MemoryContextMemAllocated needs to
recurse to the child contexts, it will be too slow for HashAgg of
array_agg, which creates a child context per group. That was solved with
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.

The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem. Others
have also mentioned that we might want to use this mechanism to track
memory for other operators, like Sort or HashJoin, which might be
simpler and more accurate.

Regards,
Jeff Davis

Attachments:

memory-accounting-v11.diff (text/x-patch)
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 500,505 **** AllocSetContextCreate(MemoryContext parent,
--- 500,508 ----
  					 errdetail("Failed while creating memory context \"%s\".",
  							   name)));
  		}
+ 
+ 		((MemoryContext) set)->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***************
*** 590,595 **** AllocSetReset(MemoryContext context)
--- 593,600 ----
  		else
  		{
  			/* Normal case, release the block */
+ 			context->mem_allocated -= block->endptr - ((char*) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***************
*** 632,637 **** AllocSetDelete(MemoryContext context)
--- 637,644 ----
  	{
  		AllocBlock	next = block->next;
  
+ 		context->mem_allocated -= block->endptr - ((char *) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***************
*** 672,677 **** AllocSetAlloc(MemoryContext context, Size size)
--- 679,687 ----
  		block = (AllocBlock) malloc(blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		context->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***************
*** 861,866 **** AllocSetAlloc(MemoryContext context, Size size)
--- 871,878 ----
  		if (block == NULL)
  			return NULL;
  
+ 		context->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***************
*** 964,969 **** AllocSetFree(MemoryContext context, void *pointer)
--- 976,984 ----
  			set->blocks = block->next;
  		else
  			prevblock->next = block->next;
+ 
+ 		context->mem_allocated -= block->endptr - ((char*) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***************
*** 1077,1082 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1092,1098 ----
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***************
*** 1094,1102 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1110,1123 ----
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block->endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		context->mem_allocated += blksize - oldblksize;
+ 
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
  		/* Update pointers since block has likely been moved */
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 477,482 **** MemoryContextIsEmpty(MemoryContext context)
--- 477,505 ----
  }
  
  /*
+  * Find the memory allocated to blocks for this memory context. If recurse is
+  * true, also include children.
+  */
+ int64
+ MemoryContextMemAllocated(MemoryContext context, bool recurse)
+ {
+ 	int64 total = context->mem_allocated;
+ 
+ 	AssertArg(MemoryContextIsValid(context));
+ 
+ 	if (recurse)
+ 	{
+ 		MemoryContext child = context->firstchild;
+ 		for (child = context->firstchild;
+ 			 child != NULL;
+ 			 child = child->nextchild)
+ 			total += MemoryContextMemAllocated(child, true);
+ 	}
+ 
+ 	return total;
+ }
+ 
+ /*
   * MemoryContextStats
   *		Print statistics about the named context and all its descendants.
   *
***************
*** 636,641 **** MemoryContextCreate(NodeTag tag, Size size,
--- 659,665 ----
  	node->firstchild = NULL;
  	node->nextchild = NULL;
  	node->isReset = true;
+ 	node->mem_allocated = 0;
  	node->name = ((char *) node) + size;
  	strcpy(node->name, name);
  
*** a/src/include/nodes/memnodes.h
--- b/src/include/nodes/memnodes.h
***************
*** 63,68 **** typedef struct MemoryContextData
--- 63,69 ----
  	MemoryContext nextchild;	/* next child of same parent */
  	char	   *name;			/* context name (just for debugging) */
  	MemoryContextCallback *reset_cbs;	/* list of reset/delete callbacks */
+ 	int64		mem_allocated;	/* track memory allocated for this context */
  } MemoryContextData;
  
  /* utils/palloc.h contains typedef struct MemoryContextData *MemoryContext */
*** a/src/include/utils/memutils.h
--- b/src/include/utils/memutils.h
***************
*** 103,108 **** extern Size GetMemoryChunkSpace(void *pointer);
--- 103,109 ----
  extern MemoryContext GetMemoryChunkContext(void *pointer);
  extern MemoryContext MemoryContextGetParent(MemoryContext context);
  extern bool MemoryContextIsEmpty(MemoryContext context);
+ extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
  extern void MemoryContextStats(MemoryContext context);
  extern void MemoryContextAllowInCriticalSection(MemoryContext context,
  									bool allow);
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#1)
Re: Memory Accounting v11

Jeff Davis <pgsql@j-davis.com> writes:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().
...
My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.

Maybe, but this here is a pretty weak argument:

The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem.

Meh. HashAgg could track its memory usage without loading the entire
system with a penalty. Moreover, this is about fourth or fifth down the
list of the implementation problems with spilling hash aggregation to
disk. It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.

Others have also mentioned that we might want to use this mechanism to
track memory for other operators, like Sort or HashJoin, which might be
simpler and more accurate.

That would imply redefining the meaning of work_mem for those operations;
it would suddenly include a lot more overhead space than it used to,
causing them to spill to disk much more quickly than before, unless one
changes the work_mem setting to compensate. Not sure that people would
like such a change.

An even bigger problem is that it would pretty much break the logic around
LACKMEM() in tuplesort.c, which supposes that spilling individual tuples
to disk is a reliable and linear way to decrease the accounted-for memory.
Individual pfree's would typically not decrease the accounted total in
this implementation. Depending on what you have in mind for the
spill-to-disk behavior, this issue could be a fail for HashAgg as well,
which is another reason not to commit this patch in advance of seeing the
use-case.
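
(For reference, tuplesort.c's bookkeeping is roughly the following,
paraphrasing its USEMEM/FREEMEM/LACKMEM macros from memory:)

/* paraphrased from tuplesort.c */
#define USEMEM(state,amt)	((state)->availMem -= (amt))
#define FREEMEM(state,amt)	((state)->availMem += (amt))
#define LACKMEM(state)		((state)->availMem < 0)

/* Writing a tuple to tape does FREEMEM(state, GetMemoryChunkSpace(tuple)),
 * which reliably decreases the accounted total; per-tuple pfree's have no
 * such effect on block-level accounting. */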

regards, tom lane


#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#1)
Re: Memory Accounting v11

Hi,

On 06/14/15 21:43, Jeff Davis wrote:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

I see it's adding the new field as int64 - wouldn't a Size be more
appropriate, considering that's what we use in mcxt.c and aset.c?

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can
recurse and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.

Previous discussion here:

/messages/by-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.

* Previous versions of the patch updated the parent contexts'
allocations as well, which avoided the need to recurse when querying
the total allocation. That seemed to penalize cases that didn't need
to track the allocation. We discussed trying to "opt-in" to this
behavior, but it seemed more awkward than helpful. Now, the patch
only updates the allocation of the current context, and querying
means recursing through the child contexts.

I don't think the opt-in idea itself was awkward; it was rather about
the particular APIs that we came up with, especially when combined with
the 'context inheritance' thing.

I still think the opt-in approach and updating accounting for the parent
contexts was the best one, because it (a) minimizes impact in cases that
don't use the accounting, and (b) makes finding the current amount of
memory cheap ...

* There was a concern that, if MemoryContextMemAllocated needs to
recurse to the child contexts, it will be too slow for HashAgg of
array_agg, which creates a child context per group. That was solved with
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

I wouldn't say this was "solved" - we have fixed one particular example
of such an aggregate implementation, because it was causing OOM issues
with many groups, but there may be other custom aggregates using the
same pattern.

Granted, built-in aggregates are probably more critical than aggregates
provided by extensions, but I wouldn't dare to mark this solved ...

My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how
to fix them. Adding one field to a struct and a few arithmetic
operations for each malloc() or realloc() seems reasonable to me.

I'm not buying this, sorry. While I agree that we should not expect the
memory accounting to be entirely free, we should be very careful about
the overhead, especially if we're dropping the opt-in and thus imposing
the overhead on everyone.

But the "performance concerns are not a reason to block this patch"
approach seems wrong. With any other patch a 3% regression would be
considered a serious issue, IMNSHO.

The current state, where HashAgg just blows up the memory, is just
not reasonable, and we need to track the memory to fix that problem.
Others have also mentioned that we might want to use this mechanism
to track memory for other operators, like Sort or HashJoin, which
might be simpler and more accurate.

Dropping the existing memory accounting implementations and keeping
just this new solution would be nice, but only if we agree the
performance impact is acceptable. We already have an accounting
solution for each of those places, so I don't think the unification
alone outweighs the regression.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Memory Accounting v11

Hi,

On 06/14/15 22:21, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().
...
My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.

Maybe, but this here is a pretty weak argument:

The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem.

Meh. HashAgg could track its memory usage without loading the entire
system with a penalty.

+1 to a solution like that, although I don't think that's doable without
digging the info from memory contexts somehow.

Moreover, this is about fourth or fifth down the
list of the implementation problems with spilling hash aggregation to
disk. It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.

I don't think so. If we don't have an acceptable solution for tracking
memory in hashagg, there's not really much point in doing any of the
following steps. That's why Jeff is tackling this problem first.

Also, Jeff already posted a PoC of the memory-bounded hashagg, although
it only worked for aggregates with fixed-size state and was not that
great for cases like array_agg where the state grows. Maybe the
improvements to aggregate functions proposed by David Rowley last week
might help in those cases, as that'd allow spilling those states to
disk.

So while the approach proposed by Jeff may turn out not to be the right
one, I think memory accounting needs to be solved first (even if it's
not committed until the whole feature is ready).

Others have also mentioned that we might want to use this mechanism
to track memory for other operators, like Sort or HashJoin, which
might be simpler and more accurate.

That would imply redefining the meaning of work_mem for those
operations; it would suddenly include a lot more overhead space than
it used to, causing them to spill to disk much more quickly than
before, unless one changes the work_mem setting to compensate. Not
sure that people would like such a change.

Don't we tweak the work_mem meaning over time anyway? Maybe not to such
an extent, but for example the hashjoin 9.5 improvements certainly
change this by packing the tuples more densely, sizing the buckets
differently, etc. It's true the changes are in the other direction
(i.e. batching less frequently), though.

OTOH this would make the accounting more accurate (eliminating
differences between cases with different amounts of overhead because of
different tuple widths, for example). Wouldn't that be a good thing,
even if people need to tweak work_mem? Isn't that kinda acceptable when
upgrading to the next major version?

An even bigger problem is that it would pretty much break the logic
around LACKMEM() in tuplesort.c, which supposes that spilling
individual tuples to disk is a reliable and linear way to decrease
the accounted-for memory. Individual pfree's would typically not
decrease the accounted total in this implementation. Depending on
what you have in mind for the spill-to-disk behavior, this issue
could be a fail for HashAgg as well, which is another reason not to
commit this patch in advance of seeing the use-case.

That's true, but I think the plan was always to wait for the whole
feature, and only then commit all the pieces.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#5 Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#2)
Re: Memory Accounting v11

On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote:

Meh. HashAgg could track its memory usage without loading the entire
system with a penalty.

When I tried doing it outside of the MemoryContexts, it seemed to get
awkward quickly. I'm open to suggestion if you can point me in the right
direction. Maybe I can peek at the sizes of chunks holding state values
and group keys, and combine that with the hash table size-estimating
code?
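
Roughly this kind of sketch, perhaps (hash_mem_used is a hypothetical
field, and the per-group/per-agg field names are illustrative;
GetMemoryChunkSpace() is the existing API):

/* sketch: accumulate the chunk sizes of by-reference transition values
 * as each group's state is advanced */
if (!pergroupstate->transValueIsNull && !peraggstate->transtypeByVal)
    aggstate->hash_mem_used +=
        GetMemoryChunkSpace(DatumGetPointer(pergroupstate->transValue));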

Moreover, this is about fourth or fifth down the
list of the implementation problems with spilling hash aggregation to
disk. It would be good to see credible solutions for the bigger issues
before we buy into adding overhead for a mechanism with no uses in core.

I had several iterations of a full implementation of the spill-to-disk
HashAgg patch[1]. Tomas Vondra had some constructive review comments,
but all of them seemed solvable. What do you see as a major unsolved
issue?

If I recall, you were concerned about things like array_agg, where an
individual state could get larger than work_mem. That's a valid concern,
but it's not the problem I was trying to solve.

Regards,
Jeff Davis

[1]: /messages/by-id/1407706010.6623.16.camel@jeff-desktop


#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tomas Vondra (#4)
Re: Memory Accounting v11

On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

The current state, where HashAgg just blows up the memory, is just not

reasonable, and we need to track the memory to fix that problem.

Meh. HashAgg could track its memory usage without loading the entire
system with a penalty.

+1 to a solution like that, although I don't think that's doable without
digging the info from memory contexts somehow.

Jeff is right: we desperately need a solution, and this is the place to
start.

Tom's concern remains valid: we must not load the entire system with a
penalty.

The only questions I have are:

* If the memory allocations adapt to the usage pattern, then we expect to
see few memory chunk allocations. Why are we expecting "the entire system"
to experience a penalty?

* If we do not manage our resources, how are we certain this does not
induce a penalty? Not tracking memory could be worse than tracking it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 CK Tan
cktan@vitessedata.com
In reply to: Simon Riggs (#6)
Re: Memory Accounting v11

On 14 June 2015 at 23:51, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

The current state, where HashAgg just blows up the memory, is just not

reasonable, and we need to track the memory to fix that problem.

Meh. HashAgg could track its memory usage without loading the entire
system with a penalty.

+1 to a solution like that, although I don't think that's doable without
digging the info from memory contexts somehow.

I am sorry to ask questions unrelated to the subject, but how is tracking
memory going to fix the HashAgg blow up problem? Is there a plan to make
HashAgg not blow up (i.e. spill the hash table)?

Thanks,
-cktan


#8 Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: CK Tan (#7)
Re: Memory Accounting v11

On Thu, Jul 2, 2015 at 11:37 AM, CK Tan <cktan@vitessedata.com> wrote:

I am sorry to ask questions unrelated to the subject, but how is tracking
memory going to fix the HashAgg blow up problem? Is there a plan to make
HashAgg not blow up (i.e. spill the hash table)?

Your question is probably answered here:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop

Regards,
Qingqing


#9 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeff Davis (#1)
1 attachment(s)
Re: Memory Accounting v11

On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com

Hi Jeff,

I've been looking over the code and reading the previous emails about
this patch.
I don't yet understand if the reported slowdown is from the increase in
struct size or from the additional work being done in palloc() calls;
however, on reading the code I did notice an existing redundant NULL
check in AllocSetAlloc() right where you put your extra accounting code.

The attached patch should apply on top of your patch and removes the extra
NULL check.

Perhaps if some of the overhead is from the extra instructions then this
can help get us back to where we were before.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

duplicate_null_check_removal.diff (text/plain)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 3c5d651..ae733d8 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -860,17 +860,16 @@ AllocSetAlloc(MemoryContext context, Size size)
 		 * We could be asking for pretty big blocks here, so cope if malloc
 		 * fails.  But give up if there's less than a meg or so available...
 		 */
-		while (block == NULL && blksize > 1024 * 1024)
+		while (block == NULL)
 		{
 			blksize >>= 1;
-			if (blksize < required_size)
-				break;
+
+			/* has the block size gotten too small? */
+			if (blksize < required_size || blksize <= 1024 * 1024)
+				return NULL;
 			block = (AllocBlock) malloc(blksize);
 		}
 
-		if (block == NULL)
-			return NULL;
-
 		context->mem_allocated += blksize;
 
 		block->aset = set;
#10 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeff Davis (#1)
1 attachment(s)
Re: Memory Accounting v11

On 15 June 2015 at 07:43, Jeff Davis <pgsql@j-davis.com> wrote:

This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.

Previous discussion here:

/messages/by-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.

I've been looking at this patch and trying to reproduce the reported
slowdown by using Tomas' function to exercise palloc() with minimal
overhead from other code:

https://github.com/tvondra/palloc_bench

I also wanted to determine if the slowdown was due to the mem_allocated
maintenance or down to the struct size increasing. I decided to test
this by simply commenting out all of the context->mem_allocated += / -=
updates while keeping the mem_allocated field in the struct, but I've
been really struggling to see any sort of performance decrease anywhere.
I'm just getting too much variation in the test results to get any sort
of idea.

I've attached a spreadsheet of the results I saw. It would be really good
if Robert was able to test with the IBM PPC just with the extra field in
the struct so we can see if the 1-3% lies there, or if it's in overhead of
keeping mem_allocated up-to-date.

My main question here is: How sure are you that none of your intended use
cases will need to call MemoryContextMemAllocated with recurse = true? I'd
imagine if you ever have to use MemoryContextMemAllocated(ctx, true) then
we'll all be sitting around with our stopwatches out to see if the call
incurs too much of a performance penalty.

Other small issues:

+ oldblksize = block->endptr - ((char *)block);
+
  block = (AllocBlock) realloc(block, blksize);
  if (block == NULL)
  return NULL;
+
+ context->mem_allocated += blksize - oldblksize;
+

Shouldn't you be setting oldblksize after the "if"? I'd imagine the
realloc() failure is rare, but it just seems better to do it just before
it's required and only when required.

+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+  child != NULL;

Here there's a double assignment of "child".

***************
*** 632,637 **** AllocSetDelete(MemoryContext context)
--- 637,644 ----
  {
  AllocBlock next = block->next;
+ context->mem_allocated -= block->endptr - ((char *) block);
+

I might be mistaken here, but can't you just set context->mem_allocated = 0;
after that loop?
Or maybe it would be an improvement to only do the decrement
if MEMORY_CONTEXT_CHECKING is defined, then Assert() that mem_allocated ==
0 after the loop...

One other thing that I don't remember seeing discussed would be to just
store a List in each context which would store all of the mem_allocated's
that need to be updated during each allocation and deallocation of memory.
The list would be setup once when the context is created. When a child
context is setup we'd just loop up the parent context chain and
list_concat() all of the parent's lists onto the child's lists. Would this
method add too much overhead when a context is deleted? Has this been
explored already?

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mem_accounting_benchmark.ods (application/vnd.oasis.opendocument.spreadsheet)
#11 Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#10)
Re: Memory Accounting v11

On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:

of performance decrease anywhere. I'm just getting too much variation
in the test results to get any sort of idea.

That was my experience as well. Thank you for taking a look.

My main question here is: How sure are you that none of your intended
use cases will need to call MemoryContextMemAllocated with recurse =
true? I'd imagine if you ever have to
use MemoryContextMemAllocated(ctx, true) then we'll all be sitting
around with our stopwatches out to see if the call incurs too much of
a performance penalty.

I don't expect HashAgg to be using many memory contexts. It did with
array_agg, but that's been changed, and was a bad pattern anyway (one
context per group is quite wasteful).

If it turns out to be slow, we can call it less frequently (e.g.
extrapolate growth rate to determine when to check).
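
Something like this sketch, for example (all of the fields here are
hypothetical):

/* check actual memory only every check_interval groups, widening the
 * interval while usage is far from the limit */
if (++aggstate->groups_since_check >= aggstate->check_interval)
{
    int64	mem = MemoryContextMemAllocated(aggstate->aggcontext, true);

    if (mem >= aggstate->mem_limit)
        hashagg_start_spilling(aggstate);	/* hypothetical */
    else if (mem < aggstate->mem_limit / 2)
        aggstate->check_interval *= 2;
    else
        aggstate->check_interval = 1;
    aggstate->groups_since_check = 0;
}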

Shouldn't you be setting oldblksize after the "if"? I'd imagine the
realloc() failure is rare, but it just seems better to do it just
before it's required and only when required.

I don't think that's safe. Realloc can return an entirely new pointer,
so the endptr would be pointing outside of the allocation. I'd have to
hang on to the original pointer before the realloc, so I'd need an extra
variable anyway.

Here there's a double assignment of "child".

Will fix.

I might be mistaken here, but can't you just set context->mem_allocated
= 0; after that loop?
Or maybe it would be an improvement to only do the decrement
if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
mem_allocated == 0 after the loop...

OK. Why not just always Assert that?

One other thing that I don't remember seeing discussed would be to
just store a List in each context which would store all of the
mem_allocated's that need to be updated during each allocation and
deallocation of memory. The list would be setup once when the context
is created. When a child context is setup we'd just loop up the parent
context chain and list_concat() all of the parent's lists onto the
child's lists. Would this method add too much overhead when a context
is deleted? Has this been explored already?

That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.
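
Roughly like this, as a sketch (the accounting field and all names here
are hypothetical):

/* built once at context creation: pointers to our own counter and to
 * every ancestor's counter */
typedef struct MemoryAccounting
{
	int		ncounters;
	int64  *counters[FLEXIBLE_ARRAY_MEMBER];	/* &ctx->mem_allocated, ... */
} MemoryAccounting;

static void
update_allocation(MemoryContext context, int64 size)
{
	MemoryAccounting *acct = context->accounting;	/* hypothetical field */
	int			i;

	for (i = 0; i < acct->ncounters; i++)
		*acct->counters[i] += size;
}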

Regards,
Jeff Davis


#12 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeff Davis (#11)
Re: Memory Accounting v11

On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:

I might be mistaken here, but can't you just set context->mem_allocated
= 0; after that loop?
Or maybe it would be an improvement to only do the decrement
if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
mem_allocated == 0 after the loop...

OK. Why not just always Assert that?

Well, I thought the per-loop decrement of mem_allocated was wasteful;
shouldn't it always get to zero after the final loop anyway?
If so, for efficiency it would be better to zero it after the loop, but
if we do that then we can't assert for zero.

One other thing that I don't remember seeing discussed would be to
just store a List in each context which would store all of the
mem_allocated's that need to be updated during each allocation and
deallocation of memory. The list would be setup once when the context
is created. When a child context is setup we'd just loop up the parent
context chain and list_concat() all of the parent's lists onto the
child's lists. Would this method add too much overhead when a context
is deleted? Has this been explored already?

That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.

Oh, an array is even better, but why more space? Won't it just be a
pointer to an array? It can be NULL if there's nothing to track. Sounds
like it would be the same amount of space, at least on a 64-bit machine.

One thing to keep in mind is that I think if a parent is tracking
memory, then a child will also need to track memory too, as when the
child context is deleted, we'll need to decrement the parent's
mem_allocated by that of all its child contexts.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13 Andres Freund
andres@anarazel.de
In reply to: David Rowley (#10)
Re: Memory Accounting v11

On 2015-07-07 16:49:58 +1200, David Rowley wrote:

I've been looking at this patch and trying to reproduce the reported
slowdown by using Tomas' function to try to exercise palloc() with minimal
overhead of other code:

https://github.com/tvondra/palloc_bench

That's not necessarily meaningful. Increased cache footprint (both data
and branch prediction) is often only noticeable with additional code
running that pushes stuff out.


#14 David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#12)
Re: Memory Accounting v11

On 7 July 2015 at 20:23, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 7 July 2015 at 18:59, Jeff Davis <pgsql@j-davis.com> wrote:

One other thing that I don't remember seeing discussed would be to

just store a List in each context which would store all of the
mem_allocated's that need to be updated during each allocation and
deallocation of memory. The list would be setup once when the context
is created. When a child context is setup we'd just loop up the parent
context chain and list_concat() all of the parent's lists onto the
child's lists. Would this method add too much overhead when a context
is deleted? Has this been explored already?

That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.

Oh an array is even better, but why more space? won't it just be a pointer
to an array? It can be NULL if there's nothing to track. Sounds like it
would be the same amount of space, at least on a 64 bit machine.

I'd imagine that the first element of the array could just store its
length. The type might be slightly wider than what you need for an
array length, but this'll save having an extra field in the struct for
it.

Are you going to implement this, or are you happy that the single-level
context tracking is the right thing?
I'm going to mark the patch as waiting on author for now.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15 Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#14)
1 attachment(s)
Re: Memory Accounting v11

On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote:

Are you going to implement this? or are you happy with the single
level context tracking is the right thing?
I'm going to mark the patch as waiting on author for now.

Attached is a version of the patch that does multi-level tracking, v12.
It does this in a simpler way, just like an earlier version of the
patch: it simply traverses up the chain and adds to each parent's
"total_allocated" field.

The idea you mentioned is a possible optimization of this idea, but it
only makes sense if I'm able to detect a difference between v11
(single-level) and v12 (multi-level). I tried Robert's test[1] again and
I didn't see a difference on my workstation (technically, v12 came out
the fastest, which means I'm just seeing noise anyway), so I can't
evaluate whether your idea will improve things.

After talking with a few people at PGCon, small noisy differences in CPU
timings can appear for almost any tweak to the code, and aren't
necessarily cause for major concern.

Regards,
Jeff Davis

[1]: pgbench -i -s 300, then do the following 3 times each for master, v11, and v12, and take the median of logged traces:

start server; set trace_sort=on; reindex index pgbench_accounts_pkey;
stop server

Attachments:

memory-accounting-v12.patch (text/x-patch)
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 242,247 **** typedef struct AllocChunkData
--- 242,249 ----
  #define AllocChunkGetPointer(chk)	\
  					((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_alloc(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***************
*** 500,505 **** AllocSetContextCreate(MemoryContext parent,
--- 502,510 ----
  					 errdetail("Failed while creating memory context \"%s\".",
  							   name)));
  		}
+ 
+ 		update_alloc((MemoryContext) set, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***************
*** 590,595 **** AllocSetReset(MemoryContext context)
--- 595,602 ----
  		else
  		{
  			/* Normal case, release the block */
+ 			update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***************
*** 635,640 **** AllocSetDelete(MemoryContext context)
--- 642,654 ----
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
+ 
+ 		/*
+ 		 * Parent is already unlinked from this context, so we can't use
+ 		 * update_alloc(). The caller should have already subtracted from the
+ 		 * parents' total_allocated.
+ 		 */
+ 
  		free(block);
  		block = next;
  	}
***************
*** 672,677 **** AllocSetAlloc(MemoryContext context, Size size)
--- 686,694 ----
  		block = (AllocBlock) malloc(blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***************
*** 861,866 **** AllocSetAlloc(MemoryContext context, Size size)
--- 878,885 ----
  		if (block == NULL)
  			return NULL;
  
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***************
*** 964,969 **** AllocSetFree(MemoryContext context, void *pointer)
--- 983,991 ----
  			set->blocks = block->next;
  		else
  			prevblock->next = block->next;
+ 
+ 		update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***************
*** 1077,1082 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1099,1105 ----
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***************
*** 1094,1102 **** AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1117,1130 ----
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block->endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize - oldblksize);
+ 
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
  		/* Update pointers since block has likely been moved */
***************
*** 1361,1363 **** AllocSetCheck(MemoryContext context)
--- 1389,1407 ----
  }
  
  #endif   /* MEMORY_CONTEXT_CHECKING */
+ 
+ /*
+  * Update self_allocated and total_allocated for the context. Size can be
+  * positive or negative.
+  */
+ void
+ update_alloc(MemoryContext context, int64 size)
+ {
+ 	MemoryContext	parent;
+ 
+ 	context->self_allocated += size;
+ 
+ 	/* update total_allocated for self and all parents */
+ 	for (parent = context; parent != NULL; parent = parent->parent)
+ 		parent->total_allocated += size;
+ }
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 203,208 **** MemoryContextResetChildren(MemoryContext context)
--- 203,210 ----
  void
  MemoryContextDelete(MemoryContext context)
  {
+ 	MemoryContext parent;
+ 
  	AssertArg(MemoryContextIsValid(context));
  	/* We had better not be deleting TopMemoryContext ... */
  	Assert(context != TopMemoryContext);
***************
*** 223,229 **** MemoryContextDelete(MemoryContext context)
--- 225,236 ----
  	 * We delink the context from its parent before deleting it, so that if
  	 * there's an error we won't have deleted/busted contexts still attached
  	 * to the context tree.  Better a leak than a crash.
+ 	 *
+ 	 * But first we need to subtract this context's allocation from
+ 	 * total_allocated in the parent contexts.
  	 */
+ 	for(parent = context->parent; parent != NULL; parent = parent->parent)
+ 		parent->total_allocated -= context->total_allocated;
  	MemoryContextSetParent(context, NULL);
  
  	(*context->methods->delete_context) (context);
***************
*** 477,482 **** MemoryContextIsEmpty(MemoryContext context)
--- 484,504 ----
  }
  
  /*
+  * Find the memory allocated to blocks for this memory context. If total is
+  * true, also include descendants.
+  */
+ int64
+ MemoryContextMemAllocated(MemoryContext context, bool total)
+ {
+ 	AssertArg(MemoryContextIsValid(context));
+ 
+ 	if (total)
+ 		return context->total_allocated;
+ 	else
+ 		return context->self_allocated;
+ }
+ 
+ /*
   * MemoryContextStats
   *		Print statistics about the named context and all its descendants.
   *
***************
*** 636,641 **** MemoryContextCreate(NodeTag tag, Size size,
--- 658,665 ----
  	node->firstchild = NULL;
  	node->nextchild = NULL;
  	node->isReset = true;
+ 	node->self_allocated = 0;
+ 	node->total_allocated = 0;
  	node->name = ((char *) node) + size;
  	strcpy(node->name, name);
  
*** a/src/include/nodes/memnodes.h
--- b/src/include/nodes/memnodes.h
***************
*** 63,68 **** typedef struct MemoryContextData
--- 63,70 ----
  	MemoryContext nextchild;	/* next child of same parent */
  	char	   *name;			/* context name (just for debugging) */
  	MemoryContextCallback *reset_cbs;	/* list of reset/delete callbacks */
+ 	int64		self_allocated;	/* track memory allocated for this context */
+ 	int64		total_allocated;	/* this context plus all descendants */
  } MemoryContextData;
  
  /* utils/palloc.h contains typedef struct MemoryContextData *MemoryContext */
*** a/src/include/utils/memutils.h
--- b/src/include/utils/memutils.h
***************
*** 103,108 **** extern Size GetMemoryChunkSpace(void *pointer);
--- 103,109 ----
  extern MemoryContext GetMemoryChunkContext(void *pointer);
  extern MemoryContext MemoryContextGetParent(MemoryContext context);
  extern bool MemoryContextIsEmpty(MemoryContext context);
+ extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
  extern void MemoryContextStats(MemoryContext context);
  extern void MemoryContextAllowInCriticalSection(MemoryContext context,
  									bool allow);
#16 Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#15)
Re: Memory Accounting v11

On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:

After talking with a few people at PGCon, small noisy differences in CPU
timings can appear for almost any tweak to the code, and aren't
necessarily cause for major concern.

I agree with that in general, but the concern is a lot bigger when the
function is something that is called everywhere and accounts for a
measurable percentage of our total CPU usage on almost any workload.
If memory allocation got slower because, say, you added some code to
regexp.c and it caused AllocSetAlloc to split a cache line where it
hadn't previously, I wouldn't be worried about that; the next patch,
like as not, will buy the cost back again. But here you really are
adding code to a hot path.

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.
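
Something along these lines, say (a sketch; the function and its name
are made up):

/* estimate a transition value's size without memory-context hooks;
 * "internal" transition states are exactly the case this can't handle */
static Size
trans_value_size(Datum value, bool isnull, bool typByVal, int16 typLen)
{
    if (isnull || typByVal)
        return 0;					/* stored inline in the Datum */
    if (typLen == -1)
        return VARSIZE_ANY(DatumGetPointer(value));	/* varlena */
    return (Size) typLen;			/* fixed-length, by-reference */
}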

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#16)
Re: Memory Accounting v11

Hi,

On 07/14/2015 10:19 PM, Robert Haas wrote:

On Sat, Jul 11, 2015 at 2:28 AM, Jeff Davis <pgsql@j-davis.com> wrote:

After talking with a few people at PGCon, small noisy differences
in CPU timings can appear for almost any tweak to the code, and
aren't necessarily cause for major concern.

I agree with that in general, but the concern is a lot bigger when the
function is something that is called everywhere and accounts for a
measurable percentage of our total CPU usage on almost any workload.
If memory allocation got slower because, say, you added some code to
regexp.c and it caused AllocSetAlloc to split a cache line where it
hadn't previously, I wouldn't be worried about that; the next patch,
like as not, will buy the cost back again. But here you really are
adding code to a hot path.

I don't think Jeff was suggesting that we should ignore changes
measurably affecting performance - I'm one of those he discussed this
patch with at PGCon, and I can assure you the impact on performance was
one of the main topics of the discussion.

Firstly, do we really have good benchmarks and measurements? I really
doubt that. We do have some numbers for REINDEX, where we observed
0.5-1% regression on noisy results from a Power machine (and we've been
unable to reproduce that on x86). I don't think that's a representative
benchmark, and I'd like to see more thorough measurements. And I agreed
to do this, once Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by
the additional instructions, or by other things - say, random padding,
as was explained by Andrew Gierth in [1].

I don't know whether that's happening in this patch, but if it is, it
seems rather strange to use this against this patch and not the others
(because there surely will be other patches causing similar issues).

[1]: /messages/by-id/87vbitb2zp.fsf@news-spur.riddles.org.uk

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.

I respectfully disagree. Our current inability to dump/load the state
has little to do with how we measure consumed memory, IMHO.

It's true that we do have two kinds of aggregates, depending on the
nature of the aggregate state:

(a) fixed-size state (data types passed by value, variable length types
that do not grow once allocated, ...)

(b) continuously growing state (as in string_agg/array_agg)

Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a
solution for dump/load of the aggregate states - which we need to
implement anyway for parallel aggregate.

I know there was a proposal to force all aggregates to use regular data
types as aggregate states, but I can't see how that could work without a
significant performance penalty. For example array_agg() is using
internal to pass ArrayBuildState - how do you turn that into a regular
data type without effectively serializing/deserializing the whole array
on every transition?

And even if we come up with a solution for array_agg, do we really
believe it's possible to do for all custom aggregates? Maybe I'm missing
something but I doubt that. ISTM designing an ephemeral data structure
allows tweaks that are impossible otherwise.

What might be possible is extending the aggregate API with another
custom function returning the size of the aggregate state. So when
defining an aggregate using 'internal' for the aggregate state, you'd
specify transfunc, finalfunc and sizefunc. That seems doable, I guess.
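
For array_agg, such a sizefunc might look like this sketch
(ArrayBuildState is the existing struct from utils/array.h; the function
itself is hypothetical):

static Size
array_agg_state_size(ArrayBuildState *astate)
{
    /* the struct plus its allocated Datum/null arrays; by-reference
     * element data is not counted in this simplistic version */
    return sizeof(ArrayBuildState) +
           astate->alen * (sizeof(Datum) + sizeof(bool));
}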

I find the memory accounting a way more elegant solution, though.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#18 Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#16)
Re: Memory Accounting v11

On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Regards,
Jeff Davis


#19 Atri Sharma
atri.jiit@gmail.com
In reply to: Jeff Davis (#18)
Re: Memory Accounting v11

On Wed, Jul 15, 2015 at 12:57 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Regards,
Jeff Davis

I think a heuristic would be better suited here; ignoring memory
consumption for internal types means that we are not making the memory
accounting useful for a set of use cases.

--
Regards,

Atri
*l'apprenant*

#20 Jeff Davis
pgsql@j-davis.com
In reply to: Atri Sharma (#19)
Re: Memory Accounting v11

On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

I think a heuristic would be more suited here and ignoring memory
consumption for internal types means that we are not making the memory
accounting useful for a set of usecases.

OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Regards,
Jeff Davis


#21 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#17)
Re: Memory Accounting v11

On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Firstly, do we really have good benchmarks and measurements? I really doubt
that. We do have some numbers for REINDEX, where we observed 0.5-1%
regression on noisy results from a Power machine (and we've been unable to
reproduce that on x86). I don't think that's a representative benchmark, and
I'd like to see more thorough measurements. And I agreed to do this, once
Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by the
additional instructions, or by other things - say, random padding, as was
explained by Andrew Gierth in [1].

I don't know whether that's happening in this patch, but if it is, it seems
rather strange to use this against this patch and not the others (because
there surely will be other patches causing similar issues).

It matters, at least to me, an awful lot where we're adding lines of
code. If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really matter
to performance even if we add a significant chunk of overhead, and
we're doing other expensive stuff at the same time, like fsync. The
same can't be said about functions like LWLockAcquire and
AllocSetAlloc that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did
are sufficient reason to think that the abstract concern that putting
more code into a function might make it slower translates into a
measurable drop in performance in practice. But I think when it comes
to these very hot code paths, extreme conservatism is warranted. We
can agree to disagree about that.

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.
If the transition is a varlena or a fixed-length type, we can know how
much space it's consuming without hooking into the memory context
framework.

I respectfully disagree. Our current inability to dump/load the state has
little to do with how we measure consumed memory, IMHO.

It's true that we do have two kinds of aggregates, depending on the nature
of the aggregate state:

(a) fixed-size state (data types passed by value, variable-length types
that do not grow once allocated, ...)

(b) continuously growing state (as in string_agg/array_agg)

Jeff's HashAgg patch already fixes (a) and can fix (b) once we get a
solution for dump/load of the aggregate states - which we need to
implement anyway for parallel aggregate.

I know there was a proposal to force all aggregates to use regular data
types as aggregate state, but I can't see how that could work without a
significant performance penalty. For example, array_agg() is using
'internal' to pass ArrayBuildState - how do you turn that into a
regular data type without effectively serializing/deserializing the
whole array on every transition?

That is a good point. Tom suggested that his new expanded-format
stuff might be able to be adapted to the purpose, but I have no idea
how possible that really is.

And even if we come up with a solution for array_agg, do we really
believe it's possible to do for all custom aggregates? Maybe I'm
missing something, but I doubt that. ISTM designing an ephemeral data
structure allows tweaks that are impossible otherwise.

What might be possible is extending the aggregate API with another
custom function returning the size of the aggregate state. So when
defining an aggregate using 'internal' for the aggregate state, you'd
specify transfunc, finalfunc and sizefunc. That seems doable, I guess.

And infunc and outfunc. If we don't use the expanded-format stuff for
aggregates with a type-internal transition state, we won't be able to
use input and output functions to serialize and deserialize them,
either.
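
For array_agg, such a sizefunc might look roughly like the following.
To be clear, this is only a sketch of an API that doesn't exist today,
and every name in it is invented:

Datum
array_agg_sizefunc(PG_FUNCTION_ARGS)
{
    ArrayBuildState *astate = (ArrayBuildState *) PG_GETARG_POINTER(0);
    Size        sz = sizeof(ArrayBuildState);

    if (astate != NULL)
    {
        /* the dvalues/dnulls arrays grow as elements accumulate */
        sz += astate->alen * (sizeof(Datum) + sizeof(bool));
        /* pass-by-reference element data would need to be counted too */
    }

    PG_RETURN_INT64((int64) sz);
}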

I find the memory accounting as a way more elegant solution, though.

I think we're just going to have to agree to disagree on that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#18)
Re: Memory Accounting v11

On Wed, Jul 15, 2015 at 3:27 AM, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:

tuplesort.c does its own accounting, and TBH that seems like the right
thing to do here, too. The difficulty is, I think, that some
transition functions use an internal data type for the transition
state, which might not be a single palloc'd chunk. But since we can't
spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
"internal", or come up with some heuristic?

Either one would be OK with me. I'd probably ignore that for the
first version of the patch and just let the hash table grow without
bound in that case. Then in a later patch I'd introduce additional
infrastructure to deal with that part of the problem. But if
something else works out well while coding it up, I'd be OK with that,
too.
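
Hand-waving heavily, the check I have in mind is shaped like this -
none of these names exist anywhere, it's just an illustration:

/* enforce the memory bound only when the state size is measurable */
if (!agg_uses_internal_state && hashagg_mem_used > work_mem * 1024L)
    hashagg_spill_to_disk();
/* otherwise ('internal' states): let the hash table grow, for now */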

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#21)
Re: Memory Accounting v11

Hi,

On 07/15/2015 09:21 PM, Robert Haas wrote:

On Tue, Jul 14, 2015 at 9:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Firstly, do we really have good benchmarks and measurements? I really doubt
that. We do have some numbers for REINDEX, where we observed 0.5-1%
regression on noisy results from a Power machine (and we've been unable to
reproduce that on x86). I don't think that's a representative benchmark, and
I'd like to see more thorough measurements. And I agreed to do this, once
Jeff comes up with a new version of the patch.

Secondly, the question is whether the performance is impacted more by the
additional instructions, or by other things - say, random padding, as was
explained by Andrew Gierth in [1].

I don't know whether that's happening in this patch, but if it is,
it seems rather strange to use this against this patch and not the
others (because there surely will be other patches causing similar
issues).

It matters, at least to me, an awful lot where we're adding lines of
code. If you want to add modest amounts of additional code to CREATE
TABLE or CHECKPOINT or something like that, I really don't care,
because that stuff doesn't execute frequently enough to really
matter to performance even if we add a significant chunk of overhead,
and we're doing other expensive stuff at the same time, like fsync.
The same can't be said about functions like LWLockAcquire and
AllocSetAlloc that routinely show up at the top of CPU profiles.

I agree that there is room to question whether the benchmarks I did
are sufficient reason to think that the abstract concern that putting
more code into a function might make it slower translates into a
measurable drop in performance in practice. But I think when it comes
to these very hot code paths, extreme conservatism is warranted. We
can agree to disagree about that.

No, that is not what I tried to say. I certainly agree that we need to
pay attention when adding stuff to hot paths - there's no disagreement
about this.

The problem with random padding is that adding code somewhere may
introduce padding that affects other pieces of code. That is essentially
the point of Andrew's explanation that I linked in my previous message.

And the question is - are the differences we've measured really due to
code added to the hot path, or something introduced by random padding
due to some other changes (possibly in code that was not even executed
during the test)?

I don't know how significant an impact this may have in this case, or
how serious this is in general, but we're talking about a 0.5-1%
difference on a noisy benchmark. And if such cases of random padding
really are a problem in practice, we've certainly missed plenty of
other patches with the same impact.

Because effectively what Jeff's last patch did was add a single int64
counter to the MemoryContextData structure and increment it for each
allocated block (not chunk). I can't really imagine a solution making
it cheaper, because there are very few faster operations. Even an
"opt-in" design won't make this much faster, because you'll have to
check a flag, and you'll need two fields (flag + counter).
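
To illustrate, the two variants boil down to this ("track_mem" is a
made-up field name; "mem_allocated" is the counter from Jeff's patch):

/* unconditional accounting: one add per block obtained from malloc() */
context->mem_allocated += blksize;

/* opt-in accounting: an extra flag in the struct, plus a branch */
if (context->track_mem)
    context->mem_allocated += blksize;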

Of course, this assumes a "local counter" (i.e. not updating counters
in the parent contexts), but I claim that's OK. I had previously been
pushing for eagerly updating all the parent contexts, so that finding
out the allocated memory is fast even when there are many child
contexts - a prime example was array_agg() before I fixed it. But I
changed my mind on this and now say "screw them". I claim that
aggregates using a separate memory context for each group are a lost
cause - they already suffer a significant overhead, and should be fixed
just like we fixed array_agg().

That was effectively the outcome of pgcon discussions - to use the
simple int64 counter, do the accounting for all contexts, and update
only the local counter. For cases with many child contexts, finding
out the amount of allocated memory won't be cheap, but well - there's
not much we can do about that.
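
Getting the total then means walking the context tree, roughly like
this (a sketch of what MemoryContextMemAllocated() does when asked to
recurse; "mem_allocated" is the per-context counter from the patch):

static int64
mem_allocated_total(MemoryContext context)
{
    MemoryContext child;
    int64       total = context->mem_allocated;

    for (child = context->firstchild; child != NULL; child = child->nextchild)
        total += mem_allocated_total(child);

    return total;
}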

I know there was a proposal to force all aggregates to use regular
data types as aggregate state, but I can't see how that could work
without a significant performance penalty. For example, array_agg()
is using 'internal' to pass ArrayBuildState - how do you turn that
into a regular data type without effectively serializing/deserializing
the whole array on every transition?

That is a good point. Tom suggested that his new expanded-format
stuff might be able to be adapted to the purpose, but I have no idea
how possible that really is.

Me neither, and maybe there's a good solution for that, making my
concerns unfounded.

What might be possible is extending the aggregate API with another
custom function returning the size of the aggregate state. So when
defining an aggregate using 'internal' for the aggregate state, you'd
specify transfunc, finalfunc and sizefunc. That seems doable, I
guess.

And infunc and outfunc. If we don't use the expanded-format stuff for
aggregates with a type-internal transition state, we won't be able to
use input and output functions to serialize and deserialize them,
either.

Sure, that is indeed necessary for spilling the aggregates to disk. But
for aggregates with fixed-size state that's not necessary (Jeff's
HashAgg patch handles them fine), so I see this as a separate thing.

I find the memory accounting as a way more elegant solution, though.

I think we're just going to have to agree to disagree on that.

Sure, it's certainly a matter of personal taste.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Davis (#20)
Re: Memory Accounting v11

Hi,

On 07/15/2015 07:07 PM, Jeff Davis wrote:

On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

I think a heuristic would be better suited here; ignoring memory
consumption for internal types would mean that the memory accounting
is not useful for a whole set of use cases.

OK, I will drop this patch and proceed with the HashAgg patch, with
a heuristic for internal types.

Could someone briefly explain what heuristics we are talking about?

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeff Davis (#20)
Re: Memory Accounting v11

On 16 July 2015 at 05:07, Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

I think a heuristic would be better suited here; ignoring memory
consumption for internal types would mean that the memory accounting
is not useful for a whole set of use cases.

OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Should we mark the patch as "returned with feedback" in the commitfest app
then?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#26Jeff Davis
pgsql@j-davis.com
In reply to: David Rowley (#25)
Re: Memory Accounting v11

On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote:

Should we mark the patch as "returned with feedback" in the commitfest
app then?

I believe the memory accounting patch has been rejected. Instead, the
work will be done in the HashAgg patch.

Thank you for the review!

Regards,
Jeff Davis
