Memory Accounting
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
I ran the same tests as Robert did before [1] on my laptop [2]. The only
difference is that I also set max_parallel_workers[_per_gather]=0 to be
sure. I did 5 runs, alternating between memory-accounting and master,
and I got the following results for "elapsed" (as reported by
trace_sort):
regression=# select version, min(s), max(s), percentile_disc(0.5)
within group (order by s) as median, avg(s)::numeric(10,2) from tmp
group by version;
      version      |  min  |  max  | median |  avg
-------------------+-------+-------+--------+-------
 master            | 13.92 | 14.40 | 14.06  | 14.12
 memory accounting | 13.43 | 14.46 | 14.11  | 14.09
(2 rows)
So, I don't see any significant performance impact for this patch in
this test. That may be because:
* It was never really significant except on PPC64.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
* A lot of changes to sort have happened since that time, so perhaps
it's not a good test of memory contexts any more.
pgbench didn't show any slowdown either.
I also did another test with hash aggregation that uses significant
memory (t10m is a table with 10M distinct values and work_mem is 1GB):
postgres=# select (select (i, count(*)) from t10m group by i having
count(*) > n) from (values(1),(2),(3),(4),(5)) as s(n);
I didn't see any noticeable difference there, either.
Regards,
Jeff Davis
[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
[2]: Linux jdavis 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Attachments:
0001-Memory-accounting-at-block-level.patch (text/x-patch)
From d7c8587620aad080afc904b54a3ebd3232f39b7c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.
Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
src/backend/utils/mmgr/aset.c | 36 +++++++++++++++++++++++++++++
src/backend/utils/mmgr/generation.c | 18 +++++++++++----
src/backend/utils/mmgr/mcxt.c | 24 +++++++++++++++++++
src/backend/utils/mmgr/slab.c | 10 ++++++++
src/include/nodes/memnodes.h | 1 +
src/include/utils/memutils.h | 1 +
6 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6e4a343439..6e90493810 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated =
+ set->keeper->endptr - ((char *) set);
+
return (MemoryContext) set;
}
}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = firstBlockSize;
+
return (MemoryContext) set;
}
@@ -604,6 +609,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -612,6 +619,8 @@ AllocSetReset(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
/* Reset block size allocation sequence, too */
set->nextBlockSize = set->initBlockSize;
}
@@ -688,11 +697,16 @@ AllocSetDelete(MemoryContext context)
#endif
if (block != set->keeper)
+ {
+ context->mem_allocated -= block->endptr - ((char *) block);
free(block);
+ }
block = next;
}
+ Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +747,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +945,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1047,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1166,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1182,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1191,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1411,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1422,11 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ if (set->keeper == block)
+ total_allocated += block->endptr - ((char *) set);
+ else
+ total_allocated += block->endptr - ((char *) block);
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1513,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index eaacafb7be..e84599cc17 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -300,7 +300,7 @@ GenerationReset(MemoryContext context)
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->blksize);
#endif
-
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -352,6 +352,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
/* block with a single (used) chunk */
block->blksize = blksize;
block->nchunks = 1;
@@ -407,6 +409,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->blksize = blksize;
block->nchunks = 0;
block->nfree = 0;
@@ -522,6 +526,7 @@ GenerationFree(MemoryContext context, void *pointer)
if (set->block == block)
set->block = NULL;
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -743,9 +748,10 @@ GenerationStats(MemoryContext context,
static void
GenerationCheck(MemoryContext context)
{
- GenerationContext *gen = (GenerationContext *) context;
- const char *name = context->name;
- dlist_iter iter;
+ GenerationContext *gen = (GenerationContext *) context;
+ const char *name = context->name;
+ dlist_iter iter;
+ int64 total_allocated = 0;
/* walk all blocks in this context */
dlist_foreach(iter, &gen->blocks)
@@ -755,6 +761,8 @@ GenerationCheck(MemoryContext context)
nchunks;
char *ptr;
+ total_allocated += block->blksize;
+
/*
* nfree > nchunks is surely wrong, and we don't expect to see
* equality either, because such a block should have gotten freed.
@@ -833,6 +841,8 @@ GenerationCheck(MemoryContext context)
elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p does not match header %d",
name, nfree, block, block->nfree);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be12236..27417af548 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index e23669fb4f..046515e864 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -305,12 +305,14 @@ SlabReset(MemoryContext context)
#endif
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
}
slab->minFreeChunks = 0;
Assert(slab->nblocks == 0);
+ Assert(context->mem_allocated == 0);
}
/*
@@ -388,6 +390,7 @@ SlabAlloc(MemoryContext context, Size size)
slab->minFreeChunks = slab->chunksPerBlock;
slab->nblocks += 1;
+ context->mem_allocated += slab->blockSize;
}
/* grab the block from the freelist (even the new block is there) */
@@ -480,6 +483,9 @@ SlabAlloc(MemoryContext context, Size size)
#endif
SlabAllocInfo(slab, chunk);
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
+
return SlabChunkGetPointer(chunk);
}
@@ -555,11 +561,13 @@ SlabFree(MemoryContext context, void *pointer)
{
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
else
dlist_push_head(&slab->freelist[block->nfree], &block->node);
Assert(slab->nblocks >= 0);
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
/*
@@ -782,6 +790,8 @@ SlabCheck(MemoryContext context)
name, block->nfree, block, nfree);
}
}
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..df0ae3625c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e..6a837bc990 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
--
2.17.1
On 18/07/2019 21:24, Jeff Davis wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
Seems handy.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot of
sub-contexts during hash aggregation?
- Heikki
On Mon, Jul 22, 2019 at 11:30:37AM +0300, Heikki Linnakangas wrote:
On 18/07/2019 21:24, Jeff Davis wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
Seems handy.
Indeed.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot
of sub-contexts during hash aggregation?
There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2019-07-22 at 18:16 +0200, Tomas Vondra wrote:
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot
of sub-contexts during hash aggregation?
There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.
Right, I don't think memory-context-per-group is something we should
optimize for.
Discussion:
/messages/by-id/3839201.Nfa2RvcheX@techfox.foxi
/messages/by-id/5334D7A5.2000907@fuzzy.cz
Commit link:
Regards,
Jeff Davis
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.
Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?
I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?
--
Melanie Plageman
On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote:
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?
I think there's plenty of places where we quickly get into a state with
enough chunks in the freelist - the per-tuple contexts are a good
example of that, I think.
I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?
Yes, that might be a good candidate (and it would be much simpler than
the manual accounting we use now).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wanted to address a couple of questions Jeff asked me off-list about
Greenplum's implementations of Memory Accounting.
Greenplum has two memory accounting sub-systems -- one is the
MemoryContext-based system proposed here.
The other memory accounting system tracks "logical" memory owners in
global accounts. For example, every node in a plan has an account,
however, there are other execution contexts, such as the parser, which
have their own logical memory accounts.
Notably, this logical memory account system also tracks chunks instead
of blocks.
The rationale for tracking memory at the logical owner level was that
memory for a logical owner may allocate memory across multiple
contexts and a single context may contain memory belonging to several
of these logical owners.
More compellingly, many of the allocations done during execution are
done directly in the per query or per tuple context--as opposed to
being done in their own uniquely named context. Arguably, this is
because those logical owners (a Result node, for example) are not
memory-intensive and thus do not require specific memory accounting.
However, when debugging a memory leak or OOM, the specificity of
logical owner accounts was seen as desirable. A discrepancy between
memory allocated and memory freed in the per query context doesn't
provide a lot of hints as to the source of the leak.
At the least, there was no meaningful way to represent MemoryContext
account balances in EXPLAIN ANALYZE. Where would the TopMemoryContext
be represented, for example?
Also, by using logical accounts, each node in the plan could be
assigned a quota at plan time--because many memory intensive operators
will not have relinquished the memory they hold when other nodes are
executing (e.g. Materialize)--so, instead of granting each node
work_mem, work_mem is divided up into quotas for each operator in a
particular way. This was meant to pave the way for work_mem
enforcement. This is a topic that has come up in various ways in other
forums. For example, in the XPRS thread, the discussion of erroring
out for queries with no "escape mechanism" brought up by Thomas Munro
[1] (that thread focused more on a proposed session-level memory setting,
but it is still enforcement of a memory setting).
It was also discussed at PGCon this year in an unconference session on
OOM-detection and debugging, runaway query termination, and
session-level memory consumption tracking [2].
The motivation for tracking chunks instead of blocks was to understand
the "actual" memory consumption of different components in the
database. Then, eventually, memory consumption patterns would emerge
and improvements could be made to memory allocation strategies to suit
different use cases--perhaps other implementations of the
MemoryContext API similar to Slab and Generation were envisioned.
Apparently, it did lead to the discovery of some memory fragmentation
issues which were tuned.
I bring these up not just to answer Jeff's question but also to
provide some anecdotal evidence that the patch here is a good base for
other memory accounting and tracking schemes.
Even if HashAgg is the only initial consumer of the memory accounting
framework, we know that tuplesort can make use of it in its current
state as well. And, if another node or component requires
chunk-tracking, they could implement a different MemoryContext API
implementation which uses the MemoryContextData->mem_allocated field
to track chunks instead of blocks by tracking chunks in its alloc/free
functions.
Ideas like logical memory accounting could leverage the mem_allocated
field and build on top of it.
[1]: /messages/by-id/CA+hUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g@mail.gmail.com
[2]: https://wiki.postgresql.org/wiki/PgCon_2019_Developer_Unconference
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
Regards,
Jeff Davis
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
I think Heikki was asking about places with a lot of sub-contexts, which a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no? Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?
--
Michael
On Tue, Sep 24, 2019 at 02:21:40PM +0900, Michael Paquier wrote:
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
I think Heikki was asking about places with a lot of sub-contexts, which a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no? Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?
I don't think so.
Aggregates creating many memory contexts (context for each group) was
discussed extensively in the thread about v11 [1]/messages/by-id/1434311039.4369.39.camel@jeff-desktop in 2015. And back then
the conclusion was that that's a pretty awful pattern anyway, as it uses
much more memory (no cross-context freelists), and has various other
issues. In a way, those aggregates are wrong and should be fixed just
like we fixed array_agg/string_agg (even without the memory accounting).
The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.
(Note: Another factor affecting the lazy vs. eager efficiency is the
number of palloc/pfree calls vs. calls to determine amount of mem used,
but that's mostly orthogonal and we can ignore it here).
So I think the approach Jeff ended up with is sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.
Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).
[1]: /messages/by-id/1434311039.4369.39.camel@jeff-desktop
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
We didn't observe any other problems with the patch and agree with the
approach. It is a good start.
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
Thanks,
Soumyadeep & Melanie
On Tue, Sep 24, 2019 at 02:05:51PM +0200, Tomas Vondra wrote:
The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.
Okay.
So I think the approach Jeff ended up with sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).
Sounds good to me. Though I have not looked at the patch in details,
the arguments are sensible. Thanks for confirming.
--
Michael
On Tue, Sep 24, 2019 at 11:46:49AM -0700, Melanie Plageman wrote:
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
We didn't observe any other problems with the patch and agree with the
approach. It is a good start.
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
I agree that would be nice, but I don't have access to any Power machine
suitable for this kind of benchmarking :-( Robert, any chance you still
have access to that machine?
It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).
I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'wrong I'll stick to my plan to get this committed soon.
regards
[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
It's worth mentioning that those benchmarks (I'm assuming we're
talking about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong I'll stick to my plan to get this committed soon.
That was my conclusion, as well.
Regards,
Jeff Davis
On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
It's worth mentioning that those benchmarks (I'm assuming we're
talking about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong I'll stick to my plan to get this committed soon.

That was my conclusion, as well.
I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:
1) aset.c (AllocSetDelete)
--------------------------
#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->freeptr - ((char *) block));
#endif

    if (block != set->keeper)
    {
        context->mem_allocated -= block->endptr - ((char *) block);
        free(block);
    }
2) generation.c (GenerationReset)
---------------------------------
#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->blksize);
#endif

    context->mem_allocated -= block->blksize;
Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.
I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.
Aside from that, I've repeated the REINDEX benchmarks done by Robert in
[1], and saw no difference. Both machines are x86_64; I don't have access to
any Power machine at the moment, unfortunately.

[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote:
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:
Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.
Fixed.
I tested with some custom use of memory contexts. The reason
AllocSetDelete() didn't fail before is that most memory contexts use
the free lists (the list of free memory contexts, not the free list of
chunks), so you need to specify a non-default minsize in order to
prevent that and trigger the bug.
AllocSetReset() worked, but it was reading the header of the keeper
block after wiping the contents of the keeper block. It technically
worked, because the header of the keeper block was not wiped, but it
seems more clear to explicitly save the size of the keeper block. In
AllocSetDelete(), saving the keeper size is required, because it wipes
the block headers in addition to the contents.
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.

Added.
Regards,
Jeff Davis
Attachments:
0001-Memory-accounting-at-block-level.patch (text/x-patch)
From 7184df8a928d6c0c6e41d171968a54d46f256a22 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.
Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
src/backend/utils/mmgr/README | 4 +++
src/backend/utils/mmgr/aset.c | 38 +++++++++++++++++++++++++++++
src/backend/utils/mmgr/generation.c | 19 ++++++++++++---
src/backend/utils/mmgr/mcxt.c | 24 ++++++++++++++++++
src/backend/utils/mmgr/slab.c | 10 ++++++++
src/include/nodes/memnodes.h | 1 +
src/include/utils/memutils.h | 1 +
7 files changed, 93 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
index 7e6541d0dee..bd5fa6d8e53 100644
--- a/src/backend/utils/mmgr/README
+++ b/src/backend/utils/mmgr/README
@@ -23,6 +23,10 @@ The basic operations on a memory context are:
* reset a context (free all memory allocated in the context, but not the
context object itself)
+* inquire about the total amount of memory allocated to the context
+ (the raw memory from which the context allocates chunks; not the
+ chunks themselves)
+
Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6b63d6f85d0..90f370570fe 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated =
+ set->keeper->endptr - ((char *) set);
+
return (MemoryContext) set;
}
}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = firstBlockSize;
+
return (MemoryContext) set;
}
@@ -566,6 +571,7 @@ AllocSetReset(MemoryContext context)
{
AllocSet set = (AllocSet) context;
AllocBlock block;
+ Size keepersize = set->keeper->endptr - ((char *) set);
AssertArg(AllocSetIsValid(set));
@@ -604,6 +610,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -612,6 +620,8 @@ AllocSetReset(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == keepersize);
+
/* Reset block size allocation sequence, too */
set->nextBlockSize = set->initBlockSize;
}
@@ -628,6 +638,7 @@ AllocSetDelete(MemoryContext context)
{
AllocSet set = (AllocSet) context;
AllocBlock block = set->blocks;
+ Size keepersize = set->keeper->endptr - ((char *) set);
AssertArg(AllocSetIsValid(set));
@@ -683,6 +694,9 @@ AllocSetDelete(MemoryContext context)
{
AllocBlock next = block->next;
+ if (block != set->keeper)
+ context->mem_allocated -= block->endptr - ((char *) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -693,6 +707,8 @@ AllocSetDelete(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == keepersize);
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +749,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +947,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1049,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1168,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1184,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1193,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1413,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1424,11 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ if (set->keeper == block)
+ total_allocated += block->endptr - ((char *) set);
+ else
+ total_allocated += block->endptr - ((char *) block);
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1515,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index eaacafb7be5..f1aed7e9384 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -297,10 +297,11 @@ GenerationReset(MemoryContext context)
dlist_delete(miter.cur);
+ context->mem_allocated -= block->blksize;
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->blksize);
#endif
-
free(block);
}
@@ -352,6 +353,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
/* block with a single (used) chunk */
block->blksize = blksize;
block->nchunks = 1;
@@ -407,6 +410,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->blksize = blksize;
block->nchunks = 0;
block->nfree = 0;
@@ -522,6 +527,7 @@ GenerationFree(MemoryContext context, void *pointer)
if (set->block == block)
set->block = NULL;
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -743,9 +749,10 @@ GenerationStats(MemoryContext context,
static void
GenerationCheck(MemoryContext context)
{
- GenerationContext *gen = (GenerationContext *) context;
- const char *name = context->name;
- dlist_iter iter;
+ GenerationContext *gen = (GenerationContext *) context;
+ const char *name = context->name;
+ dlist_iter iter;
+ int64 total_allocated = 0;
/* walk all blocks in this context */
dlist_foreach(iter, &gen->blocks)
@@ -755,6 +762,8 @@ GenerationCheck(MemoryContext context)
nchunks;
char *ptr;
+ total_allocated += block->blksize;
+
/*
* nfree > nchunks is surely wrong, and we don't expect to see
* equality either, because such a block should have gotten freed.
@@ -833,6 +842,8 @@ GenerationCheck(MemoryContext context)
elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p does not match header %d",
name, nfree, block, block->nfree);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be122369..27417af548d 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index 700a91a2a37..50deb354c28 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -305,12 +305,14 @@ SlabReset(MemoryContext context)
#endif
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
}
slab->minFreeChunks = 0;
Assert(slab->nblocks == 0);
+ Assert(context->mem_allocated == 0);
}
/*
@@ -388,6 +390,7 @@ SlabAlloc(MemoryContext context, Size size)
slab->minFreeChunks = slab->chunksPerBlock;
slab->nblocks += 1;
+ context->mem_allocated += slab->blockSize;
}
/* grab the block from the freelist (even the new block is there) */
@@ -480,6 +483,9 @@ SlabAlloc(MemoryContext context, Size size)
#endif
SlabAllocInfo(slab, chunk);
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
+
return SlabChunkGetPointer(chunk);
}
@@ -555,11 +561,13 @@ SlabFree(MemoryContext context, void *pointer)
{
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
else
dlist_push_head(&slab->freelist[block->nfree], &block->node);
Assert(slab->nblocks >= 0);
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
/*
@@ -782,6 +790,8 @@ SlabCheck(MemoryContext context)
name, block->nfree, block, nfree);
}
}
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9f..df0ae3625cb 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e2..6a837bc9902 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
--
2.17.1
On Mon, Sep 30, 2019 at 01:34:13PM -0700, Jeff Davis wrote:
OK, looks good to me now.
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.

Added.
Well, I meant a couple of paragraphs explaining the motivation and the
relevant trade-offs considered. So I've written a brief summary of the
design as I understand it and pushed it like that. Of course, if you
could proof-read it, that'd be good.
I had a bit of a hard time deciding who to list as a reviewer - this
patch started sometime in ~2015, and it was initially discussed as part
of the larger hashagg effort, with plenty of people discussing various
ancient versions of the patch. In the end I've included just people from
the current thread. If that omits important past reviews, I'm sorry.
For the record, results of the benchmarks I've done over the past couple
of days are in [1]. It includes both the reindex benchmark used by
Robert in 2015, and a regular read-only pgbench. In general, the
overhead of the accounting is pretty much indistinguishable from noise
(at least on those two machines).
In any case, thanks for the perseverance working on this.
[1]: https://github.com/tvondra/memory-accounting-benchmarks
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).
I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with
2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:
context->mem_allocated += blksize - oldblksize;
to be executed in situations where blksize < oldblksize, where before
that was not possible. Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case. If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.
(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist. (I speak from experience.) I do not think
we need to design Postgres to allow for that.
Likewise, there's no evident value in allowing mem_allocated
to be negative.
I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.
regards, tom lane
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).
Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.
I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with

2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:

context->mem_allocated += blksize - oldblksize;

to be executed in situations where blksize < oldblksize, where before
that was not possible. Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case. If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.

(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
Interesting failure mode (especially that it does *not* fail on some
32-bit machines).
I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist. (I speak from experience.) I do not think
we need to design Postgres to allow for that.

Likewise, there's no evident value in allowing mem_allocated
to be negative.

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.
I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 04, 2019 at 10:26:44AM +0200, Tomas Vondra wrote:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.

I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.
I've pushed a fix changing the type to Size, and splitting the
mem_allocated change into two separate updates (to prevent any
underflows in the subtraction).
Hopefully this fixes the 32-bit machines ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).

Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.
I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
Yeah, I think that's an oversight. Maybe there's a reason why Jeff
used int64, but I can't think of any.
I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.
Well, you could also argue it the other way: maybe in our children's
time, int64 won't be as wide as Size. (Yeah, I know that sounds
ridiculous, but needing pointers wider than 32 bits was a ridiculous
idea too when I started in this business.)
The committed fix seems OK to me except that I think you should've
also changed MemoryContextMemAllocated() to return Size.
regards, tom lane
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:
context->mem_allocated += blksize - oldblksize;
to be executed in situations where blksize < oldblksize, where before
that was not possible.
...
(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
Interesting failure mode (especially that it does *not* fail on some
32-bit machines).
Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
It's also now apparent that lapwing and locust were failing only
sometimes, as well. I totally don't understand why that failure
would've been only partially reproducible. Maybe we should dig
a bit harder, rather than just deciding that we fixed it.
regards, tom lane
I wrote:
Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
It's also now apparent that lapwing and locust were failing only
sometimes, as well. I totally don't understand why that failure
would've been only partially reproducible. Maybe we should dig
a bit harder, rather than just deciding that we fixed it.
I did that digging, reproducing the problem on florican's host
(again intermittently). Here's a stack trace from the spot where
we sometimes downsize a large chunk:
#5 0x0851c70a in AllocSetRealloc (context=0x31b35000, pointer=0x319e5020,
size=1224) at aset.c:1158
#6 0x085232eb in repalloc (pointer=0x319e5020, size=1224) at mcxt.c:1082
#7 0x31b69591 in resize_intArrayType (num=300, a=0x319e5020)
at _int_tool.c:268
#8 resize_intArrayType (a=0x319e5020, num=300) at _int_tool.c:250
#9 0x31b6995d in _int_unique (r=0x319e5020) at _int_tool.c:329
#10 0x31b66a00 in g_int_union (fcinfo=0xffbfcc5c) at _int_gist.c:146
#11 0x084ff9c4 in FunctionCall2Coll (flinfo=0x319e2bb4, collation=100,
arg1=834250780, arg2=4290759884) at fmgr.c:1162
#12 0x080db3a3 in gistMakeUnionItVec (giststate=0x319e2820, itvec=0x31bae4a4,
len=15, attr=0xffbfce5c, isnull=0xffbfcedc) at gistutil.c:204
#13 0x080e410d in gistunionsubkeyvec (giststate=giststate@entry=0x319e2820,
itvec=itvec@entry=0x31bb5ed4, gsvp=gsvp@entry=0xffbfcd4c) at gistsplit.c:64
#14 0x080e416f in gistunionsubkey (giststate=giststate@entry=0x319e2820,
itvec=itvec@entry=0x31bb5ed4, spl=spl@entry=0xffbfce3c) at gistsplit.c:91
#15 0x080e4456 in gistSplitByKey (r=<optimized out>, page=<optimized out>,
itup=<optimized out>, len=<optimized out>, giststate=<optimized out>,
v=<optimized out>, attno=<optimized out>) at gistsplit.c:689
#16 0x080d8797 in gistSplit (r=0x31bbb424, page=0x297e0b80 "",
itup=0x31bb5ed4, len=16, giststate=0x319e2820) at gist.c:1432
#17 0x080d8d9c in gistplacetopage (rel=<optimized out>,
freespace=<optimized out>, giststate=<optimized out>,
buffer=<optimized out>, itup=<optimized out>, ntup=<optimized out>,
oldoffnum=<optimized out>, newblkno=<optimized out>,
leftchildbuf=<optimized out>, splitinfo=<optimized out>,
markfollowright=<optimized out>, heapRel=<optimized out>,
is_build=<optimized out>) at gist.c:299
So the potential downsize is expected, triggered by _int_unique
being able to remove some duplicate entries from a GIST union key.
AFAICT the fact that it happens only intermittently must boil down
to the randomized insertion choices that gistchoose() sometimes makes.
In short: nothing to see here, move along.
regards, tom lane
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.
I wonder if we should make that use ssize_t instead. Probably
not worth the trouble.
regards, tom lane
On Tue, Sep 24, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
I still have access to a POWER machine, but it's currently being used
by somebody else for a benchmarking project, so I can't test this
immediately.
It's probably worth noting that, in addition to whatever's changed in
this patch, tuplesort.c's memory management has been altered
significantly since 2015 - see
0011c0091e886b874e485a46ff2c94222ffbf550 and
e94568ecc10f2638e542ae34f2990b821bbf90ac. I'm not immediately sure how
that would affect the REINDEX case that I tested back then, but it
seems at least possible that they would have the effect of making
palloc overhead less significant. More generally, so much of the
sorting machinery has been overhauled by Peter Geoghegan since then
that what happens now may just not be very comparable to what happened
back then.
I do agree that this approach looks pretty lightweight. Tomas's point
about the difference between updating only the current context and
updating all parent contexts seems right on target to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company