Memory Accounting
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
I ran the same tests as Robert did before [1] on my laptop [2]. The only
difference is that I also set max_parallel_workers[_per_gather]=0 to be
sure. I did 5 runs, alternating between memory-accounting and master,
and I got the following results for "elapsed" (as reported by
trace_sort):
regression=# select version, min(s), max(s), percentile_disc(0.5)
within group (order by s) as median, avg(s)::numeric(10,2) from tmp
group by version;
      version      |  min  |  max  | median |  avg
-------------------+-------+-------+--------+-------
 master            | 13.92 | 14.40 | 14.06  | 14.12
 memory accounting | 13.43 | 14.46 | 14.11  | 14.09
(2 rows)
So, I don't see any significant performance impact for this patch in
this test. That may be because:
* It was never really significant except on PPC64.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
* A lot of changes to sort have happened since that time, so perhaps
it's not a good test of memory contexts any more.
pgbench didn't show any slowdown either.
I also did another test with hash aggregation that uses significant
memory (t10m is a table with 10M distinct values and work_mem is 1GB):
postgres=# select (select (i, count(*)) from t10m group by i having
count(*) > n) from (values(1),(2),(3),(4),(5)) as s(n);
I didn't see any noticeable difference there, either.
Regards,
Jeff Davis
[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
[2]: Linux jdavis 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Attachments:
0001-Memory-accounting-at-block-level.patch (text/x-patch)
From d7c8587620aad080afc904b54a3ebd3232f39b7c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.
Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
src/backend/utils/mmgr/aset.c | 36 +++++++++++++++++++++++++++++
src/backend/utils/mmgr/generation.c | 18 +++++++++++----
src/backend/utils/mmgr/mcxt.c | 24 +++++++++++++++++++
src/backend/utils/mmgr/slab.c | 10 ++++++++
src/include/nodes/memnodes.h | 1 +
src/include/utils/memutils.h | 1 +
6 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6e4a343439..6e90493810 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated =
+ set->keeper->endptr - ((char *) set);
+
return (MemoryContext) set;
}
}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = firstBlockSize;
+
return (MemoryContext) set;
}
@@ -604,6 +609,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -612,6 +619,8 @@ AllocSetReset(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
/* Reset block size allocation sequence, too */
set->nextBlockSize = set->initBlockSize;
}
@@ -688,11 +697,16 @@ AllocSetDelete(MemoryContext context)
#endif
if (block != set->keeper)
+ {
+ context->mem_allocated -= block->endptr - ((char *) block);
free(block);
+ }
block = next;
}
+ Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +747,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +945,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1047,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1166,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1182,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1191,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1411,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1422,11 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ if (set->keeper == block)
+ total_allocated += block->endptr - ((char *) set);
+ else
+ total_allocated += block->endptr - ((char *) block);
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1513,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index eaacafb7be..e84599cc17 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -300,7 +300,7 @@ GenerationReset(MemoryContext context)
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->blksize);
#endif
-
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -352,6 +352,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
/* block with a single (used) chunk */
block->blksize = blksize;
block->nchunks = 1;
@@ -407,6 +409,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->blksize = blksize;
block->nchunks = 0;
block->nfree = 0;
@@ -522,6 +526,7 @@ GenerationFree(MemoryContext context, void *pointer)
if (set->block == block)
set->block = NULL;
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -743,9 +748,10 @@ GenerationStats(MemoryContext context,
static void
GenerationCheck(MemoryContext context)
{
- GenerationContext *gen = (GenerationContext *) context;
- const char *name = context->name;
- dlist_iter iter;
+ GenerationContext *gen = (GenerationContext *) context;
+ const char *name = context->name;
+ dlist_iter iter;
+ int64 total_allocated = 0;
/* walk all blocks in this context */
dlist_foreach(iter, &gen->blocks)
@@ -755,6 +761,8 @@ GenerationCheck(MemoryContext context)
nchunks;
char *ptr;
+ total_allocated += block->blksize;
+
/*
* nfree > nchunks is surely wrong, and we don't expect to see
* equality either, because such a block should have gotten freed.
@@ -833,6 +841,8 @@ GenerationCheck(MemoryContext context)
elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p does not match header %d",
name, nfree, block, block->nfree);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be12236..27417af548 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index e23669fb4f..046515e864 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -305,12 +305,14 @@ SlabReset(MemoryContext context)
#endif
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
}
slab->minFreeChunks = 0;
Assert(slab->nblocks == 0);
+ Assert(context->mem_allocated == 0);
}
/*
@@ -388,6 +390,7 @@ SlabAlloc(MemoryContext context, Size size)
slab->minFreeChunks = slab->chunksPerBlock;
slab->nblocks += 1;
+ context->mem_allocated += slab->blockSize;
}
/* grab the block from the freelist (even the new block is there) */
@@ -480,6 +483,9 @@ SlabAlloc(MemoryContext context, Size size)
#endif
SlabAllocInfo(slab, chunk);
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
+
return SlabChunkGetPointer(chunk);
}
@@ -555,11 +561,13 @@ SlabFree(MemoryContext context, void *pointer)
{
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
else
dlist_push_head(&slab->freelist[block->nfree], &block->node);
Assert(slab->nblocks >= 0);
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
/*
@@ -782,6 +790,8 @@ SlabCheck(MemoryContext context)
name, block->nfree, block, nfree);
}
}
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..df0ae3625c 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e..6a837bc990 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
--
2.17.1
On 18/07/2019 21:24, Jeff Davis wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
Seems handy.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot of
sub-contexts during hash aggregation?
- Heikki
On Mon, Jul 22, 2019 at 11:30:37AM +0300, Heikki Linnakangas wrote:
On 18/07/2019 21:24, Jeff Davis wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
Seems handy.
Indeed.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot
of sub-contexts during hash aggregation?
There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2019-07-22 at 18:16 +0200, Tomas Vondra wrote:
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
Is that OK for memory bounded hash aggregation? Might there be a lot
of sub-contexts during hash aggregation?
There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.
Right, I don't think memory-context-per-group is something we should
optimize for.
Discussion:
/messages/by-id/3839201.Nfa2RvcheX@techfox.foxi
/messages/by-id/5334D7A5.2000907@fuzzy.cz
Commit link:
Regards,
Jeff Davis
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.
Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?
I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?
--
Melanie Plageman
On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote:
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
Previous discussion:
/messages/by-id/1407012053.15301.53.camel@jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?
I think there's plenty of places where we quickly get into a state with
enough chunks in the freelist - the per-tuple contexts are a good
example of that, I think.
I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?
Yes, that might be a good candidate (and it would be much simpler than
the manual accounting we use now).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wanted to address a couple of questions Jeff asked me off-list about
Greenplum's implementations of Memory Accounting.
Greenplum has two memory accounting sub-systems -- one is the
MemoryContext-based system proposed here.
The other memory accounting system tracks "logical" memory owners in
global accounts. For example, every node in a plan has an account,
however, there are other execution contexts, such as the parser, which
have their own logical memory accounts.
Notably, this logical memory account system also tracks chunks instead
of blocks.
The rationale for tracking memory at the logical owner level was that
memory for a logical owner may allocate memory across multiple
contexts and a single context may contain memory belonging to several
of these logical owners.
More compellingly, many of the allocations done during execution are
done directly in the per query or per tuple context--as opposed to
being done in their own uniquely named context. Arguably, this is
because those logical owners (a Result node, for example) are not
memory-intensive and thus do not require specific memory accounting.
However, when debugging a memory leak or OOM, the specificity of
logical owner accounts was seen as desirable. A discrepancy between
memory allocated and memory freed in the per query context doesn't
provide a lot of hints as to the source of the leak.
At the least, there was no meaningful way to represent MemoryContext
account balances in EXPLAIN ANALYZE. Where would the TopMemoryContext
be represented, for example?
Also, by using logical accounts, each node in the plan could be
assigned a quota at plan time--because many memory intensive operators
will not have relinquished the memory they hold when other nodes are
executing (e.g. Materialize)--so, instead of granting each node
work_mem, work_mem is divided up into quotas for each operator in a
particular way. This was meant to pave the way for work_mem
enforcement. This is a topic that has come up in various ways in other
forums. For example, in the XPRS thread, the discussion of erroring
out for queries with no "escape mechanism" brought up by Thomas Munro
[1] (that thread focused more on a proposed session-level memory setting,
but it is still enforcement of a memory setting).
It was also discussed at PGCon this year in an unconference session on
OOM-detection and debugging, runaway query termination, and
session-level memory consumption tracking [2].
The motivation for tracking chunks instead of blocks was to understand
the "actual" memory consumption of different components in the
database. Then, eventually, memory consumption patterns would emerge
and improvements could be made to memory allocation strategies to suit
different use cases--perhaps other implementations of the
MemoryContext API similar to Slab and Generation were envisioned.
Apparently, it did lead to the discovery of some memory fragmentation
issues which were tuned.
I bring these up not just to answer Jeff's question but also to
provide some anecdotal evidence that the patch here is a good base for
other memory accounting and tracking schemes.
Even if HashAgg is the only initial consumer of the memory accounting
framework, we know that tuplesort can make use of it in its current
state as well. And, if another node or component requires
chunk-tracking, they could implement a different MemoryContext API
implementation which uses the MemoryContextData->mem_allocated field
to track chunks instead of blocks by tracking chunks in its alloc/free
functions.
Ideas like logical memory accounting could leverage the mem_allocated
field and build on top of it.
[1]: /messages/by-id/CA+hUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g@mail.gmail.com
[2]: https://wiki.postgresql.org/wiki/PgCon_2019_Developer_Unconference
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
Regards,
Jeff Davis
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
I think Heikki was asking about places with a lot of sub-contexts, which a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no? Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?
--
Michael
On Tue, Sep 24, 2019 at 02:21:40PM +0900, Michael Paquier wrote:
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
I think Heikki was asking about places with a lot of sub-contexts, which a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).
The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no? Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?
I don't think so.
Aggregates creating many memory contexts (context for each group) was
discussed extensively in the thread about v11 [1]/messages/by-id/1434311039.4369.39.camel@jeff-desktop in 2015. And back then
the conclusion was that that's a pretty awful pattern anyway, as it uses
much more memory (no cross-context freelists), and has various other
issues. In a way, those aggregates are wrong and should be fixed just
like we fixed array_agg/string_agg (even without the memory accounting).
The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.
(Note: Another factor affecting the lazy vs. eager efficiency is the
number of palloc/pfree calls vs. calls to determine amount of mem used,
but that's mostly orthogonal and we can ignore it here).
So I think the approach Jeff ended up with is sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.
Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).
[1]: /messages/by-id/1434311039.4369.39.camel@jeff-desktop
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
We didn't observe any other problems with the patch and agree with the
approach. It is a good start.
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
Thanks,
Soumyadeep & Melanie
On Tue, Sep 24, 2019 at 02:05:51PM +0200, Tomas Vondra wrote:
The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.
Okay.
So I think the approach Jeff ended up with sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).
Sounds good to me. Though I have not looked at the patch in details,
the arguments are sensible. Thanks for confirming.
--
Michael
On Tue, Sep 24, 2019 at 11:46:49AM -0700, Melanie Plageman wrote:
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
2MB                     82    302.715     427.554   1.4123978
3MB                   3474    567.829     896.143   1.578191674
7.67MB                8694   2657.972    4903.063   1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
We didn't observe any other problems with the patch and agree with the
approach. It is a good start.
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
I agree that would be nice, but I don't have access to any Power machine
suitable for this kind of benchmarking :-( Robert, any chance you still
have access to that machine?
It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).
I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'wrong I'll stick to my plan to get this committed soon.
regards
[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
It's worth mentioning that those benchmarks (I'm assuming we're
talking about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong I'll stick to my plan to get this committed soon.
That was my conclusion, as well.
Regards,
Jeff Davis
On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
It's worth mentioning that those benchmarks (I'm assuming we're
talking about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong I'll stick to my plan to get this committed soon.

That was my conclusion, as well.
I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:
1) aset.c (AllocSetDelete)
--------------------------
#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->freeptr - ((char *) block));
#endif

    if (block != set->keeper)
    {
        context->mem_allocated -= block->endptr - ((char *) block);
        free(block);
    }
2) generation.c (GenerationReset)
---------------------------------
#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->blksize);
#endif

    context->mem_allocated -= block->blksize;
Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.
I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.
Aside from that, I've repeated the REINDEX benchmarks done by Robert in
[1], and saw no difference. Both machines are x86_64; I don't have access to
any Power machine at the moment, unfortunately.

[1]: /messages/by-id/CA+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajP3Cro9dBRaA@mail.gmail.com
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote:
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:
Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.
Fixed.
I tested with some custom use of memory contexts. The reason
AllocSetDelete() didn't fail before is that most memory contexts use
the free lists (the list of free memory contexts, not the free list of
chunks), so you need to specify a non-default minsize in order to
prevent that and trigger the bug.
AllocSetReset() worked, but it was reading the header of the keeper
block after wiping the contents of the keeper block. It technically
worked, because the header of the keeper block was not wiped, but it
seems more clear to explicitly save the size of the keeper block. In
AllocSetDelete(), saving the keeper size is required, because it wipes
the block headers in addition to the contents.
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.

Added.
Regards,
Jeff Davis
Attachments:
0001-Memory-accounting-at-block-level.patch (text/x-patch)
From 7184df8a928d6c0c6e41d171968a54d46f256a22 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.
Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
src/backend/utils/mmgr/README | 4 +++
src/backend/utils/mmgr/aset.c | 38 +++++++++++++++++++++++++++++
src/backend/utils/mmgr/generation.c | 19 ++++++++++++---
src/backend/utils/mmgr/mcxt.c | 24 ++++++++++++++++++
src/backend/utils/mmgr/slab.c | 10 ++++++++
src/include/nodes/memnodes.h | 1 +
src/include/utils/memutils.h | 1 +
7 files changed, 93 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
index 7e6541d0dee..bd5fa6d8e53 100644
--- a/src/backend/utils/mmgr/README
+++ b/src/backend/utils/mmgr/README
@@ -23,6 +23,10 @@ The basic operations on a memory context are:
* reset a context (free all memory allocated in the context, but not the
context object itself)
+* inquire about the total amount of memory allocated to the context
+ (the raw memory from which the context allocates chunks; not the
+ chunks themselves)
+
Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6b63d6f85d0..90f370570fe 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated =
+ set->keeper->endptr - ((char *) set);
+
return (MemoryContext) set;
}
}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
parent,
name);
+ ((MemoryContext) set)->mem_allocated = firstBlockSize;
+
return (MemoryContext) set;
}
@@ -566,6 +571,7 @@ AllocSetReset(MemoryContext context)
{
AllocSet set = (AllocSet) context;
AllocBlock block;
+ Size keepersize = set->keeper->endptr - ((char *) set);
AssertArg(AllocSetIsValid(set));
@@ -604,6 +610,8 @@ AllocSetReset(MemoryContext context)
else
{
/* Normal case, release the block */
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -612,6 +620,8 @@ AllocSetReset(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == keepersize);
+
/* Reset block size allocation sequence, too */
set->nextBlockSize = set->initBlockSize;
}
@@ -628,6 +638,7 @@ AllocSetDelete(MemoryContext context)
{
AllocSet set = (AllocSet) context;
AllocBlock block = set->blocks;
+ Size keepersize = set->keeper->endptr - ((char *) set);
AssertArg(AllocSetIsValid(set));
@@ -683,6 +694,9 @@ AllocSetDelete(MemoryContext context)
{
AllocBlock next = block->next;
+ if (block != set->keeper)
+ context->mem_allocated -= block->endptr - ((char *) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -693,6 +707,8 @@ AllocSetDelete(MemoryContext context)
block = next;
}
+ Assert(context->mem_allocated == keepersize);
+
/* Finally, free the context header, including the keeper block */
free(set);
}
@@ -733,6 +749,9 @@ AllocSetAlloc(MemoryContext context, Size size)
block = (AllocBlock) malloc(blksize);
if (block == NULL)
return NULL;
+
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = block->endptr = ((char *) block) + blksize;
@@ -928,6 +947,8 @@ AllocSetAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
@@ -1028,6 +1049,9 @@ AllocSetFree(MemoryContext context, void *pointer)
set->blocks = block->next;
if (block->next)
block->next->prev = block->prev;
+
+ context->mem_allocated -= block->endptr - ((char*) block);
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
@@ -1144,6 +1168,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
+ Size oldblksize;
/*
* Try to verify that we have a sane block pointer: it should
@@ -1159,6 +1184,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
/* Do the realloc */
chksize = MAXALIGN(size);
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ oldblksize = block->endptr - ((char *)block);
+
block = (AllocBlock) realloc(block, blksize);
if (block == NULL)
{
@@ -1166,6 +1193,9 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
return NULL;
}
+
+ context->mem_allocated += blksize - oldblksize;
+
block->freeptr = block->endptr = ((char *) block) + blksize;
/* Update pointers since block has likely been moved */
@@ -1383,6 +1413,7 @@ AllocSetCheck(MemoryContext context)
const char *name = set->header.name;
AllocBlock prevblock;
AllocBlock block;
+ int64 total_allocated = 0;
for (prevblock = NULL, block = set->blocks;
block != NULL;
@@ -1393,6 +1424,11 @@ AllocSetCheck(MemoryContext context)
long blk_data = 0;
long nchunks = 0;
+ if (set->keeper == block)
+ total_allocated += block->endptr - ((char *) set);
+ else
+ total_allocated += block->endptr - ((char *) block);
+
/*
* Empty block - empty can be keeper-block only
*/
@@ -1479,6 +1515,8 @@ AllocSetCheck(MemoryContext context)
elog(WARNING, "problem in alloc set %s: found inconsistent memory block %p",
name, block);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index eaacafb7be5..f1aed7e9384 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -297,10 +297,11 @@ GenerationReset(MemoryContext context)
dlist_delete(miter.cur);
+ context->mem_allocated -= block->blksize;
+
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->blksize);
#endif
-
free(block);
}
@@ -352,6 +353,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
/* block with a single (used) chunk */
block->blksize = blksize;
block->nchunks = 1;
@@ -407,6 +410,8 @@ GenerationAlloc(MemoryContext context, Size size)
if (block == NULL)
return NULL;
+ context->mem_allocated += blksize;
+
block->blksize = blksize;
block->nchunks = 0;
block->nfree = 0;
@@ -522,6 +527,7 @@ GenerationFree(MemoryContext context, void *pointer)
if (set->block == block)
set->block = NULL;
+ context->mem_allocated -= block->blksize;
free(block);
}
@@ -743,9 +749,10 @@ GenerationStats(MemoryContext context,
static void
GenerationCheck(MemoryContext context)
{
- GenerationContext *gen = (GenerationContext *) context;
- const char *name = context->name;
- dlist_iter iter;
+ GenerationContext *gen = (GenerationContext *) context;
+ const char *name = context->name;
+ dlist_iter iter;
+ int64 total_allocated = 0;
/* walk all blocks in this context */
dlist_foreach(iter, &gen->blocks)
@@ -755,6 +762,8 @@ GenerationCheck(MemoryContext context)
nchunks;
char *ptr;
+ total_allocated += block->blksize;
+
/*
* nfree > nchunks is surely wrong, and we don't expect to see
* equality either, because such a block should have gotten freed.
@@ -833,6 +842,8 @@ GenerationCheck(MemoryContext context)
elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p does not match header %d",
name, nfree, block, block->nfree);
}
+
+ Assert(total_allocated == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b07be122369..27417af548d 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -462,6 +462,29 @@ MemoryContextIsEmpty(MemoryContext context)
return context->methods->is_empty(context);
}
+/*
+ * Find the memory allocated to blocks for this memory context. If recurse is
+ * true, also include children.
+ */
+int64
+MemoryContextMemAllocated(MemoryContext context, bool recurse)
+{
+ int64 total = context->mem_allocated;
+
+ AssertArg(MemoryContextIsValid(context));
+
+ if (recurse)
+ {
+ MemoryContext child = context->firstchild;
+ for (child = context->firstchild;
+ child != NULL;
+ child = child->nextchild)
+ total += MemoryContextMemAllocated(child, true);
+ }
+
+ return total;
+}
+
/*
* MemoryContextStats
* Print statistics about the named context and all its descendants.
@@ -736,6 +759,7 @@ MemoryContextCreate(MemoryContext node,
node->methods = methods;
node->parent = parent;
node->firstchild = NULL;
+ node->mem_allocated = 0;
node->prevchild = NULL;
node->name = name;
node->ident = NULL;
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index 700a91a2a37..50deb354c28 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -305,12 +305,14 @@ SlabReset(MemoryContext context)
#endif
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
}
slab->minFreeChunks = 0;
Assert(slab->nblocks == 0);
+ Assert(context->mem_allocated == 0);
}
/*
@@ -388,6 +390,7 @@ SlabAlloc(MemoryContext context, Size size)
slab->minFreeChunks = slab->chunksPerBlock;
slab->nblocks += 1;
+ context->mem_allocated += slab->blockSize;
}
/* grab the block from the freelist (even the new block is there) */
@@ -480,6 +483,9 @@ SlabAlloc(MemoryContext context, Size size)
#endif
SlabAllocInfo(slab, chunk);
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
+
return SlabChunkGetPointer(chunk);
}
@@ -555,11 +561,13 @@ SlabFree(MemoryContext context, void *pointer)
{
free(block);
slab->nblocks--;
+ context->mem_allocated -= slab->blockSize;
}
else
dlist_push_head(&slab->freelist[block->nfree], &block->node);
Assert(slab->nblocks >= 0);
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
/*
@@ -782,6 +790,8 @@ SlabCheck(MemoryContext context)
name, block->nfree, block, nfree);
}
}
+
+ Assert(slab->nblocks * slab->blockSize == context->mem_allocated);
}
#endif /* MEMORY_CONTEXT_CHECKING */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9f..df0ae3625cb 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
/* these two fields are placed here to minimize alignment wastage: */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
+ int64 mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ffe6de536e2..6a837bc9902 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -82,6 +82,7 @@ extern void MemoryContextSetParent(MemoryContext context,
extern Size GetMemoryChunkSpace(void *pointer);
extern MemoryContext MemoryContextGetParent(MemoryContext context);
extern bool MemoryContextIsEmpty(MemoryContext context);
+extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
extern void MemoryContextStats(MemoryContext context);
extern void MemoryContextStatsDetail(MemoryContext context, int max_children);
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
--
2.17.1
On Mon, Sep 30, 2019 at 01:34:13PM -0700, Jeff Davis wrote:
OK, looks good to me now.
Oh, and one more thing - this probably needs to add at least some basic
explanation of the accounting to src/backend/mmgr/README.

Added.
Well, I meant a couple of paragraphs explaining the motivation and the
relevant trade-offs considered. So I've written a brief summary of the
design as I understand it and pushed it like that. Of course, if you
could proof-read it, that'd be good.
I had a bit of a hard time deciding who to list as a reviewer - this
patch started sometime in ~2015, and it was initially discussed as part
of the larger hashagg effort, with plenty of people discussing various
ancient versions of the patch. In the end I've included just people from
the current thread. If that omits important past reviews, I'm sorry.
For the record, results of the benchmarks I've done over the past couple
of days are in [1]. It includes both the reindex benchmark used by
Robert in 2015, and a regular read-only pgbench. In general, the
overhead of the accounting is pretty much indistinguishable from noise
(at least on those two machines).
In any case, thanks for the perseverance working on this.
[1]: https://github.com/tvondra/memory-accounting-benchmarks
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).
I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with
2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:
context->mem_allocated += blksize - oldblksize;
to be executed in situations where blksize < oldblksize, where before
that was not possible. Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case. If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.
(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist. (I speak from experience.) I do not think
we need to design Postgres to allow for that.
Likewise, there's no evident value in allowing mem_allocated
to be negative.
I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.
regards, tom lane
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).
Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.
I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with

2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:

context->mem_allocated += blksize - oldblksize;

to be executed in situations where blksize < oldblksize, where before
that was not possible. Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case. If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.

(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
Interesting failure mode (especially that it does *not* fail on some
32-bit machines).
I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist. (I speak from experience.) I do not think
we need to design Postgres to allow for that.

Likewise, there's no evident value in allowing mem_allocated
to be negative.

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.
I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 04, 2019 at 10:26:44AM +0200, Tomas Vondra wrote:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.

I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.
I've pushed a fix changing the type to Size, and splitting the
mem_allocated change into two separate updates (to prevent any
underflows in the subtraction).
Hopefully this fixes the 32-bit machines ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64? That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere. I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).

Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.
I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
Yeah, I think that's an oversight. Maybe there's a reason why Jeff
used int64, but I can't think of any.
I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.
Well, you could also argue it the other way: maybe in our children's
time, int64 won't be as wide as Size. (Yeah, I know that sounds
ridiculous, but needing pointers wider than 32 bits was a ridiculous
idea too when I started in this business.)
The committed fix seems OK to me except that I think you should've
also changed MemoryContextMemAllocated() to return Size.
regards, tom lane
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:
context->mem_allocated += blksize - oldblksize;
to be executed in situations where blksize < oldblksize, where before
that was not possible.
...
(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)
Interesting failure mode (especially that it does *not* fail on some
32-bit machines).
Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
It's also now apparent that lapwing and locust were failing only
sometimes, as well. I totally don't understand why that failure
would've been only partially reproducible. Maybe we should dig
a bit harder, rather than just deciding that we fixed it.
regards, tom lane
I wrote:
Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
It's also now apparent that lapwing and locust were failing only
sometimes, as well. I totally don't understand why that failure
would've been only partially reproducible. Maybe we should dig
a bit harder, rather than just deciding that we fixed it.
I did that digging, reproducing the problem on florican's host
(again intermittently). Here's a stack trace from the spot where
we sometimes downsize a large chunk:
#5 0x0851c70a in AllocSetRealloc (context=0x31b35000, pointer=0x319e5020,
size=1224) at aset.c:1158
#6 0x085232eb in repalloc (pointer=0x319e5020, size=1224) at mcxt.c:1082
#7 0x31b69591 in resize_intArrayType (num=300, a=0x319e5020)
at _int_tool.c:268
#8 resize_intArrayType (a=0x319e5020, num=300) at _int_tool.c:250
#9 0x31b6995d in _int_unique (r=0x319e5020) at _int_tool.c:329
#10 0x31b66a00 in g_int_union (fcinfo=0xffbfcc5c) at _int_gist.c:146
#11 0x084ff9c4 in FunctionCall2Coll (flinfo=0x319e2bb4, collation=100,
arg1=834250780, arg2=4290759884) at fmgr.c:1162
#12 0x080db3a3 in gistMakeUnionItVec (giststate=0x319e2820, itvec=0x31bae4a4,
len=15, attr=0xffbfce5c, isnull=0xffbfcedc) at gistutil.c:204
#13 0x080e410d in gistunionsubkeyvec (giststate=giststate@entry=0x319e2820,
itvec=itvec@entry=0x31bb5ed4, gsvp=gsvp@entry=0xffbfcd4c) at gistsplit.c:64
#14 0x080e416f in gistunionsubkey (giststate=giststate@entry=0x319e2820,
itvec=itvec@entry=0x31bb5ed4, spl=spl@entry=0xffbfce3c) at gistsplit.c:91
#15 0x080e4456 in gistSplitByKey (r=<optimized out>, page=<optimized out>,
itup=<optimized out>, len=<optimized out>, giststate=<optimized out>,
v=<optimized out>, attno=<optimized out>) at gistsplit.c:689
#16 0x080d8797 in gistSplit (r=0x31bbb424, page=0x297e0b80 "",
itup=0x31bb5ed4, len=16, giststate=0x319e2820) at gist.c:1432
#17 0x080d8d9c in gistplacetopage (rel=<optimized out>,
freespace=<optimized out>, giststate=<optimized out>,
buffer=<optimized out>, itup=<optimized out>, ntup=<optimized out>,
oldoffnum=<optimized out>, newblkno=<optimized out>,
leftchildbuf=<optimized out>, splitinfo=<optimized out>,
markfollowright=<optimized out>, heapRel=<optimized out>,
is_build=<optimized out>) at gist.c:299
So the potential downsize is expected, triggered by _int_unique
being able to remove some duplicate entries from a GIST union key.
AFAICT the fact that it happens only intermittently must boil down
to the randomized insertion choices that gistchoose() sometimes makes.
In short: nothing to see here, move along.
regards, tom lane
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.
I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.
I wonder if we should make that use ssize_t instead. Probably
not worth the trouble.
regards, tom lane
On Tue, Sep 24, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
I still have access to a POWER machine, but it's currently being used
by somebody else for a benchmarking project, so I can't test this
immediately.
It's probably worth noting that, in addition to whatever's changed in
this patch, tuplesort.c's memory management has been altered
significantly since 2015 - see
0011c0091e886b874e485a46ff2c94222ffbf550 and
e94568ecc10f2638e542ae34f2990b821bbf90ac. I'm not immediately sure how
that would affect the REINDEX case that I tested back then, but it
seems at least possible that they would have the effect of making
palloc overhead less significant. More generally, so much of the
sorting machinery has been overhauled by Peter Geoghegan since then
that what happens now may just not be very comparable to what happened
back then.
I do agree that this approach looks pretty lightweight. Tomas's point
about the difference between updating only the current context and
updating all parent contexts seems right on target to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company