Memory leak from ExecutorState context?
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
* shared_buffers: 8GB
* work_mem: 64MB
* effective_cache_size: 24GB
* random/seq_page_cost are by default
* physical memory: 32GB
The query is really large and actually updates a kind of materialized view.
The customer records the plans of this query on a regular basis. The explain
analyze of this query before running out of memory was:
https://explain.depesz.com/s/sGOH
The customer is aware they should rewrite this query to optimize it, but that's a
long process they cannot start immediately. To make it run in the meantime, they
moved the top CTE to a dedicated table. In their experience, it's not the first
time they have had to split a query this way to make it work.
I've been able to run this query on a standby myself. I ran "call
MemoryContextStats(TopMemoryContext)" every 10s during a run; see the parsed data
(best viewed with "less -S") and the associated graph in attachment. It shows:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the end
(400s at the last record)
* ALL other contexts are stable before 240s, except ExecutorState
* ExecutorState keeps rising up to 13GB without interruption until memory
exhaustion
I did another run with an interactive gdb session (see the messy session log in
attachment, for what it's worth). Looking at some backtraces during the memory
inflation close to the end of the query, all of them had these frames in
common:
[...]
#6 0x0000000000621ffc in ExecHashJoinImpl (parallel=false, pstate=0x31a3378)
at nodeHashjoin.c:398 [...]
...which is not really helpful but at least, it seems to come from a hash join
node or some other hash related code. See the gdb session log for more details.
After the out of memory error, pmap of this process shows:
430: postgres: postgres <dbname> [local] EXPLAIN
Address Kbytes RSS Dirty Mode Mapping
[...]
0000000002c5e000 13719620 8062376 8062376 rw--- [ anon ]
[...]
Is it usual for a backend to request such a large amount of memory (13 GB) and
actually use less than 60% of it (7.7GB of RSS)?
Sadly, the database is 1.5TB large and I cannot test on a newer major version.
I did not try to check how large a data set would be required to reproduce
this, but it moves tens of millions of rows across multiple tables anyway...
Any idea? How could I help to get a better idea of whether a leak is actually
occurring, and where exactly?
Regards,
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
hash joins could exceed work_mem until v13:
|Allow hash aggregation to use disk storage for large aggregation result
|sets (Jeff Davis)
|
|Previously, hash aggregation was avoided if it was expected to use more
|than work_mem memory. Now, a hash aggregation plan can be chosen despite
|that. The hash table will be spilled to disk if it exceeds work_mem
|times hash_mem_multiplier.
|
|This behavior is normally preferable to the old behavior, in which once
|hash aggregation had been chosen, the hash table would be kept in memory
|no matter how large it got — which could be very large if the planner
|had misestimated. If necessary, behavior similar to that can be obtained
|by increasing hash_mem_multiplier.
This shows multiple plan nodes underestimating the row counts by factors
of ~50,000, which could lead to the issue fixed in v13.
I think you should try to improve the estimates, which might improve
other queries in addition to this one, and maybe avoid the issue with
joins.
The customer is aware he should rewrite this query to optimize it, but it's a
long time process he can not start immediately. To make it run in the meantime,
he actually removed the top CTE to a dedicated table.
Is the table analyzed?
Is it usual a backend is requesting such large memory size (13 GB) and
actually use less of 60% of it (7.7GB of RSS)?
It's possible it's "using less" simply because the memory is not available. Is
the process swapping?
--
Justin
On 2/28/23 19:25, Justin Pryzby wrote:
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
hash joins could exceed work_mem until v13:
|Allow hash aggregation to use disk storage for large aggregation result
|sets (Jeff Davis)
|
That's hash aggregate, not hash join.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
* shared_buffers: 8GB
* work_mem: 64MB
* effective_cache_size: 24GB
* random/seq_page_cost are by default
* physical memory: 32GB
The query is really large and actually updates a kind of materialized view.
The customer records the plans of this query on a regular basis. The explain
analyze of this query before running out of memory was:
https://explain.depesz.com/s/sGOH
The customer is aware they should rewrite this query to optimize it, but that's a
long process they cannot start immediately. To make it run in the meantime, they
moved the top CTE to a dedicated table. In their experience, it's not the first
time they have had to split a query this way to make it work.
I've been able to run this query on a standby myself. I ran "call
MemoryContextStats(TopMemoryContext)" every 10s during a run; see the parsed data
(best viewed with "less -S") and the associated graph in attachment. It shows:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the end
(400s at the last record)
That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory there? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?
Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets are there?
* ALL other contexts are stable before 240s, except ExecutorState
* ExecutorState keeps rising up to 13GB without interruption until memory
exhaustion
I did another run with an interactive gdb session (see the messy session log in
attachment, for what it's worth). Looking at some backtraces during the memory
inflation close to the end of the query, all of them had these frames in
common:
[...]
#6 0x0000000000621ffc in ExecHashJoinImpl (parallel=false, pstate=0x31a3378)
at nodeHashjoin.c:398 [...]
...which is not really helpful but at least, it seems to come from a hash join
node or some other hash related code. See the gdb session log for more details.
After the out of memory error, pmap of this process shows:
430: postgres: postgres <dbname> [local] EXPLAIN
Address Kbytes RSS Dirty Mode Mapping
[...]
0000000002c5e000 13719620 8062376 8062376 rw--- [ anon ]
[...]
Is it usual for a backend to request such a large amount of memory (13 GB) and
actually use less than 60% of it (7.7GB of RSS)?
No idea. Interpreting this info is pretty tricky, in my experience. It
might mean the memory is no longer used but sbrk couldn't return it to
the OS yet, or something like that.
Sadly, the database is 1.5TB large and I cannot test on a newer major version.
I did not try to check how large a data set would be required to reproduce
this, but it moves tens of millions of rows across multiple tables anyway...
Any idea? How could I help to get a better idea of whether a leak is actually
occurring, and where exactly?
Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.
What I'd try is this:
1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like
break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
commands
print MemoryChunkGetPointer(chunk) size
cont
end
2) do the same for AllocSetFree()
3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.
4) Break AllocSetAlloc on those leaked sizes, to determine where the
calls come from.
This usually gives enough info about the leak, or at least allows focusing
the investigation on a particular area of code.
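Step 3 can be scripted once the gdb log is parsed. Below is a minimal sketch of the matching pass, assuming the parser has already reduced the log to (kind, pointer, size) events; the event format and the leaked_sizes helper are illustrative, not actual tooling from this thread:

```python
# Sketch of step 3: match the alloc/free events extracted from the gdb log
# by pointer address, then build a histogram of the sizes that were never
# freed.  The (kind, ptr, size) event format is hypothetical -- whatever
# your log parser produces -- not actual gdb output.
from collections import Counter

def leaked_sizes(events):
    """events: iterable of ("alloc", ptr, size) or ("free", ptr, None)."""
    live = {}                     # ptr -> size of last alloc not yet freed
    for kind, ptr, size in events:
        if kind == "alloc":
            live[ptr] = size      # an alloc at a live address just overwrites
        else:
            live.pop(ptr, None)   # ignore frees of unknown pointers
    # sizes still live at the end of the log: the leak suspects
    return Counter(live.values())

events = [
    ("alloc", 0x1000, 64),
    ("alloc", 0x2000, 8192),
    ("free",  0x1000, None),
    ("alloc", 0x3000, 8192),      # never freed
    ("alloc", 0x4000, 8192),      # never freed
]
print(leaked_sizes(events))       # -> Counter({8192: 3})
```

Usually one or two dominant sizes pop out; breaking on AllocSetAlloc conditioned on those sizes (step 4) then shows where the calls come from.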
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Justin,
On Tue, 28 Feb 2023 12:25:08 -0600
Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this
situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
hash joins could exceed work_mem until v13:
Yes, I am aware of this. But as far as I understand Tom Lane's explanations in
the discussion mentioned upthread, it should not be ExecutorState.
ExecutorState (13GB) is at least ten times bigger than any other context,
including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
aggregate is walking toward the wall because of bad estimation, but something
else is racing way faster to the wall. And presently it might be something
related to some JOIN node.
About your other points, you are right, there are numerous things we could do to
improve this query, and our customer is considering them as well. It's just a
matter of time now.
But in the meantime, we are facing a query with memory behavior that seems
suspect. Following the 4-year-old thread I mentioned, my goal is to inspect
and provide all possible information to make sure it's either "normal" behavior
or something that might/should be fixed.
Thank you for your help!
On 3/1/23 10:46, Jehan-Guillaume de Rorthais wrote:
Hi Justin,
On Tue, 28 Feb 2023 12:25:08 -0600
Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
Hello all,
A customer is facing an out-of-memory error with a query that looks similar to this
situation:
/messages/by-id/12064.1555298699@sss.pgh.pa.us
This PostgreSQL version is 11.18. Some settings:
hash joins could exceed work_mem until v13:
Yes, I am aware of this. But as far as I understand Tom Lane's explanations in
the discussion mentioned upthread, it should not be ExecutorState.
ExecutorState (13GB) is at least ten times bigger than any other context,
including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
aggregate is walking toward the wall because of bad estimation, but something
else is racing way faster to the wall. And presently it might be something
related to some JOIN node.
I still don't understand why this would be due to a hash aggregate. That
should not allocate memory in ExecutorState at all. And HashBatchContext
(which is the one bloated) is used by hashjoin, so the issue is likely
somewhere in that area.
About your other points, you are right, there are numerous things we could do to
improve this query, and our customer is considering them as well. It's just a
matter of time now.
But in the meantime, we are facing a query with memory behavior that seems
suspect. Following the 4-year-old thread I mentioned, my goal is to inspect
and provide all possible information to make sure it's either "normal" behavior
or something that might/should be fixed.
It'd be interesting to see if the gdb stuff I suggested yesterday yields
some interesting info.
Furthermore, I realized the plan you posted yesterday may not be the one
used for the failing query. It'd be interesting to see what plan is used
for the case that actually fails. Can you at least run EXPLAIN on it? Or
alternatively, if the query is already running and eating a lot of memory,
attach gdb and print the plan in ExecutorStart:
set print elements 0
p nodeToString(queryDesc->plannedstmt->planTree)
Thinking about this, I have one suspicion. Hashjoins try to fit into
work_mem by increasing the number of batches - when a batch gets too
large, we double the number of batches (and split the batch in two, to
reduce its size). But if there are a lot of tuples for a particular key
(or at least for one hash value), we quickly run into work_mem and keep
adding more and more batches.
The problem with this theory is that the batches are allocated in
HashTableContext, and that doesn't grow very much. And the 1.4GB
HashBatchContext is used for buckets - but we should not allocate that
many, because we cap that to nbuckets_optimal (see 30d7ae3c76). And it
does not explain the ExecutorState bloat either.
Nevertheless, it'd be interesting to see the hashtable parameters:
p *hashtable
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Tomas,
On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the end
(400s at the last record)
That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory there? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?
Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets are there?
I did this test this morning.
Batches and buckets increased really quickly to 1048576/1048576.
ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands of
calls, always short-circuited at 1048576, I guess because of the conditional
block «/* safety check to avoid overflow */» appearing early in this function.
I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make the
query run faster. Re-enabling it at 19.1GB of memory consumption, it stayed
silent until memory exhaustion (around 21 or 22GB, I don't remember exactly).
The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and one last time close to the end of the query execution.
Any idea? How could I help to get a better idea of whether a leak is actually
occurring, and where exactly?
Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.
What I'd try is this:
1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like
break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
commands
print MemoryChunkGetPointer(chunk) size
cont
end
2) do the same for AllocSetFree()
3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.
So here is what I ended up with this afternoon, using files, lines and macros
from REL_11_18:
set logging on
set pagination off
break aset.c:781 if strcmp("ExecutorState",context.name) == 0
commands 1
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break aset.c:820 if strcmp("ExecutorState",context.name) == 0
commands 2
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break aset.c:979 if strcmp("ExecutorState",context.name) == 0
commands 3
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break AllocSetFree if strcmp("ExecutorState",context.name) == 0
commands 4
print pointer
cont
end
So far, gdb has had more than 3h of CPU time and is eating 2.4GB of memory. The
backend has had only 3'32" of CPU time:
VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2727284 2.4g 17840 R 99.0 7.7 181:25.07 gdb
9054688 220648 103056 t 1.3 0.7 3:32.05 postmaster
Interestingly, the RES memory of the backend has not exploded yet, but VIRT is
already high.
I suppose the query will run for some more hours, hopefully, gdb will not
exhaust the memory in the meantime...
You'll find some intermediate stats I already collected in attachment:
* breaks 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning, before
the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
of non-freed memory is currently 5488MB
I couldn't print "size" as it is optimized away, which is why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?
The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.
Stay tuned...
Thank you!
Attachments:
allocs-tmp-stats.txt.gz (application/gzip)
On Wed, 1 Mar 2023 18:48:40 +0100
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
...
You'll find some intermediate stats I already collected in attachment:
* breaks 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning, before
the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
of non-freed memory is currently 5488MB
I forgot one stat. I don't know if this is expected, normal or not, but 53
chunks have been allocated at an existing address that was not freed beforehand.
Regards,
On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
Hi Tomas,
On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the end
(400s at the last record)
That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory there? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?
Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets are there?
I did this test this morning.
Batches and buckets increased really quickly to 1048576/1048576.
OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
fillfactor 1.0).
But 1M batches? I guess that could be problematic. It doesn't seem like
much, but we need 1M files on each side - 1M for the hash table, 1M for
the outer relation. That's 16MB of pointers, but the files are BufFiles
and we keep an 8kB buffer for each of them. That's ~16GB right there :-(
In practice it probably won't be that bad, because not all files will be
allocated/opened concurrently (especially if this is due to many tuples
having the same value). Assuming that's what's happening here, ofc.
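For the record, the arithmetic above can be double-checked quickly (a sketch; it only assumes the default 8kB BufFile buffer and 8-byte pointers):

```python
# Quick check of the numbers above: bucket array size, per-tuple budget,
# and worst-case BufFile overhead for 1M batches (8kB buffer per file,
# 8-byte pointers -- default build assumptions).
work_mem = 64 * 1024 * 1024            # 64MB
nbuckets = nbatches = 1_048_576        # observed via gdb

bucket_array = nbuckets * 8            # one pointer per bucket
per_tuple = (work_mem - bucket_array) // nbuckets
print(bucket_array // (1024 * 1024), "MB bucket array")        # -> 8 MB
print(per_tuple, "B per tuple at fillfactor 1.0")              # -> 56 B

pointers = 2 * nbatches * 8            # inner + outer file pointer arrays
buffers = 2 * nbatches * 8192          # one 8kB buffer per open BufFile
print(pointers // (1024 * 1024), "MB of pointers")             # -> 16 MB
print(buffers // (1024 ** 3), "GB of buffers if all files are open")  # -> 16 GB
```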
ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands of
calls, always short-circuited at 1048576, I guess because of the conditional
block «/* safety check to avoid overflow */» appearing early in this function.
Hmmm, that's a bit weird, no? I mean, the check is
/* safety check to avoid overflow */
if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
return;
Why would it stop at 1048576? It certainly is not higher than INT_MAX/2
and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
happening here?
I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make the
query run faster. Re-enabling it at 19.1GB of memory consumption, it stayed
silent until memory exhaustion (around 21 or 22GB, I don't remember exactly).
The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and one last time close to the end of the query execution.
Any idea? How could I help to get a better idea of whether a leak is actually
occurring, and where exactly?
Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.
What I'd try is this:
1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like
break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
commands
print MemoryChunkGetPointer(chunk) size
cont
end
2) do the same for AllocSetFree()
3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.
So here is what I ended up with this afternoon, using files, lines and macros
from REL_11_18:
set logging on
set pagination off
break aset.c:781 if strcmp("ExecutorState",context.name) == 0
commands 1
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break aset.c:820 if strcmp("ExecutorState",context.name) == 0
commands 2
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break aset.c:979 if strcmp("ExecutorState",context.name) == 0
commands 3
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end
break AllocSetFree if strcmp("ExecutorState",context.name) == 0
commands 4
print pointer
cont
end
So far, gdb has had more than 3h of CPU time and is eating 2.4GB of memory. The
backend has had only 3'32" of CPU time:
VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2727284 2.4g 17840 R 99.0 7.7 181:25.07 gdb
9054688 220648 103056 t 1.3 0.7 3:32.05 postmaster
Interestingly, the RES memory of the backend has not exploded yet, but VIRT is
already high.
I suppose the query will run for some more hours; hopefully, gdb will not
exhaust the memory in the meantime...
You'll find some intermediate stats I already collected in attachment:
* breaks 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning, before
the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
of non-freed memory is currently 5488MB
I couldn't print "size" as it is optimized away, which is why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?
There's definitely something wrong. The size should not be 0, and
neither should it be > 1GB. I suspect it's because some of the variables
get optimized out, and gdb just uses some nonsense :-(
I guess you'll need to debug the individual breakpoints and see what's
available. It probably depends on the compiler version, etc. For example
I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
can print the chunk pointer with a bit of arithmetic:
p (block->freeptr - chunk_size)
I suppose similar gymnastics could work for the other breakpoints.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote:
On Wed, 1 Mar 2023 18:48:40 +0100
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
...
You'll find some intermediate stats I already collected in attachment:
* breaks 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning, before
the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
of non-freed memory is currently 5488MB
I forgot one stat. I don't know if this is expected, normal or not, but 53
chunks have been allocated at an existing address that was not freed beforehand.
It's likely the chunk was freed by repalloc() rather than by pfree() directly.
Or maybe the whole context got destroyed/reset, in which case we don't
free individual chunks. But that's unlikely for ExecutorState.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 1 Mar 2023 20:34:08 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote:
On Wed, 1 Mar 2023 18:48:40 +0100
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
...
You'll find some intermediate stats I already collected in attachment:
* breaks 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning,
before the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
of non-freed memory is currently 5488MB
I forgot one stat. I don't know if this is expected, normal or not, but 53
chunks have been allocated at an existing address that was not freed beforehand.
It's likely the chunk was freed by repalloc() rather than by pfree() directly.
Or maybe the whole context got destroyed/reset, in which case we don't
free individual chunks. But that's unlikely for ExecutorState.
Well, as all breakpoints were conditional on ExecutorState, I suppose it
might be repalloc then.
Regards,
Hi,
On Wed, 1 Mar 2023 20:29:11 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the
end (400s at the last record)
That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory there? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?
Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets are there?
I did this test this morning.
Batches and buckets increased really quickly to 1048576/1048576.
OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
fillfactor 1.0).
But 1M batches? I guess that could be problematic. It doesn't seem like
much, but we need 1M files on each side - 1M for the hash table, 1M for
the outer relation. That's 16MB of pointers, but the files are BufFiles
and we keep an 8kB buffer for each of them. That's ~16GB right there :-(
In practice it probably won't be that bad, because not all files will be
allocated/opened concurrently (especially if this is due to many tuples
having the same value). Assuming that's what's happening here, ofc.
And I suppose they are closed/freed concurrently as well?
ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands
of calls, always short-circuited at 1048576, I guess because of the
conditional block «/* safety check to avoid overflow */» appearing early in
this function.
Hmmm, that's a bit weird, no? I mean, the check is
/* safety check to avoid overflow */
if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
return;
Why would it stop at 1048576? It certainly is not higher than INT_MAX/2
and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
happening here?
Indeed, not a good suspect. But what about this other short-cut then?
/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;
[...]
/*
* If we dumped out either all or none of the tuples in the table, disable
* further expansion of nbatch. This situation implies that we have
* enough tuples of identical hashvalues to overflow spaceAllowed.
* Increasing nbatch will not fix it since there's no way to subdivide the
* group any more finely. We have to just gut it out and hope the server
* has enough RAM.
*/
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
#ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
hashtable);
#endif
}
If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect, and I can surely track it
from gdb.
Finding which fields are involved in the join could help as well, to check
or gather some stats about them, but I haven't had time to dig into this yet...
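The "all or none" condition in the quoted block can be illustrated with a toy model: a batch split moves a tuple only when one extra hash bit is set, and tuples sharing a single hash value always keep or flip that bit together. This is only a sketch of the mechanism (split_batch is a hypothetical helper, not executor code):

```python
# Toy model of the "all or none" case: when nbatch doubles, a tuple moves to
# the new batch only if the next hash bit is set, and tuples sharing one
# hash value all keep or flip that bit together.  split_batch is a
# hypothetical helper illustrating the mechanism, not executor code.
def split_batch(hashvalues, bit):
    """Split a batch on one more hash bit; return (kept, moved)."""
    kept = [h for h in hashvalues if not (h >> bit) & 1]
    moved = [h for h in hashvalues if (h >> bit) & 1]
    return kept, moved

# a skewed batch: every tuple hashes to the same value
skewed = [0xDEADBEEF] * 1000
kept, moved = split_batch(skewed, 20)

# nfreed == 0 or nfreed == ninmemory -> growEnabled gets shut off,
# and the batch keeps growing in memory from then on
grow_disabled = len(moved) == 0 or len(moved) == len(skewed)
print(grow_disabled)                   # -> True (nothing moved)
```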
[...]
Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.
What I'd try is this:
[...]
I couldn't print "size" as it is optimized away, which is why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?
There's definitely something wrong. The size should not be 0, and
neither should it be > 1GB. I suspect it's because some of the variables
get optimized out, and gdb just uses some nonsense :-(
I guess you'll need to debug the individual breakpoints and see what's
available. It probably depends on the compiler version, etc. For example
I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
can print the chunk pointer with a bit of arithmetic:
p (block->freeptr - chunk_size)
I suppose similar gymnastics could work for the other breakpoints.
OK, I'll give it a try tomorrow.
Thank you!
NB: the query has been killed by the replication.
On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
Hi,
On Wed, 1 Mar 2023 20:29:11 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
* HashBatchContext goes up to 1441MB after 240s, then stays flat until the
end (400s at the last record)
That's interesting. We're using HashBatchContext for very few things, so
what could be consuming so much memory there? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?
Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets are there?
I did this test this morning.
Batches and buckets increased really quickly to 1048576/1048576.
OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
fillfactor 1.0).
But 1M batches? I guess that could be problematic. It doesn't seem like
much, but we need 1M files on each side - 1M for the hash table, 1M for
the outer relation. That's 16MB of pointers, but the files are BufFiles
and we keep an 8kB buffer for each of them. That's ~16GB right there :-(
In practice it probably won't be that bad, because not all files will be
allocated/opened concurrently (especially if this is due to many tuples
having the same value). Assuming that's what's happening here, ofc.
And I suppose they are closed/freed concurrently as well?
Yeah. There can be different subsets of the files used, depending on
when the number of batches start to explode, etc.
ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
of calls, always short-cut'ed to 1048576, I guess because of the
conditional block «/* safety check to avoid overflow */» appearing early in
this function.

Hmmm, that's a bit weird, no? I mean, the check is

/* safety check to avoid overflow */
if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
return;

Why would it stop at 1048576? It certainly is not higher than INT_MAX/2
and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
happening here?

Indeed, not a good suspect. But what about this other short-cut then?

/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;

[...]
/*
* If we dumped out either all or none of the tuples in the table, disable
* further expansion of nbatch. This situation implies that we have
* enough tuples of identical hashvalues to overflow spaceAllowed.
* Increasing nbatch will not fix it since there's no way to subdivide the
* group any more finely. We have to just gut it out and hope the server
* has enough RAM.
*/
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
#ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
hashtable);
#endif
}

If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect and I can surely track this
from gdb.
Yes, this would make much more sense - it'd be consistent with the
hypothesis that this is due to number of batches exploding (it's a
protection exactly against that).
You specifically mentioned the other check earlier, but now I realize
you've been just speculating it might be that.
Being able to find what are the fields involved in the join could help as well
to check or gather some stats about them, but I hadn't time to dig this yet...
It's going to be tricky, because all parts of the plan may be doing
something, and there may be multiple hash joins. So you won't know if
you're executing the part of the plan that's causing issues :-(
But I have another idea - put a breakpoint on makeBufFile() which is the
bit that allocates the temp files including the 8kB buffer, and print in
what context we allocate that. I have a hunch we may be allocating it in
the ExecutorState. That'd explain all the symptoms.
BTW with how many batches does the hash join start?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2 Mar 2023 01:30:27 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
of calls, always short-cut'ed to 1048576, I guess because of the
conditional block «/* safety check to avoid overflow */» appearing early
in this function.

[...] But what about this other short-cut then?
/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;

[...]
/*
* If we dumped out either all or none of the tuples in the table,
* disable
* further expansion of nbatch. This situation implies that we have
* enough tuples of identical hashvalues to overflow spaceAllowed.
* Increasing nbatch will not fix it since there's no way to subdivide
* the
* group any more finely. We have to just gut it out and hope the server
* has enough RAM.
*/
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
#ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
hashtable);
#endif
}

If I guess correctly, the function is not able to split the current batch,
so it sits and hopes. This is a much better suspect and I can surely track
this from gdb.

Yes, this would make much more sense - it'd be consistent with the
hypothesis that this is due to number of batches exploding (it's a
protection exactly against that).

You specifically mentioned the other check earlier, but now I realize
you've been just speculating it might be that.
Yes, sorry about that, I jumped on this speculation without actually digging it
much...
[...]
But I have another idea - put a breakpoint on makeBufFile() which is the
bit that allocates the temp files including the 8kB buffer, and print in
what context we allocate that. I have a hunch we may be allocating it in
the ExecutorState. That'd explain all the symptoms.
That's what I was wondering as well yesterday night.
So, on your advice, I set a breakpoint on makeBufFile:
(gdb) info br
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000007229df in makeBufFile
bt 10
p CurrentMemoryContext.name
Then, I disabled it and ran the query up to this mem usage:
VIRT RES SHR S %CPU %MEM
20.1g 7.0g 88504 t 0.0 22.5
Then, I enabled the breakpoint and looked at around 600 backtraces and
context names before getting bored. They **all** looked like that:
Breakpoint 1, BufFileCreateTemp (...) at buffile.c:201
201 in buffile.c
#0 BufFileCreateTemp (...) buffile.c:201
#1 ExecHashJoinSaveTuple (tuple=0x1952c180, ...) nodeHashjoin.c:1238
#2 ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
#3 ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
#4 ExecProcNodeInstr (node=<optimized out>) execProcnode.c:462
#5 ExecProcNode (node=0x31a6418)
#6 ExecSort (pstate=0x31a6308)
#7 ExecProcNodeInstr (node=<optimized out>)
#8 ExecProcNode (node=0x31a6308)
#9 fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
$421643 = 0x99d7f7 "ExecutorState"
These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log much more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time instead
of the usual 100%.
So it's not exactly a leak, as memory would be released at the end of the
query, but I suppose they should be allocated in a shorter-lived context,
to avoid this memory bloat, am I right?
BTW with how many batches does the hash join start?
* batches went from 32 to 1048576 before being growEnabled=false as suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to work_mem, but current space usage is 1.3GB, as
measured previously, close to the point where the system refused more memory
allocation.
Here are the full details about the hash associated with the previous backtrace:
(gdb) up
(gdb) up
(gdb) p *((HashJoinState*)pstate)->hj_HashTable
$421652 = {
nbuckets = 1048576,
log2_nbuckets = 20,
nbuckets_original = 1048576,
nbuckets_optimal = 1048576,
log2_nbuckets_optimal = 20,
buckets = {unshared = 0x68f12e8, shared = 0x68f12e8},
keepNulls = true,
skewEnabled = false,
skewBucket = 0x0,
skewBucketLen = 0,
nSkewBuckets = 0,
skewBucketNums = 0x0,
nbatch = 1048576,
curbatch = 0,
nbatch_original = 32,
nbatch_outstart = 1048576,
growEnabled = false,
totalTuples = 19541735,
partialTuples = 19541735,
skewTuples = 0,
innerBatchFile = 0xdfcd168,
outerBatchFile = 0xe7cd1a8,
outer_hashfunctions = 0x68ed3a0,
inner_hashfunctions = 0x68ed3f0,
hashStrict = 0x68ed440,
spaceUsed = 1302386440,
spaceAllowed = 67108864,
spacePeak = 1302386440,
spaceUsedSkew = 0,
spaceAllowedSkew = 1342177,
hashCxt = 0x68ed290,
batchCxt = 0x68ef2a0,
chunks = 0x251f28e88,
current_chunk = 0x0,
area = 0x0,
parallel_state = 0x0,
batches = 0x0,
current_chunk_shared = 1103827828993
}
For what it worth, contexts are:
(gdb) p ((HashJoinState*)pstate)->hj_HashTable->hashCxt.name
$421657 = 0x99e3c0 "HashTableContext"
(gdb) p ((HashJoinState*)pstate)->hj_HashTable->batchCxt.name
$421658 = 0x99e3d1 "HashBatchContext"
Regards,
On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
...
[...]

But I have another idea - put a breakpoint on makeBufFile() which is the
bit that allocates the temp files including the 8kB buffer, and print in
what context we allocate that. I have a hunch we may be allocating it in
the ExecutorState. That'd explain all the symptoms.

That's what I was wondering as well yesterday night.
So, on your advice, I set a breakpoint on makeBufFile:
(gdb) info br
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000007229df in makeBufFile
bt 10
p CurrentMemoryContext.name

Then, I disabled it and ran the query up to this mem usage:

VIRT RES SHR S %CPU %MEM
20.1g 7.0g 88504 t 0.0 22.5

Then, I enabled the breakpoint and looked at around 600 backtraces and
context names before getting bored. They **all** looked like that:

Breakpoint 1, BufFileCreateTemp (...) at buffile.c:201
201 in buffile.c
#0 BufFileCreateTemp (...) buffile.c:201
#1 ExecHashJoinSaveTuple (tuple=0x1952c180, ...) nodeHashjoin.c:1238
#2 ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
#3 ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
#4 ExecProcNodeInstr (node=<optimized out>) execProcnode.c:462
#5 ExecProcNode (node=0x31a6418)
#6 ExecSort (pstate=0x31a6308)
#7 ExecProcNodeInstr (node=<optimized out>)
#8 ExecProcNode (node=0x31a6308)
#9 fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)

$421643 = 0x99d7f7 "ExecutorState"
These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log much more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time instead
of the usual 100%.
Bingo!
So it's not exactly a leak, as memory would be released at the end of the
query, but I suppose they should be allocated in a shorter-lived context,
to avoid this memory bloat, am I right?
Well, yeah and no.
In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then
you process the last tuple and release all the hashes.
This would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles and perhaps for which hashjoin node). But
we'd still have to allocate the memory etc. (so still OOM).
There's only one thing I think could help - increase the work_mem enough
not to trigger the explosive growth in number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.
The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.
I remembered there was a thread [1] about *exactly* this issue in 2019.

[1] /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net
I even posted a couple patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (ignoring the fact that means
more memory will be used for the BufFile stuff).
I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.
BTW with how many batches does the hash join start?
* batches went from 32 to 1048576 before being growEnabled=false as suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to work_mem, but current space usage is 1.3GB, as
measured previously, close to the point where the system refused more memory
allocation.
Yeah, I think this is pretty expected. We start with multiple batches,
so we pick optimal buckets for the whole work_mem (so no growth here).
But then batches explode, in the futile hope to keep this in work_mem.
Once that growth gets disabled, we end up with 1.3GB hash table.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi!
On Thu, 2 Mar 2023 13:44:52 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Well, yeah and no.
In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then
you process the last tuple and release all the hashes.

This would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles and perhaps for which hashjoin node). But
we'd still have to allocate the memory etc. (so still OOM).
Well, accounting things in the correct context would already be worth a patch,
I suppose. At least, it helps to investigate faster. Plus, you already wrote a
patch about that [1]:
/messages/by-id/20190421114618.z3mpgmimc3rmubi4@development
Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing to a Tom Lane email explaining that
ExecutorState was not supposed to be so large [2].

[2] /messages/by-id/12064.1555298699@sss.pgh.pa.us
There's only one thing I think could help - increase the work_mem enough
not to trigger the explosive growth in number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.

The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1] /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net

I even posted a couple patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (ignoring the fact that means
more memory will be used for the BufFile stuff).

I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.
So I read the full thread now. I'm still not sure why we try to avoid hash
collisions so hard, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding
of the discussion and the proposed solutions.

There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it with too
many BufFile buffers. So you expand either the work_mem or the number of
batches, depending on which move is smarter. This is explained and tested here:
/messages/by-id/20190421161434.4hedytsadpbnglgk@development
/messages/by-id/20190422030927.3huxq7gghms4kmf4@development
And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):
/messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development
Then, nothing more on the discussion about this last patch. So I guess it just
went cold.
For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?
Regards,
On Thu, 2 Mar 2023 19:15:30 +0100
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?
To avoid confusion, the two patches I meant were:
* 0001-move-BufFile-stuff-into-separate-context.patch
* v4-per-slice-overflow-file.patch
Regards,
On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
Hi!
On Thu, 2 Mar 2023 13:44:52 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Well, yeah and no.
In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then
you process the last tuple and release all the hashes.

This would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles and perhaps for which hashjoin node). But
we'd still have to allocate the memory etc. (so still OOM).

Well, accounting things in the correct context would already be worth a patch,
I suppose. At least, it helps to investigate faster. Plus, you already wrote a
patch about that [1]:

/messages/by-id/20190421114618.z3mpgmimc3rmubi4@development
Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing to a Tom Lane email explaining that
ExecutorState was not supposed to be so large [2].
Ah, right, I didn't realize it's the same thread. There are far too many
threads about this sort of thing, and I probably submitted half-baked
patches to most of them :-/
There's only one thing I think could help - increase the work_mem enough
not to trigger the explosive growth in number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.

The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1] /messages/by-id/bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net

I even posted a couple patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (ignoring the fact that means
more memory will be used for the BufFile stuff).

I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.

So I read the full thread now. I'm still not sure why we try to avoid hash
collisions so hard, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding
of the discussion and the proposed solutions.
I don't think this is about hash collisions (i.e. the same hash value
being computed for different values). You can construct cases like this,
of course, particularly if you only look at a subset of the bits (for 1M
batches we only look at the first 20 bits), but I'd say it's fairly
unlikely to happen unless you do that intentionally.
(I'm assuming regular data types with reasonable hash functions. If the
query joins on custom data types with some silly hash function, it may
be more likely to have conflicts.)
IMHO a much more likely explanation is there actually is a very common
value in the data. For example there might be a value repeated 1M times,
and that'd be enough to break this.
We do build special "skew" buckets for values from an MCV, but maybe
the stats are not updated yet, or maybe there are too many such values
to fit into the MCV?
I now realize there's probably another way to get into this - oversized
rows. Could there be a huge row (e.g. with a large text/bytea value)?
Imagine a row that's 65MB - that'd be game over with work_mem=64MB. Or
there might be smaller rows, but a couple hash collisions would suffice.
There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it with too
many BufFile buffers. So you expand either the work_mem or the number of
batches, depending on which move is smarter. This is explained and tested here:

/messages/by-id/20190421161434.4hedytsadpbnglgk@development
/messages/by-id/20190422030927.3huxq7gghms4kmf4@development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

/messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Then, nothing more on the discussion about this last patch. So I guess it just
went cold.
I think a contributing factor was that the OP did not respond for a
couple months, so the thread went cold.
For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?
I think another reason why that thread went nowhere was that we've
been exploring a different (and likely better) approach to fix this by
falling back to a nested loop for the "problematic" batches.
As proposed in this thread:
/messages/by-id/20190421161434.4hedytsadpbnglgk@development
So I guess the best thing would be to go through these threads, see what
the status is, restart the discussion and propose what to do. If you do
that, I'm happy to rebase the patches, and maybe see if I could improve
them in some way.
I was hoping we'd solve this by the BNL, but if we didn't get that in 4
years, maybe we shouldn't stall and get at least an imperfect stop-gap
solution ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2 Mar 2023 19:53:14 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...
There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it with
too many BufFile buffers. So you expand either the work_mem or the
number of batches, depending on which move is smarter. This is explained and
tested here:

/messages/by-id/20190421161434.4hedytsadpbnglgk@development
/messages/by-id/20190422030927.3huxq7gghms4kmf4@development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

/messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Then, nothing more on the discussion about this last patch. So I guess it
just went cold.

I think a contributing factor was that the OP did not respond for a
couple months, so the thread went cold.

For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?

I think another reason why that thread went nowhere was that we've
been exploring a different (and likely better) approach to fix this by
falling back to a nested loop for the "problematic" batches.

As proposed in this thread:
Unless I'm wrong, you are linking to the same «frustrating as heck!» discussion,
for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
(balancing between increasing batches *and* work_mem).

No sign of turning "problematic" batches into nested loops. Did I miss something?
Do you have a link close to hand about such an algo/patch by any chance?
I was hoping we'd solve this by the BNL, but if we didn't get that in 4
years, maybe we shouldn't stall and get at least an imperfect stop-gap
solution ...
I'll keep searching tomorrow about existing BNL discussions (is it block
nested loop?).
Regards,
On 3/2/23 23:57, Jehan-Guillaume de Rorthais wrote:
On Thu, 2 Mar 2023 19:53:14 +0100
Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...
There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it with
too many BufFile buffers. So you expand either the work_mem or the
number of batches, depending on which move is smarter. This is explained and
tested here:

/messages/by-id/20190421161434.4hedytsadpbnglgk@development
/messages/by-id/20190422030927.3huxq7gghms4kmf4@development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

/messages/by-id/20190428141901.5dsbge2ka3rxmpk6@development

Then, nothing more on the discussion about this last patch. So I guess it
just went cold.

I think a contributing factor was that the OP did not respond for a
couple months, so the thread went cold.

For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?

I think another reason why that thread went nowhere was that we've
been exploring a different (and likely better) approach to fix this by
falling back to a nested loop for the "problematic" batches.

As proposed in this thread:

Unless I'm wrong, you are linking to the same «frustrating as heck!» discussion,
for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
(balancing between increasing batches *and* work_mem).

No sign of turning "problematic" batches into nested loops. Did I miss something?
Do you have a link close to hand about such an algo/patch by any chance?
Gah! My apologies, I meant to post a link to this thread:
/messages/by-id/CAAKRu_b6+jC93WP+pWxqK5KAZJC5Rmxm8uquKtEf-KQ++1Li6Q@mail.gmail.com
which then points to this BNL patch
/messages/by-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA@mail.gmail.com
That discussion apparently stalled in August 2020, so maybe that's where
we should pick up and see in what shape that patch is.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company