Enforce work_mem per worker

Started by Arne Roland about 4 years ago · 5 messages
#1 Arne Roland
A.Roland@index.de

Hello!

Since I have spent a lot of my time in recent weeks chasing short-lived processes that eat away big chunks of memory, I am wondering about a decent way to go about this.
The problem I am facing essentially comes down to the fact that work_mem, while it is enforced per hash and sort node, isn't enforced globally.
One common case that causes this problem more frequently than a few years ago is the partitionwise join. If a lot of partitions are hash joined, we get a lot of hash nodes, each one potentially consuming work_mem.

While avoiding OOM seems like a big deal to me, my search didn't turn up previous hackers discussions about this. There is a good chance I am missing something here, so I'd appreciate any pointers.

The most reasonable solution seems to me to be a per-worker data structure that 1. tracks the amount of memory used by certain nodes and 2. offers a callback to let a node spill its contents (almost) completely to disk. I am thinking about hash and sort nodes for now, since they affect memory usage a lot.
This would allow one node to spill other nodes' contents to disk to avoid exceeding work_mem.
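To make that a bit more concrete, here is a minimal standalone sketch of the kind of bookkeeping I have in mind. This is explicitly not PostgreSQL executor code: every name in it (MemBudget, SpillableNode, membudget_reserve and so on) is made up for illustration, and the real thing would of course hook into the existing hash/sort spill paths rather than a toy callback.

/*
 * Standalone sketch only -- NOT actual PostgreSQL code.  All names here
 * (MemBudget, SpillableNode, membudget_*) are invented for illustration.
 */
#include <stdio.h>

typedef struct SpillableNode
{
    const char *name;       /* e.g. "Hash on part_17" */
    size_t      mem_used;   /* bytes currently held in memory */
    void      (*spill)(struct SpillableNode *node);  /* dump (almost) everything to disk */
} SpillableNode;

typedef struct MemBudget
{
    size_t          limit;      /* per-worker budget, think work_mem */
    size_t          used;       /* sum of mem_used over all registered nodes */
    int             nnodes;
    SpillableNode  *nodes[64];  /* fixed size, good enough for a sketch */
} MemBudget;

static void
membudget_register(MemBudget *budget, SpillableNode *node)
{
    budget->nodes[budget->nnodes++] = node;
}

/*
 * A node calls this before growing its in-memory state by "more" bytes.
 * If the worker as a whole would exceed the budget, the largest other
 * consumer is asked to spill until the reservation fits.
 */
static void
membudget_reserve(MemBudget *budget, SpillableNode *self, size_t more)
{
    while (budget->used + more > budget->limit)
    {
        SpillableNode *victim = NULL;

        for (int i = 0; i < budget->nnodes; i++)
        {
            SpillableNode *n = budget->nodes[i];

            if (n == self || n->mem_used == 0)
                continue;
            if (victim == NULL || n->mem_used > victim->mem_used)
                victim = n;
        }

        if (victim == NULL)
            break;              /* nothing left to spill; this node alone exceeds the budget */

        budget->used -= victim->mem_used;
        victim->spill(victim);  /* callback frees (almost) all of the victim's memory */
        victim->mem_used = 0;
    }

    budget->used += more;
    self->mem_used += more;
}

static void
demo_spill(SpillableNode *node)
{
    printf("spilling %s to disk\n", node->name);
}

int
main(void)
{
    MemBudget       budget = {.limit = 4 * 1024 * 1024};   /* pretend work_mem = 4MB */
    SpillableNode   hash_a = {"Hash on part_1", 0, demo_spill};
    SpillableNode   hash_b = {"Hash on part_2", 0, demo_spill};

    membudget_register(&budget, &hash_a);
    membudget_register(&budget, &hash_b);

    membudget_reserve(&budget, &hash_a, 3 * 1024 * 1024);  /* fits within the budget */
    membudget_reserve(&budget, &hash_b, 3 * 1024 * 1024);  /* forces hash_a to spill */
    return 0;
}

The point of the sketch is just the shape of the contract: nodes register themselves, report how much memory they hold, and expose a spill callback, so whichever node happens to hit the per-worker budget can push the biggest other consumer out to disk instead of blowing past the limit.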

I'd love to hear your thoughts and suggestions!

Regards
Arne

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Arne Roland (#1)
Re: Enforce work_mem per worker

On Sat, Nov 27, 2021 at 04:33:07PM +0000, Arne Roland wrote:

> Hello!
>
> Since I have spent a lot of my time in recent weeks chasing short-lived processes that eat away big chunks of memory, I am wondering about a decent way to go about this.
> The problem I am facing essentially comes down to the fact that work_mem, while it is enforced per hash and sort node, isn't enforced globally.
> One common case that causes this problem more frequently than a few years ago is the partitionwise join. If a lot of partitions are hash joined, we get a lot of hash nodes, each one potentially consuming work_mem.
>
> While avoiding OOM seems like a big deal to me, my search didn't turn up previous hackers discussions about this. There is a good chance I am missing something here, so I'd appreciate any pointers.

Here's some pointers ;)

/messages/by-id/20190708164401.GA22387@telsasoft.com
/messages/by-id/20191216215314.qvrtgxaz4m755geq@development
/messages/by-id/CAH2-WzmNwV=LfDRXPsmCqgmm91mp=2b4FvXNF=cCvMrb8YFLfQ@mail.gmail.com
- I don't recall reading all of this last one before, and it's got interesting
historic value, so I'm reading it myself now...

--
Justin

#3 Arne Roland
A.Roland@index.de
In reply to: Justin Pryzby (#2)
Re: Enforce work_mem per worker

I did read parts of the last one back then. But thanks for the link, I plan to reread the thread as a whole.

From what I can tell, the discussions there are attempts by very smart people to (at least partially) solve the problem of memory allocation without sacrificing too much on the runtime front. That problem is very hard.

What I am mostly trying to do is provide a reliable way of preventing the operational hazard of dealing with OOM and the like, e.g. massive kernel buffer eviction. I don't want to touch the planner, which is always complex and tends to introduce weird side effects.

That way we can't hope to prevent the issue from occurring generally. I'm much more concerned with containing it, if it happens.

In the case that there is only a single pass, which tends to be the case for a lot of queries, my suggested approach would even help the offender.

But my main goal is something else. I can't explain to my clients why a change in statistics due to autovacuum suddenly leads to OOM. They would be right to question Postgres's qualification for any serious production system.

Regards

Arne

#4 Justin Pryzby
pryzby@telsasoft.com
In reply to: Arne Roland (#3)
Re: Enforce work_mem per worker

On Mon, Nov 29, 2021 at 02:01:35PM +0000, Arne Roland wrote:

> But my main goal is something else. I can't explain to my clients why a change in statistics due to autovacuum suddenly leads to OOM. They would be right to question Postgres's qualification for any serious production system.

What version of Postgres was that on?

I realize it doesn't address your question, but since PG13, HashAggregate
respects work_mem. Depending on the details of the query plan, that's arguably
a bigger problem than the absence of a global work_mem. At least that one is
resolved.

--
Justin

#5 Arne Roland
A.Roland@index.de
In reply to: Justin Pryzby (#4)
Re: Enforce work_mem per worker

From: Justin Pryzby <pryzby@telsasoft.com>
Sent: Monday, November 29, 2021 16:10

> On Mon, Nov 29, 2021 at 02:01:35PM +0000, Arne Roland wrote:
>
>> But my main goal is something else. I can't explain to my clients why a change in statistics due to autovacuum suddenly leads to OOM. They would be right to question Postgres's qualification for any serious production system.

> What version of Postgres was that on?

It's pg13 and pg14 mostly. I have different servers with similar problems.

> I realize it doesn't address your question, but since PG13, HashAggregate
> respects work_mem.

I haven't run into issues with hash agg personally.

> Depending on the details of the query plan, that's arguably
> a bigger problem than the absence of a global work_mem. At least that one is
> resolved.

I can go around fixing issues with individual plans, but plans are inherently unstable, and we can't have people becoming wary of autoanalyze.
Having a single wild plan bring down a whole cluster is just madness.

There are a bunch of different problems that can occur. From where I stand, this almost invalidates partitionwise hash joins, because you'd generate one hash node per partition. But you can still get the same effect from sorts under a merge append, even without partitionwise joins.
To quote your message from 2019:

> gracefully support[...ing] "thousands" of partitions

Doesn't that mean using 1000 * work_mem?
Am I wrong here?
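Just to spell that out with round numbers (my own back-of-the-envelope arithmetic, nothing measured):

    1000 partitions *  4 MB (default work_mem) = ~4 GB  in one process
    1000 partitions * 64 MB (a common setting) = ~64 GB in one process

and every parallel worker multiplies that again.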

Regards
Arne