Automatically setting work_mem
One of the key points influencing the performance of certain operations is
the amount of memory available to them. Sorts, Materializes, Hash
Joins, hashed Aggregates, and hashed subquery plans all want lots of memory.
Static allocation of memory is good in some situations, but not in
others. In many cases, sysadmins want to be able to tell the server how
much memory it can use and then have the server work out how to allocate
that according to the work already executing. (Whatever we do, the
static allocation of memory via work_mem should remain, so I suggest
only additional options rather than change of the existing mechanisms.)
My goal is a simple and effective way of increasing performance without
needing to sweat over particular settings for individual backends, all
of which need to be individually reconsidered when we upgrade memory.
Small additional amounts of memory can make huge differences to elapsed
times; we need a flexible way to use more when it makes sense to do so.
I envisage a new setting, shared_work_mem, that would allow a sysadmin
to define a secondary pool of memory that backends could dip into
once they run out of their primary allocation (work_mem).
shared_work_mem=0 would provide the current behaviour.
shared_work_mem would *not* be allocated until time of use; it is an
abstract concept only. As explained below, it would not be shared memory
at all, but privately allocated process memory.
We would only look at dynamically changing work_mem in those few
restricted cases where we track that against the work_mem limit. If we
hit that limit, we would make a request to the central pool: "Can I be
allotted another 2MB please?" (etc). The central allotment mechanism
would then say Yes or No. If allotted the memory, the backend would then
palloc up to that limit. The backend may return later for additional
allotments, but for now it has been allowed to dynamically increase its
memory usage. This allotment would be noted in the memory context
header, so that when the memory context is freed, the allotment can be
"returned" to the central pool by a deallotment call. This is now easier
than before since each sort within a query has its own memory context.
(I use the term "allot" to differentiate it from the term "allocate"
which describes the execution of malloc etc. First the server would
conceptually allot memory, then we would physically allocate it. Once
allocated, memory would not be deallocated any earlier than normal.
Memory would never be deallotted until the memory is deallocated.)
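To make the allot/deallot protocol concrete, here is a minimal sketch of what the central pool bookkeeping might look like. All names are hypothetical, and this single-process version ignores the shared-memory placement and locking a real implementation would need:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical central-pool bookkeeping for shared_work_mem.  In a real
 * backend this state would live in shared memory and be protected by a
 * spinlock; the names are invented for illustration.
 */
static size_t pool_total = 0;   /* shared_work_mem, in bytes */
static size_t pool_used  = 0;   /* sum of outstanding allotments */

void pool_init(size_t shared_work_mem_bytes)
{
    pool_total = shared_work_mem_bytes;
    pool_used = 0;
}

/* "Can I be allotted another n bytes please?"  Returns the bytes granted
 * (0 means No).  This sketch uses an all-or-nothing policy. */
size_t pool_allot(size_t request)
{
    size_t avail = pool_total - pool_used;

    if (request > avail)
        return 0;
    pool_used += request;
    return request;
}

/* Called when the memory context that recorded the allotment is freed. */
void pool_deallot(size_t grant)
{
    assert(grant <= pool_used);
    pool_used -= grant;
}
```

A backend hitting its work_mem limit would call pool_allot() before palloc'ing past the limit, note the grant in its memory context header, and pool_deallot() when the context is freed.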
shared_work_mem would be a SUSET parameter, allowing it to be changed
up/down while the server is running. All of the details around this would be
built into a new API for re/palloc calls aimed at larger requests.
Some thorny points are:
1. what is the allotment policy?
2. what do we do when the memory pool has all been used up?
3. do we make the allocation at planning time? - allowing us to
potentially choose a different plan because we know we will have the
memory available when we get to the executor.
My current thoughts are:
(1) allow different allotment policies, possibly even using an external
function call, but basically giving everybody the flexibility they want.
For now, we can provide a simple mechanism for starters, then add more
later, e.g.:
shared_work_mem_policy = 'firstcomefirstserved(10)'
I don't want to introduce too many new parameters here...
(2) Various options here:
a) query queues for allocation
b) query throws ERROR OOM (unlikely to be a popular one)
c) query gets nothing (good alloc scheme can prevent harshness here...)
Simplicity says c), since a) is lots of work and b) is not useful.
(3) let's do this simply to start with - allocation only occurs in the
executor, so it is tied neatly into the executor memory contexts.
Another idea is to introduce transactional statement queuing, but that
may be a sledgehammer to crack a fairly simple nut.
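Back to point (1): here is a sketch of how a 'firstcomefirstserved(10)' policy might behave. The "(10)" argument is read here, purely as a guess, as a per-backend cap of 10% of the pool; the real policy syntax and semantics were never specified.

```c
#include <stddef.h>

/*
 * Hypothetical "firstcomefirstserved(cap_pct)" allotment policy: grant
 * requests in arrival order, but refuse any grant that would push a
 * single backend above cap_pct percent of the whole pool.
 */
size_t fcfs_allot(size_t pool_total, size_t *pool_used,
                  size_t *backend_held, size_t request, int cap_pct)
{
    size_t backend_cap = pool_total * cap_pct / 100;
    size_t avail = pool_total - *pool_used;

    /* Refuse if the pool is short or the backend would exceed its cap. */
    if (request > avail || *backend_held + request > backend_cap)
        return 0;
    *pool_used += request;
    *backend_held += request;
    return request;
}
```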
There are some additional technical points which need not be discussed
yet, though which would need to be addressed also.
Best Regards, Simon Riggs
"Simon Riggs" <simon@2ndquadrant.com> wrote
We would only look at dynamically changing work_mem in those few
restricted cases where we track that against the work_mem limit. If we
hit that limit, we would make a request to the central pool: "Can I be
allotted another 2MB please?" (etc). The central allotment mechanism
would then say Yes or No. If allotted the memory, the backend would then
palloc up to that limit. The backend may return later for additional
allotments, but for now it has been allowed to dynamically increase its
memory usage. This allotment would be noted in the memory context
header, so that when the memory context is freed, the allotment can be
"returned" to the central pool by a deallotment call. This is now easier
than before since each sort within a query has its own memory context.
Interesting. I understand that shared_work_mem is per-process,
allocate-on-use, and request-may-or-may-not-get-it (as you have pointed out,
this may put the planner in a hard situation if we are sensitive to work_mem).
But I still have something unclear. Let's say we have a sort operation that
needs 1024 KB of memory. The DBA then has the following two options:
(1) SET work_mem = 1024; SET shared_work_mem = 0; do sort;
(2) SET work_mem = 512; SET shared_work_mem = 512; do sort;
So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and 2 will still keep 512?
Regards,
Qingqing
On Fri, 2006-03-17 at 13:29 +0800, Qingqing Zhou wrote:
"Simon Riggs" <simon@2ndquadrant.com> wrote
Interesting. I understand that shared_work_mem is per-process,
allocate-on-use, and request-may-or-may-not-get-it (as you have pointed out,
this may put the planner in a hard situation if we are sensitive to work_mem).
But I still have something unclear. Let's say we have a sort operation that
needs 1024 KB of memory. The DBA then has the following two options:
(1) SET work_mem = 1024; SET shared_work_mem = 0; do sort;
(2) SET work_mem = 512; SET shared_work_mem = 512; do sort;
So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and 2 will still keep 512?
The differences are
(1) no performance difference - all memory would be allocated and
deallocated at the same time in either case
(2) shared_work_mem is SUSET rather than USERSET as work_mem is...
(3) The value is set for the whole server rather than by individual
tuning, so it would not make sense to use it as you have shown, even
though you could if you were the superuser
The goal is to do this:
do sort; /* no work_mem settings at all */
with shared_work_mem set once for the whole server in postgresql.conf
Best Regards, Simon Riggs
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and 2 will still keep 512?
Point 2 is actually a serious flaw in Simon's proposal, because there
is no portable way to make malloc return freed memory to the OS. Some
mallocs will do that ... in some cases ... but many simply don't ever
move the brk address down. It's not an easy thing to do when the arena
gets cluttered with a lot of different alloc chunks and only some of
them get freed.
So the semantics we'd have to adopt is that once a backend claims some
"shared work mem", it keeps it until process exit. I don't think that
makes the idea worthless, because there's usually a clear distinction
between processes doing expensive stuff and processes doing cheap
stuff. But it's definitely a limitation. Also, if you've got a process
doing expensive stuff, it's certainly possible to expect the user to
just increase work_mem locally.
(BTW, given that work_mem is locally increasable, I'm not sure what's
the point of considering that shared_work_mem has to be SUSET. It's
not to prevent users from blowing out memory.)
My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries. I don't know how to fix
this though. The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.
regards, tom lane
My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries. I don't know how to fix
this though. The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.
I know this is not "right to the point" related to what is discussed in
this thread, and that it would need some serious work, but how about a
mechanism to allow plans some flexibility at run-time ? What I mean is
not to do all the decisions at plan time, but include some "branches" in
the plan, and execute one branch or the other depending on actual
parameter values, current statistics, current memory available, ...
(name here other run-time resources).
This would make long-term caching of query plans a lot more feasible. For
example, you wouldn't have to worry too much about changing statistics if at
runtime you can check them again... and you could put in decision points
based on current memory resources. Of course, there must still be a balance
between the number of decision points (which ultimately determines the
size of the plan) and robustness against changing conditions, i.e.
branches should only go in for conditions likely to change.
Is this completely not feasible with current postgres architecture ? I
have no idea how the planning/runtime works internally.
It's worth a look at how Apache Derby handles query planning, where a
planned query is actually a compiled Java class, i.e. the executable
byte code which will run to fetch the results, created and compiled by
the planner... an interesting approach that allows for lots of flexibility
at run-time, but probably won't work with C :-)
Cheers,
Csaba.
Tom,
My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries. I don't know how to fix
this though. The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.
Yes ... the unpredictability is the problem:
(1) We can only allocate the # of connections and default work_mem per
operation.
(2) There are a variable # of concurrent queries per connection (0..1)
(3) Each query has a variable # of operations requiring work_mem, which
will require a variable amount of work_mem. If your malloc is good, this
is limited to concurrent operations, but for some OSes this is all
operations per query. Thus the former uses 0..3xwork_mem per query and
the latter 0..7x in general practice.
Overall, this means that based on a specific max_connections and work_mem,
a variable amount between 1*work_mem and (connections*3*work_mem) memory
may be needed at once. Since the penalty for overallocating RAM is severe
on most OSes, DBAs are forced to allow for the worst case. This results
in around 2/3 underallocation on systems with unpredictable loads. This
means that, even on heavily loaded DB systems, most of the time you're
wasting a big chunk of your RAM.
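The worst-case sizing described above can be written down directly; the function and the figures in the test are only illustrative:

```c
/*
 * Illustrative worst-case sizing: every connection runs a query with
 * several work_mem-sized operations at once.  All numbers are examples,
 * not recommendations.
 */
long worst_case_kb(long max_connections, long work_mem_kb, long ops_per_query)
{
    return max_connections * ops_per_query * work_mem_kb;
}
```

With 100 connections, 3 concurrent sort/hash operations per query, and 1 GB to spend, work_mem can be at most about 1048576/300 ≈ 3495 KB, even though far less is actually in use most of the time.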
Simon and I met about this (and other stuff) at the GreenPlum offices last
summer. The first plan we came up with is query queueing. Query
queueing would eliminate variability (2), which would then make the only
variability one of how much work_mem would be needed per query, reducing
(but not eliminating) the underallocation. Additionally, we thought to
tie the query queues to ROLES, which would allow the administrator to
better control how much work_mem per type of query was allowed. It would
also allow admins to balance priorities better on mixed-load machines.
Mind you, I'm also thinking that on enterprise installations with
multi-department use of the database, the fact that work_mem is
inalienably USERSET is also an allocation problem. One user with a SET
command can blow all of your resource planning away.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Mind you, I'm also thinking that on enterprise installations with
multi-department use of the database, the fact that work_mem is
inalienably USERSET is also an allocation problem. One user with a SET
command can blow all of your resource planning away.
One user with ability to enter arbitrary SQL commands can *always* blow
your resource planning away. Blaming such things on work_mem is
seriously misguided.
regards, tom lane
Ühel kenal päeval, R, 2006-03-17 kell 09:46, kirjutas Tom Lane:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and 2 will still keep 512?

Point 2 is actually a serious flaw in Simon's proposal, because there
is no portable way to make malloc return freed memory to the OS.
So perhaps we could keep the shared_work_mem in actual shared memory,
and alloc it to processes from there?
We probably can't get it into a contiguous chunk, but at least we can
give it back for other backends to use when done.
My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries. I don't know how to fix
this though. The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.
Why not make the work_mem allocation one of the variable parameters
that is fed to the planner, and try optimising for different sets of
sub-work_mems?
---------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
So perhaps we could keep the shared_work_mem in actual shared memory,
and alloc it to processes from there?
No, that's utterly not reasonable, both from an allocation point of view
(you'd have to make shared memory enormous, and not all platforms will
like that) and from a locking point of view.
regards, tom lane
On Fri, Mar 17, 2006 at 04:45:17PM -0500, Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
So perhaps we could keep the shared_work_mem in actual shared memory,
and alloc it to processes from there?

No, that's utterly not reasonable, both from an allocation point of view
(you'd have to make shared memory enormous, and not all platforms will
like that) and from a locking point of view.
Perhaps we just need to tweak the memory allocation routines to use
mmap() for large allocations rather than malloc(). Then they can be
easily returned to the system unlike the system heap. glibc does this
automatically sometimes.
Though you have to be careful: continuous mmap()/munmap() is more
expensive than malloc()/free(), because mmap()ed memory must be zeroed
out, which costs cycles...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
"Luke Lonergan" <llonergan@greenplum.com> writes:
We've looked at using the open source llvm compiler to create an
intermediate representation of the plan, then generate machine code and
dispatch for execution.
This would buy what exactly?
regards, tom lane
Csaba,
On 3/17/06 7:07 AM, "Csaba Nagy" <nagy@ecircle-ag.com> wrote:
It's worth a look at how Apache Derby handles query planning, where a
planned query is actually a compiled Java class, i.e. the executable
byte code which will run to fetch the results, created and compiled by
the planner... an interesting approach that allows for lots of flexibility
at run-time, but probably won't work with C :-)
We've looked at using the open source llvm compiler to create an
intermediate representation of the plan, then generate machine code and
dispatch for execution.
This would have the advantage of being able to place runtime constants into
the intermediate representation as constants (like the address of a
comparator function or operator), then let the compiler optimize them out,
hoist them, etc. You can't do this at compile time, and there would be no
change to the nice abstract code in the executor.
It's on our list - anyone else interested?
- Luke
Tom,
On 3/17/06 9:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
This would buy what exactly?
I guess you didn't read the other 80% of the post.
In short, faster performance through more aggressive runtime compilation. A
JIT for the database kernel. It's not like I'm on shaky ground here - other
commercial DBMS have done it for over a decade.
- Luke
Tom,
On 3/17/06 12:18 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
One user with ability to enter arbitrary SQL commands can *always* blow
your resource planning away. Blaming such things on work_mem is
seriously misguided.
Agreed - that's why we need to split this discussion into the two categories
of (1) scheduling for concurrency protection and (2) dynamic resource
allocation.
Topic (1) is best handled by statement queuing IMO and as demonstrated by
other commercial DBMS. This allows queues of different resource demands to
be used for ensuring that statements can not over consume memory, temp disk,
etc, and that queries with large requirements for some or all of those can
be allocated as much as possible, and those with smaller requirements will
be run (likely at much higher rates) while longer running queries take up
the larger resource pool.
(2) is what this thread is mostly talking about, and the dynamic allocation
of memory to plan nodes (sort, hash) needs to be done so that we are much
more efficient in memory footprint and give more where it's needed. (2)
will require some way of putting an overall memory footprint to a statement,
then sub-allocating within it. I suggest we assume that the overall memory
footprint is constrained somehow, perhaps another GUC that describes a per
statement maximum memory consumption, then at plan time we determine the
sub-allocations that best achieve the plan. This would fit within a scheme
for (1) when we develop one.
- Luke
Luke Lonergan wrote:
Tom,
On 3/17/06 9:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
This would buy what exactly?
I guess you didn't read the other 80% of the post.
In short, faster performance through more aggressive runtime compilation. A
JIT for the database kernel. It's not like I'm on shaky ground here - other
commercial DBMS have done it for over a decade.
In exactly a fortnight from today, I'll repost my suggestion to rewrite the backend in Java ;-)
Regards,
Thomas Hallgren
Thomas Hallgren wrote:
Luke Lonergan wrote:
Tom,
On 3/17/06 9:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
This would buy what exactly?
I guess you didn't read the other 80% of the post.
In short, faster performance through more aggressive runtime
compilation. A
JIT for the database kernel. It's not like I'm on shaky ground here -
other
commercial DBMS have done it for over a decade.

In exactly a fortnight from today, I'll repost my suggestion to rewrite
the backend in Java ;-)
Please do! We haven't seen this suggestion for quite a while; I'm
starting to miss it...
Regards,
Andreas
On Fri, 2006-03-17 at 09:46 -0500, Tom Lane wrote:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and 2 will still keep 512?

Point 2 is actually a serious flaw in Simon's proposal, because there
is no portable way to make malloc return freed memory to the OS. Some
mallocs will do that ... in some cases ... but many simply don't ever
move the brk address down. It's not an easy thing to do when the arena
gets cluttered with a lot of different alloc chunks and only some of
them get freed.
I'm aware of that objection and agree it's an issue...
One of the situations I am looking at is larger queries with multiple
sorts in them. I'm getting some reasonable results for the final merge even
after releasing lots of memory (say 50-90%). The memtuples array is not
required at all for randomAccess sorts (a 100% reduction).
The largest requirement for memory is the run building during
performsort. That portion of the code is not concurrently executed
within the same query. If we can reduce memory usage after that phase
completes then we stand a chance of not overusing memory on a big query
and not being able to reclaim it.
So overall memory usage could be as low as work_mem + (numsorts * 0.1 *
work_mem) which is a lot less than numsorts * work_mem. (e.g. 130% of
work_mem rather than 300% work_mem).
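The estimate above can be put into a small illustrative calculation (integer KB arithmetic; function names are invented):

```c
/*
 * Illustrative peak-memory arithmetic.  Assumes, as described above, that
 * after run building each sort can drop to ~10% of work_mem, and that only
 * one sort builds runs at a time.
 */
long peak_with_release_kb(int numsorts, long work_mem_kb)
{
    /* one sort building runs at full size, plus ~10% kept by each sort */
    return work_mem_kb + numsorts * work_mem_kb / 10;
}

long peak_naive_kb(int numsorts, long work_mem_kb)
{
    return (long) numsorts * work_mem_kb;   /* every sort at full size */
}
```

For three sorts and work_mem = 1000 KB this gives 1300 KB versus 3000 KB, matching the 130% vs 300% figures above.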
So the semantics we'd have to adopt is that once a backend claims some
"shared work mem", it keeps it until process exit. I don't think that
makes the idea worthless, because there's usually a clear distinction
between processes doing expensive stuff and processes doing cheap
stuff. But it's definitely a limitation.
...Hopefully less so with mem reduction changes.
The other way is of course to allocate *all* sort/hash/big stuff space
out of shared memory and then let all the backends fight it out
(somehow...) to see who gets access to it. That way backends stay small
and we have well bounded memory usage.
Also, if you've got a process
doing expensive stuff, it's certainly possible to expect the user to
just increase work_mem locally.
Doing that is fine, but you have to retune the system every time the
memory usage changes for any reason. So most of the time manual tuning
has to be very conservative to avoid getting it wrong.
My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries. I don't know how to fix
this though. The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.
We spent about 5 hours discussing that, and the best answer was "use
queuing"...
= = = =
Anyway, thinking just about sort, I've got the following concrete
suggestions (first two of which coded and tested):
1. I originally picked MERGE_BUFFER_SIZE at 32 blocks as a guess. Better
test results show that is indeed the optimum when we take into account
both intermediate and final merging. However, the preferred buffer size
would be about 256 blocks in the case that Nruns << Ntapes i.e. when
work_mem is set high. In this case memory is reallocated to reduce the
overall usage after "performsort done" has happened.
2. When a tape runs out of tuples, its memory is reallocated to
remaining tapes to increase their I/O efficiency. This should help to
increase performance for smaller work_mem settings with large sorts, or
anything where the final merge Nruns is close to Ntapes. You can see
this occurring in the example below. The reallocation is either done
uniformly or all onto a single tape, depending upon the preread pattern.
This mostly doesn't occur with well sorted output, so there is little
overhead from doing this in the general case.
Right now, large sort performance is very good, whereas smaller sort
performance is still fairly bad.
3. We implement the new merge algorithm as Tom suggested...
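The reallocation described in point 2 amounts to something like the following sketch of the even-redistribution case; this is an illustration, not the actual tuplesort.c code:

```c
/*
 * When one tape is exhausted during the final merge, share its preread
 * buffer space evenly among the tapes still active, improving their I/O
 * efficiency for the rest of the merge.
 */
void redistribute(long *tape_bufs, int ntapes, int exhausted)
{
    long freed = tape_bufs[exhausted];
    int  alive = 0;
    int  i;

    tape_bufs[exhausted] = 0;
    for (i = 0; i < ntapes; i++)
        if (tape_bufs[i] > 0)
            alive++;
    if (alive == 0)
        return;
    for (i = 0; i < ntapes; i++)
        if (tape_bufs[i] > 0)
            tape_bufs[i] += freed / alive;  /* any remainder is dropped */
}
```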
Best Regards, Simon Riggs
On Sat, 2006-03-18 at 13:21 -0800, Luke Lonergan wrote:
In short, faster performance through more aggressive runtime compilation. A
JIT for the database kernel. It's not like I'm on shaky ground here - other
commercial DBMS have done it for over a decade.
I think what Luke may be referring to is the ability to compile WHERE
clauses to remove the bottleneck around Eval for complex searches.
Best Regards, Simon Riggs
On Tue, Mar 21, 2006 at 08:05:50PM +0000, Simon Riggs wrote:
Point 2 is actually a serious flaw in Simon's proposal, because there
is no portable way to make malloc return freed memory to the OS. Some
mallocs will do that ... in some cases ... but many simply don't ever
move the brk address down. It's not an easy thing to do when the arena
gets cluttered with a lot of different alloc chunks and only some of
them get freed.
<snip>
The largest requirement for memory is the run building during
performsort. That portion of the code is not concurrently executed
within the same query. If we can reduce memory usage after that phase
completes then we stand a chance of not overusing memory on a big query
and not being able to reclaim it.
There is one way to guarantee the memory is released to the OS after
completion: make the allocator allocate work_mem bytes using mmap()
rather than malloc(). munmap() will then definitely return the memory to
the OS. Unfortunately, the coding required would probably not be
straightforward... Glibc will only convert malloc() to an mmap() on
allocations > 128KB, and I don't think PostgreSQL ever does that.
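One related glibc-specific knob: mallopt() can lower the threshold at which malloc() switches to mmap(), making large allocations individually returnable via free(). This is a glibc extension, not POSIX, so it could at best be a platform-specific optimization:

```c
#include <malloc.h>     /* mallopt() and M_MMAP_THRESHOLD: glibc extension */

/*
 * Ask glibc to satisfy allocations above 32 KB via mmap(), so free() on a
 * big sort buffer returns pages to the kernel instead of growing the heap
 * arena.  Non-portable: this is a glibc knob, not POSIX.
 */
int lower_mmap_threshold(void)
{
    return mallopt(M_MMAP_THRESHOLD, 32 * 1024);    /* returns 1 on success */
}
```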
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@svana.org> writes:
There is one way to guarantee the memory is released to the OS after
completion: make the allocator allocate work_mem bytes using mmap()
rather than malloc(). munmap() will then definitely return the memory to
the OS. Unfortunately, the coding required would probably not be
straightforward...
Nor portable.
Glibc will only convert malloc() to an mmap() on
allocations > 128KB and I don't think PostgreSQL ever does that.
Actually, we do: it doesn't take very long for the sequence of block
allocations within a context to ramp up to 128K. (And I wouldn't be
opposed to tweaking the logic in aset.c to make it happen faster, once
an initial small allocation is filled up.) Also, individual chunk
requests exceeding 8K or thereabouts are fed directly to malloc, so
stuff like the SortTuple array might well be effectively mmap'd.
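The ramp-up described here can be illustrated with a toy model of geometric block growth; the constants in the test are illustrative, not the real aset.c values:

```c
#include <stddef.h>

/*
 * Toy model of aset.c-style geometric block growth: each new block
 * doubles in size until a maximum block size is reached.
 */
size_t next_block_size(size_t cur, size_t maxblk)
{
    size_t next = cur * 2;

    return (next > maxblk) ? maxblk : next;
}

/* Count blocks allocated before the block size reaches a threshold. */
int blocks_until(size_t init, size_t threshold, size_t maxblk)
{
    int n = 0;
    size_t sz = init;

    while (sz < threshold)
    {
        sz = next_block_size(sz, maxblk);
        n++;
    }
    return n;
}
```

Starting from 8 KB blocks, only four doublings reach the 128 KB range where glibc's malloc switches to mmap, which is why the ramp-up "doesn't take very long".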
I'm fairly unconvinced about Simon's underlying premise --- that we
can't make good use of work_mem in sorting after the run building phase
--- anyway. If we cut back our memory usage then we'll be forcing a
significantly more-random access pattern to the temp file(s) during
merging, because we won't be able to pre-read as much at a time.
regards, tom lane