Vacuum rate limit in KBps
So far the reaction I've gotten from my recent submission to make
autovacuum log its read/write in MB/s has been rather positive. I've
been surprised at the unprecedented (to me at least) amount of
backporting onto big production systems it's gotten. There is a whole
lot of pent up frustration among larger installs over not having good
visibility into how changing cost-based vacuum parameters turns into
real-world units.
That got me thinking: if MB/s is what everyone wants to monitor, can we
provide a UI to set these parameters that way too? The attached patch
is a bit rough still, but it does that. The key was recognizing that
the cost delay plus cost limit can be converted into an upper limit on
cost units per second, presuming the writes themselves are free. If you
then also assume the worst case--that everything will end up dirty--by
throwing in the block size, too, you compute a maximum rate in MB/s.
That represents the fastest you can possibly write.
If you then turn that equation around, making the maximum write rate the
input, for any given cost delay and dirty page cost you can solve for
the cost limit--the parameter in fictitious units everyone hates. It
works like this, with the computation internals logged every time they
run for now:
#vacuum_cost_rate_limit = 4000 # maximum write rate in kilobytes/second
LOG: cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
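The arithmetic behind that log line can be sketched like this. This is a back-of-the-envelope reconstruction from the numbers above, not the patch's actual code, and it assumes the default 8 KB block size:

```python
# Rough sketch of the cost limit derivation described above.
# Worst case assumed: every page vacuumed is dirtied, so each page
# costs vacuum_cost_page_dirty cost units.

BLOCK_KB = 8  # BLCKSZ / 1024, assuming the default 8 KB block size

def cost_limit_from_rate(rate_limit_kb_s, cost_delay_ms, dirty_cost):
    # Pages written per second at the cap, worst case (all dirty):
    pages_per_sec = rate_limit_kb_s / BLOCK_KB
    # Cost units consumed per second at that page rate:
    cost_per_sec = pages_per_sec * dirty_cost
    # The budget that must be granted per sleep interval:
    return int(cost_per_sec * cost_delay_ms / 1000)

# Matches the log line: 4000 KB/s, 20 ms delay, dirty cost 20
print(cost_limit_from_rate(4000, 20, 20))  # -> 200
```

Note how halving either the delay or the rate limit halves the derived cost limit, which is the behavior shown in the log examples below.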
That's the same cost limit that was there before, except now it's
derived from that maximum write rate figure. vacuum_cost_limit is gone
as a GUC, replaced with this new vacuum_cost_rate_limit. Internally,
the cost limit hasn't gone anywhere though. All of the entry
points into vacuum and autovacuum derive an internal-only
VacuumCostLimit as part of any setup or rebalance operation. But
there's no change to underlying cost management code; the cost limit is
budgeted and accounted for in exactly the same way as it always was.
Why is this set in kilobytes/second rather than using something based on
a memory unit? That decision was made after noting these values can
also be set in relation options. Making relation options aware of
memory unit math seemed ambitious relative to its usefulness, and it's
not like KB/s is hard to work with in this context.
OK, I lied; technically this is set in kibibytes per second right now.
Ran out of energy before I got to confirming that was consistent with
all similar GUC settings; will put on my pedantic hat later to check that.
One nice thing that falls out of this is that the *vacuum_cost_delay
settings essentially turn into a boolean. If the delay is 0, cost
limits are off; set it to any other value, and the rate you get is
driven almost entirely by vacuum_cost_rate_limit (the disclaimer is
mainly because issues like sleep time accuracy are possible). You can see
that at work in these examples:
LOG: cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG: cost limit=100 based on rate limit=4000 KB/s delay=10 dirty cost=20
LOG: cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG: cost limit=100 based on rate limit=2000 KB/s delay=20 dirty cost=20
Halve the delay to 10, and the cost limit drops in half too to keep the
same I/O rate. Halve the rate limit instead, and the cost limit halves
with it. Most sites will never need to change the delay figure from
20ms; they can just focus on tuning the more human-readable rate limit
figure instead. The main reason I thought of to keep the delay around
as an integer still is sites trying to minimize power use; they might
increase it from the normally used 20ms. I'm not as worried about
postgresql.conf settings bloat to support a valid edge use case, so long
as most sites find a setting unnecessary to tune. And the autovacuum
side of cost delay should fall into that category with this change.
Here's a full autovacuum log example. This shows how close to the KBps
rate the server actually got, along with the autovacuum cost balancing
working the same old way (this is after running the boring
autovac-big.sql test case attached here too):
2012-01-15 02:10:51.905 EST: LOG: cost limit=200 based on rate
limit=4000 KB/s delay=20 dirty cost=20
2012-01-15 02:10:51.906 EST: DEBUG: autovac_balance_cost(pid=13054
db=16384, rel=16444, cost_rate_limit=4000, cost_limit=200,
cost_limit_base=200, cost_delay=20)
2012-01-15 02:11:05.127 EST: DEBUG: "t": removed 4999999 row versions
in 22124 pages
2012-01-15 02:11:05.127 EST: DEBUG: "t": found 4999999 removable,
5000001 nonremovable row versions in 44248 out of 44248 pages
2012-01-15 02:11:05.127 EST: DETAIL: 0 dead row versions cannot be
removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.27s/0.97u sec elapsed 131.73 sec.
2012-01-15 02:11:05.127 EST: LOG: automatic vacuum of table
"gsmith.public.t": index scans: 0
pages: 0 removed, 44248 remain
tuples: 4999999 removed, 5000001 remain
buffer usage: 48253 hits, 40296 misses, 43869 dirtied
avg read rate: 2.390 MiB/s, avg write rate: 2.602 MiB/s
system usage: CPU 0.27s/0.97u sec elapsed 131.73 sec
I think this new setting suggests the recently added logging is missing
a combined I/O figure, something that measures reads + writes over the
time period. This is good enough to demonstrate the sort of UI I was
aiming for in action though. Administrator says "don't write more than
4MiB/s", and when autovacuum kicks in it averages 2.4 read + 2.6 write.
I see this change as taking something that feels like black arts tuning
magic now and turning it into a simple interface that's for the most
part intuitive. None of the flexibility is lost here: you can still
retune the relative dirty vs. miss vs. hit costs, you have the option of
reducing the delay to a small value on a busy server where small sleep
values are possible. But you don't have to do any of that just to tweak
autovacuum up or down at a gross level; you can just turn the simple "at
most this much write I/O" knob instead.
All implementation notes and concerns from here down.
The original cost limit here defaulted to 200 and allowed a range of 1
to 10000. The new default of 4000 shows these values need to be 20X as
large. The maximum was adjusted to 200000 KBps. Look at that, the
maximum rate you can run cost delay vacuum at is 200MB/s; there's
another good example of something that used to be mysterious to compute
that is obvious now.
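That maximum is easy to verify with the same worst-case arithmetic (an illustrative calculation, assuming 8 KB blocks, the 20ms default delay, and the default dirty cost of 20):

```python
# Why the old maximum cost limit of 10000 corresponds to 200 MB/s:
# with a 20 ms delay, vacuum wakes 50 times per second, and in the
# worst case every cost unit spent is part of a dirtied 8 KB page
# at cost 20.
BLOCK_KB = 8
wakeups_per_sec = 1000 / 20       # 20 ms delay -> 50 cycles/second
pages_per_cycle = 10000 / 20      # old max cost limit / dirty cost
rate_kb_s = wakeups_per_sec * pages_per_cycle * BLOCK_KB
print(rate_kb_s)  # -> 200000.0 KB/s, i.e. 200 MB/s
```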
I didn't adjust the lower limit downward, so it's actually possible to
set the new code to only operate at 1/200 the minimum speed you could
set before. On balance this is surely a reduction in foot-gun
aiming though, and I could make the minimum 200 to eliminate it. Seems
a needless detail to worry about.
This code is new and just complicated enough that there's surely some
edge cases I broke here. In particular I haven't put together a good
concurrent autovacuum test yet to really prove all the balancing logic
still works correctly. Need to test that with a settings change in the
middle of a long vacuum too.
There's one serious concern I don't have a quick answer to. What do we
do with in-place upgrade of relations that specified a custom
vacuum_cost_limit? I can easily chew on getting the right logic to
convert those to equivalents in the new setting style, but I am not prepared
to go solely on the hook for all in-place upgrade work one might do
here. Would this be easiest to handle as one of those dump/restore
transformations? My guess is that's more sensible than the alternative
of making an on-read converter that only writes in the new format, then
worrying about upgrading all old pages before moving forward. While
this could be an interesting small test case for that sort of thing, I'd
rather not be patient #1 for that part of the long-term in-place upgrade
path right now.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 15.01.2012 10:24, Greg Smith wrote:
That got me thinking: if MB/s is what everyone wants to monitor, can we
provide a UI to set these parameters that way too? The attached patch is
a bit rough still, but it does that. The key was recognizing that the
cost delay plus cost limit can be converted into an upper limit on cost
units per second, presuming the writes themselves are free. If you then
also assume the worst case--that everything will end up dirty--by
throwing in the block size, too, you compute a maximum rate in MB/s.
That represents the fastest you can possibly write.
+1. I've been thinking we should do that for a long time, but haven't
gotten around to it.
I think it makes more sense to use the max read rate as the main knob,
rather than write rate. That's because the max read rate is higher than
the write rate, when you don't need to dirty pages. Or do you think
saturating the I/O system with writes is so much bigger a problem than
read I/O that it makes more sense to emphasize the writes?
I was thinking of something like this, in postgresql.conf:
# - Vacuum Throttling -
#vacuum_cost_page_miss = 1.0 # measured on an arbitrary scale
#vacuum_cost_page_dirty = 2.0 # same scale as above
#vacuum_cost_page_hit = 0.1 # same scale as above
#vacuum_rate_limit = 8MB # max reads per second
This is now similar to the cost settings for the planner, which is good.
There's one serious concern I don't have a quick answer to. What do we
do with in-place upgrade of relations that specified a custom
vacuum_cost_limit? I can easily chew on getting the right logic to
convert those to equals in the new setting style, but I am not prepared
to go solely on the hook for all in-place upgrade work one might do
here. Would this be easiest to handle as one of those dump/restore
transformations?
It needs to be handled at dump/restore time. I'm not sure where that
transformation belongs, though. Do we have any precedent for this? I
think we have two options:
1. Accept the old "autovacuum_cost_limit" setting in CREATE TABLE, and
transform it immediately into the corresponding autovacuum_rate_limit setting.
2. Transform in pg_dump, so that the CREATE TABLE statements in the dump
use the new autovacuum_rate_limit setting.
The advantage of option 1 is that dumps taken with old 9.1 pg_dump
still work on a 9.2 server. We usually try to preserve that
backwards-compatibility, although we always recommend using the pg_dump
from the newer version on upgrade. However, you need to know the
vacuum_cost_page_miss setting effective in the old server to do the
transformation correctly (or vacuum_cost_page_dirty, if we use the write
max rate as the main knob as you suggested), and we don't have access
to that when restoring a dump.
My guess is that's more sensible than the alternative
of making an on-read converter that only writes in the new format, then
worrying about upgrading all old pages before moving forward.
This doesn't require any page format changes, so I don't think the above
sentence makes any sense.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 01/15/2012 04:17 PM, Heikki Linnakangas wrote:
I think it makes more sense to use the max read rate as the main knob,
rather than write rate. That's because the max read rate is higher
than the write rate, when you don't need to dirty pages. Or do you
think saturating the I/O system with writes is so much bigger a
problem than read I/O that it makes more sense to emphasize the writes?
I haven't had the I/O rate logging available for long enough to have a
good feel for which is more important to emphasize. I'm agnostic on
this. I'd have no problem accepting the argument that exposing the
larger of the two rates--which is the read one--makes for a cleaner UI.
Or that it is the one more like other knobs setting precedents here.
My guess is that the changed documentation will actually be a bit
cleaner that way. I give an example in the patch of how read and write
rate are related, based on the ratio of the values for dirty vs. hit. I
wasn't perfectly happy with how that was written yet, and I think it
could be cleaner if the read rate is the primary tunable.
We usually try to preserve that backwards-compatibility, although we
always recommend using the pg_dump from the newer version on upgrade.
However, you need to know the vacuum_cost_page_miss setting effective
in the old server to do the transformation correctly (or
vacuum_cost_page_dirty, if we use the write max rate as the main knob
as you suggested), and we don't have access when restoring a dump.
If someone does a storage parameter change to
autovacuum_vacuum_cost_limit but doesn't touch
autovacuum_vacuum_cost_delay there, I think it's possible to need the
GUC value for autovacuum_vacuum_cost_delay too, which can then refer to
vacuum_cost_delay as well.
I don't think that tweaking these parameters, particularly at the
storage options level, is a popular thing to do. I've run into
customers who made changes there while trying to navigate the complexity
of autovacuum tuning, but not very many. My guess as I think about that
history is that I've ended up reverting them as often, or maybe even
slightly more often, than I've ended up keeping them around. It's been
hard to do well. And the level of PostgreSQL deployment that reaches
that stage of tuning, where they managed to tweak those productively,
doesn't seem likely to do a blind upgrade to me.
One of the reasons I thought now was a good time to work on this change
is because there's already things brewing that are going to make 9.2
break a few more things than anyone would like, all for long-term
positive benefits. recovery.conf and pg_stat_activity changes are the
two of those I've been tracking the closest.
My current thinking on this is that we ought to learn a lesson from the
8.3 casting breakage saga and provide a clear "9.2 Upgrade Migration
Guide" that goes beyond just listing what changed in the release notes.
Aim more toward having a checklist of things to look for and tools to
help find them. In this case, having the migration guide include a
query that pokes through the catalogs looking for this particular
customization would be helpful. It would be easy to produce a bit of
post-9.2 upgrade SQL that converted a customization here if you started
by running something against the existing installation.
And that should be a wiki page so it can be extended as new things are
discovered, some of which will include application specific guidance.
The rough idea in this direction I put together for 8.3 (it didn't catch
on for later versions) is at
http://wiki.postgresql.org/wiki/Version_History Note how that grew to
include tsearch2 migration info and MediaWiki specific advice at one
point (some of that was then lost in a Planet fire, but you can get the
idea just from the article titles). Needing a version migration guide
with specific details about issues like this is something I think
PostgreSQL needs to accept as part of the release cycle. Expecting that
pg_upgrade can transparently handle every possible change would be
setting a high bar to clear, higher than I think is expected by the
database industry at large.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Jan 15, 2012, at 8:13 PM, Greg Smith wrote:
On 01/15/2012 04:17 PM, Heikki Linnakangas wrote:
I think it makes more sense to use the max read rate as the main knob, rather than write rate. That's because the max read rate is higher than the write rate, when you don't need to dirty pages. Or do you think saturating the I/O system with writes is so much bigger a problem than read I/O that it makes more sense to emphasize the writes?
I haven't had the I/O rate logging available for long enough to have a good feel for which is more important to emphasize. I'm agnostic on this. I'd have no problem accepting the argument that exposing the larger of the two rates--which is the read one--makes for a cleaner UI. Or that it is the one more like other knobs setting precedents here.
Could we expose both?
On our systems writes are extremely cheap... we don't do a ton of them (relatively speaking), so they tend to just fit into BBU cache. Reads on the other hand are a lot more expensive, at least if they end up actually hitting disk. So we actually set page_dirty and page_hit the same.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 01/17/2012 09:00 PM, Jim Nasby wrote:
Could we expose both?
On our systems writes are extremely cheap... we don't do a ton of them (relatively speaking), so they tend to just fit into BBU cache. Reads on the other hand are a lot more expensive, at least if they end up actually hitting disk. So we actually set page_dirty and page_hit the same.
My thinking had been that you set one as the rate tunable, and then the
rates of the others can be adjusted by advanced users using the ratio
between the primary and the other ones. So at the defaults:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
Setting a read rate cap will imply a write rate cap at 1/2 the value.
Your setup would then be:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 1
Which would still work fine if the new tunable was a read cap. If the
cap is a write one, though, this won't make any sense. It would allow
reads to happen at 10X the speed of writes, which is weird.
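That ratio argument can be made concrete (an illustrative sketch of the arithmetic, not code from the patch; the implied cap for a secondary cost type is just the primary cap scaled by the cost ratio):

```python
# If the single rate knob caps reads (page misses), the implied cap
# for writes (page dirties) follows from the ratio of the two page
# costs. Illustrative arithmetic only; parameter names mirror the
# GUCs discussed above.

def implied_write_cap(read_cap_kb_s, page_miss_cost, page_dirty_cost):
    return read_cap_kb_s * page_miss_cost / page_dirty_cost

# Default costs (miss=10, dirty=20): write cap is half the read cap
print(implied_write_cap(8000, 10, 20))  # -> 4000.0

# Jim's setup (dirty=1): writes allowed at 10X the read rate
print(implied_write_cap(8000, 10, 1))   # -> 80000.0
```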
I need to go back and consider each of the corner cases here, where
someone wants one of [hit,miss,dirty] to be an unusual value relative to
the rest. If I can't come up with a way to make that work as it does
now in the new code, that's a problem. I don't think it really is, it's
just that people in that situation will need to scale all three upwards. It's
still a simpler thing to work out than the current situation, and this
is an unusual edge case.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Jan 18, 2012, at 3:49 AM, Greg Smith wrote:
On 01/17/2012 09:00 PM, Jim Nasby wrote:
Could we expose both?
On our systems writes are extremely cheap... we don't do a ton of them (relatively speaking), so they tend to just fit into BBU cache. Reads on the other hand are a lot more expensive, at least if they end up actually hitting disk. So we actually set page_dirty and page_hit the same.
My thinking had been that you set one as the rate tunable, and then the rates of the others can be adjusted by advanced users using the ratio between the primary and the other ones. So at the defaults:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
Setting a read rate cap will imply a write rate cap at 1/2 the value. Your setup would then be:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 1
Which would still work fine if the new tunable was a read cap. If the cap is a write one, though, this won't make any sense. It would allow reads to happen at 10X the speed of writes, which is weird.
I need to go back and consider each of the corner cases here, where someone wants one of [hit,miss,dirty] to be an unusual value relative to the rest. If I can't come up with a way to make that work as it does now in the new code, that's a problem. I don't think it really is, it's just that people in that situation will need to scale all three upwards. It's still a simpler thing to work out than the current situation, and this is an unusual edge case.
What about doing away with all the arbitrary numbers completely, and just state data rate limits for hit/miss/dirty?
BTW, this is a case where it would be damn handy to know if the miss was really a miss or not... in the case where we're already rate limiting vacuum, could we afford the cost of get_time_of_day() to see if a miss actually did have to come from disk?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Sun, Jan 15, 2012 at 4:17 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
+1. I've been thinking we should do that for a long time, but haven't gotten
around to it.
I think it makes more sense to use the max read rate as the main knob,
rather than write rate. That's because the max read rate is higher than the
write rate, when you don't need to dirty pages. Or do you think saturating
the I/O system with writes is so much bigger a problem than read I/O that it
makes more sense to emphasize the writes?
I was thinking of something like this, in postgresql.conf:
# - Vacuum Throttling -
#vacuum_cost_page_miss = 1.0 # measured on an arbitrary scale
#vacuum_cost_page_dirty = 2.0 # same scale as above
#vacuum_cost_page_hit = 0.1 # same scale as above
#vacuum_rate_limit = 8MB # max reads per second
This is now similar to the cost settings for the planner, which is good.
I have to say that I find that intensely counterintuitive. The
current settings are not entirely easy to tune correctly, but at least
they're easy to explain. What does that 8MB mean and how does it
relate to vacuum_cost_page_miss? If I double vacuum_cost_page_miss,
does that effectively also double the cost limit, so that dirty pages
and hits become relatively cheaper? If so, then I think what that
really means is that the limit is 8MB only if there are no hits and no
dirtied pages - otherwise it's less, and the amount by which it is
less is the result of some arcane calculation. Ugh!
I can really imagine people wanting to limit two things here: either
they want to limit the amount of read I/O, or they want to limit the
amount of write I/O. If your database fits in physical memory you
probably don't care about the cost of page misses very much at all,
but you probably do care about how much data you dirty. OTOH, if your
database doesn't fit in physical memory and you have a relatively
small percentage of dirty pages because the tables are lightly
updated, dirtying might be pretty secondary; if you care at all, it's
going to be because busying the disk head with large sequential reads
eats up too much of the system's I/O capacity. If we added
vacuum_read_rate_limit and vacuum_dirty_rate_limit, totally
independently of each other, and threw the current system where
those two things get mixed together in one big bucket out the window
completely, I could maybe sign onto that as an improvement to the UI.
But even then, I think we need to balance the amount of the gain
against the backward compatibility problems we're going to create. If
we start removing autovacuum options, then, as Greg notes, we have to
figure out how to make old pg_dumps load into new databases and
hopefully do something close to what the DBA intended. And the DBA
will have to learn the new system. I'm not sure we're really going to
get enough mileage out of changing this to justify the hassle. It's
basically a cosmetic improvement, and I think we should be careful
about breaking compatibility for cosmetic improvements, especially at
the end of a release cycle when we're under time pressure.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jan 15, 2012 at 9:17 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I think it makes more sense to use the max read rate as the main knob,
rather than write rate. That's because the max read rate is higher than the
write rate, when you don't need to dirty pages. Or do you think saturating
the I/O system with writes is so much bigger a problem than read I/O that it
makes more sense to emphasize the writes?
Yes, the writes are the more important of the two.
Too many writes at one time can overflow hardware caches, so things
tend to get much worse beyond a certain point.
Also, rate limiting writes means we rate limit WAL rate also which is
very important.
I'd like this to apply to large DDL, not just VACUUMs.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of Thu Jan 19 16:05:36 -0300 2012:
On Sun, Jan 15, 2012 at 9:17 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I think it makes more sense to use the max read rate as the main knob,
rather than write rate. That's because the max read rate is higher than the
write rate, when you don't need to dirty pages. Or do you think saturating
the I/O system with writes is so much bigger a problem than read I/O that it
makes more sense to emphasize the writes?
Yes, the writes are the more important of the two.
Too many writes at one time can overflow hardware caches, so things
tend to get much worse beyond a certain point.
Also, rate limiting writes means we rate limit WAL rate also, which is
very important.
I'd like this to apply to large DDL, not just VACUUMs.
More generally, this can sometimes be useful in general queries as well.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 1/18/12 4:18 PM, Jim Nasby wrote:
What about doing away with all the arbitrary numbers completely, and just state data rate limits for hit/miss/dirty?
Since many workloads will have a mix of all three, it still seems like
there's some need for weighing these individually, even if they each got
their own rates. If someone says read=8MB/s and write=4MB/s (the
current effective defaults), I doubt they would be happy with seeing
12MB/s happen.
BTW, this is a case where it would be damn handy to know if the miss was really a miss or not... in the case where we're already rate limiting vacuum, could we afford the cost of get_time_of_day() to see if a miss actually did have to come from disk?
We certainly might if it's a system where timing information is
reasonably cheap, and measuring that exact area will be easy if the
timing test contrib module submitted into this CF gets committed. I
could see using that to re-classify some misses as hits if the read
returns fast enough.
There's not an obvious way to draw that line though. The "fast=hit" vs.
"slow=miss" transition happens at very different place on SSD vs.
regular disks, as the simplest example. I don't see any way to wander
down this path that doesn't end up introducing multiple new GUCs, which
is the opposite of what I'd hoped to do--which was at worst to keep the
same number, but reduce how many were likely to be touched.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 1/19/12 1:10 PM, Robert Haas wrote:
I have to say that I find that intensely counterintuitive. The
current settings are not entirely easy to tune correctly, but at least
they're easy to explain.
I attempt to explain those settings to people in training classes about
once a month. It's never been anything but a complete disaster. I am
barely concerned about preserving the current UI because, as far as I've
been able to tell, there are only a handful of PostgreSQL installations
on the planet that have managed to use it happily. Even the ones that
do have a non-default setup that works usually flailed about for some
time until they got something that works, over a few frustrating months.
And the result is settings few dare touch for fear of breaking it.
It's also worth pointing out that VACUUM problems are very close to the
top of the list of problems larger sites run into. So right now we have
an inscrutable UI around an often essential part of the database to
tune, one that any production site that gets over a few hundred GB of
data in it will run into problems with. I wouldn't care about this area
if it weren't for people screaming about how bad it is every time the
topic comes up.
If there's anyone out there who has run a larger PostgreSQL database and
not at some point been extremely frustrated with how the current VACUUM
settings are controlled, please speak up and say I'm wrong about this.
I thought it was well understood the UI was near unusably bad, it just
wasn't obvious what to do about it.
What does that 8MB mean and how does it
relate to vacuum_cost_page_miss? If I double vacuum_rate_page_miss,
does that effectively also double the cost limit, so that dirty pages
and hits become relatively cheaper? If so, then I think what that
really means is that the limit is 8MB only if there are no hits and no
dirtied pages - otherwise it's less, and the amount by which it is
less is the result of some arcane calculation. Ugh!
Saying what I suggested is an arcane calculation strikes me as pretty
weird--we'd be hard pressed to design a more arcane calculation than the
one that's already happening.
The feedback here so far seems to lead toward making independent read
and write knobs. I'm going to chew on the scenarios Robert described
and the ones Jim has been commenting on and see if I can refactor this
into something friendlier that addresses them.
As for the suggestion that I'm bringing this up a bit late in the
release cycle, I've been trying. My first submission pushing in this
direction--improving the logging first, which is needed before you can
usefully measure a behavior change--happened back in September. I've
been moving this area as fast as I can get it to budge. I'm concerned
now that much will be made of improved performance in 9.2, leading to
people converting even larger systems than they used to. And it's not
hard at all to find a large system where inability to tune vacuum easily
is the top limiting factor on overall performance.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Jan 19, 2012 at 5:39 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 1/19/12 1:10 PM, Robert Haas wrote:
I have to say that I find that intensely counterintuitive. The
current settings are not entirely easy to tune correctly, but at least
they're easy to explain.
I attempt to explain those settings to people in training classes about once
a month. It's never been anything but a complete disaster. I am barely
concerned about preserving the current UI because, as far as I've been able
to tell, there are only a handful of PostgreSQL installations on the planet
that have managed to use it happily. Even the ones that do have a
non-default setup that works usually flailed about for some time until they
got something that works, over a few frustrating months. And the result is
settings few dare touch for fear of breaking it.
It's also worth pointing out that VACUUM problems are very close to the top
of the list of problems larger sites run into. So right now we have an
inscrutable UI around an often essential part of the database to tune, one
that any production site that gets over a few hundred GB of data in it will
run into problems with. I wouldn't care about this area if it weren't for
people screaming about how bad it is every time the topic comes up.
If there's anyone out there who has run a larger PostgreSQL database and not
at some point been extremely frustrated with how the current VACUUM settings
are controlled, please speak up and say I'm wrong about this. I thought it
was well understood the UI was near unusably bad, it just wasn't obvious
what to do about it.
What does that 8MB mean and how does it
relate to vacuum_cost_page_miss? If I double vacuum_cost_page_miss,
does that effectively also double the cost limit, so that dirty pages
and hits become relatively cheaper? If so, then I think what that
really means is that the limit is 8MB only if there are no hits and no
dirtied pages - otherwise it's less, and the amount by which it is
less is the result of some arcane calculation. Ugh!
Saying what I suggested is an arcane calculation strikes me as pretty
weird--we'd be hard pressed to design a more arcane calculation than the one
that's already happening.
Perhaps so, but I'm willing to bet that if we have a variable that
looks like a pure read limit or a pure dirty limit and really is not,
we'll only have succeeded in confusing people. :-)
The feedback here so far seems to lead toward making independent read and
write knobs. I'm going to chew on the scenarios Robert described and the
ones Jim has been commenting on and see if I can refactor this into
something friendlier that addresses them.

As for the suggestion that I'm bringing this up a bit late in the release
cycle, I've been trying. My first submission pushing in this
direction--improving the logging first, which is needed before you can
usefully measure a behavior change--happened back in September. I've been
moving this area as fast as I can get it to budge. I'm concerned now that
much will be made of improved performance in 9.2, leading to people
converting even larger systems than they used to. And it's not hard at all
to find a large system where inability to tune vacuum easily is the top
limiting factor on overall performance.
I certainly didn't intend to come across as disparaging your work on
this topic. I understand that there are big problems with the way
things work now; I'm just cautious about trying to replace them too
hastily with something that may not turn out to be any better. Of
course, if we can replace it with something that we're sure is
actually an improvement, I'm all in favor of that. But, IMHO, the
problems in this area are too serious to be solved by renaming the
knobs. At most, we're going to buy ourselves a little time to come up
with a better solution.
IMHO, and at the risk of repeating myself, one of the big problems in
this area is that we're making the user guess something that we really
ought to be figuring out for them. Just as users want checkpoints to
run as slowly as possible while still not bumping into the next
checkpoint, they'd presumably like vacuum to run as slowly as possible
without bumping into the next vacuum. Instead, we make them tell us
how fast they'd like it to run, which requires them to guess a value
high enough to finish soon enough but low enough to minimize the
impact on the rest of the system.
Another problem is that the vacuum algorithm itself could, I think, be
made much smarter. We could teach HOT to prune pages that contain no
HOT chains but do contain dead tuples. That would leave dead line
pointers behind, but that's not nearly as bad as leaving the entire
tuple behind. We could, as Simon and others have suggested, have one
threshold for vacuuming the heap (i.e. reclaiming dead tuples) and
another for vacuuming the indexes (i.e. reclaiming dead line
pointers). That would open the door to partial vacuuming: just vacuum
half a gigabyte or so of the heap, and then move on; the next vacuum
can pick up where that one left off, at least up to the point where we
decide we need to make an index pass; it would possibly also allow us
to permit more than one vacuum on the same table at the same time,
which is probably needed for very large tables. We could have
backends that see dead tuples on a page throw them over the fence
to the background writer for immediate pruning. I blather, but I
guess my point is that I really hope we're going to do something
deeper here at some point in the near future, whatever becomes of the
proposals now on the table.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I chewed a bit on Heikki's comment that similarity to the query planning
parameters might be useful, and Robert's that being able to explain how
the feature works more easily has value. I have an initial adjustment
of my general idea that I think moves usefully in both those directions.
The existing VACUUM cost constants look like this:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
These could be adjusted to instead be ratios like the query planner ones
(seq_page_cost, random_page_cost, etc.), referenced off a value of 1.0
for a page miss (~= a read is expected):
vacuum_cost_page_hit = 0.1
vacuum_cost_page_miss = 1.0
vacuum_cost_page_dirty = 2.0
Now add in the new setting, which is explicitly said to be the read value:
vacuum_cost_read_limit = 8000 # maximum page miss read rate in kilobytes/second
And I can shuffle the numbers around internally such that things still
work exactly the same, at the default parameters. And then anyone who
spends time learning how either the query planning or vacuum cost ratio
constants work will find the learning curve to pick up the other set easier.
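To make the conversion concrete, here is a sketch of the arithmetic as I read the proposal (my own arithmetic, not code from the patch): the internal cost limit falls out of the read limit, the cost delay, and the page-miss cost, assuming the worst case where every cost unit spent is a page miss that turns into a read.

```python
# Hypothetical derivation of the internal cost limit from a worst-case
# read rate, assuming every cost unit spent corresponds to a page miss.
# Names are illustrative only; they are not the patch's identifiers.

BLOCK_SIZE_KB = 8  # standard 8kB pages

def derive_cost_limit(read_limit_kbps, cost_delay_ms, page_miss_cost):
    """Cost budget per delay interval that caps page-miss reads at
    read_limit_kbps when every page processed is a miss."""
    intervals_per_sec = 1000.0 / cost_delay_ms
    pages_per_sec = read_limit_kbps / BLOCK_SIZE_KB
    pages_per_interval = pages_per_sec / intervals_per_sec
    return pages_per_interval * page_miss_cost

# Old-style units (miss=10, 20ms delay, 8000 kB/s read limit) recover
# the classic vacuum_cost_limit default of 200:
print(derive_cost_limit(8000, 20, 10))   # 200.0
# The new ratio units (miss=1.0) give the same behavior, rescaled:
print(derive_cost_limit(8000, 20, 1.0))  # 20.0
```

This matches the claim that things work exactly the same at the default parameters: scaling all three page costs down by 10 scales the derived limit down by 10 as well.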
An interesting fall-out of this refactoring is that old postgresql.conf
settings moved forward for *all* these values will still work fine. The
ratios are right and the internal computation won't care. The math is
just more complicated to explain when vacuum_cost_page_miss is anything
but 1.0, which is a problem the manual doesn't have to address. We
don't worry about making every query planner parameter discussion
consider what happens if someone moves seq_page_cost around, this will
put vacuum_cost_page_miss into the same reference constant category.
The only problem is for someone who changed one but not all of them in
their old configuration; that's going to give an unexpected result.
It might be a bit more straightforward yet if things were renamed so it
was more obvious that page miss~=read, but I haven't seen a good way to
do that yet. Renaming the reference cost value to vacuum_cost_page_read
has two problems. It makes the backward compatibility issues larger,
and it's not quite true. The way I think this should be explained, they
really aren't the same; that's why I used ~= above. A page miss is not
guaranteed to be a read, it just is expected to be one in the worst
case. The read rate that vacuum page misses introduce will not be
exactly the same as vacuum_cost_read_limit--but it will be below that
limit, which is all it claims to be.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Jan 19, 2012 at 11:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I chewed a bit on Heikki's comment that similarity to the query planning
parameters might be useful, and Robert's that being able to explain how the
feature works more easily has value. I have an initial adjustment of my
general idea that I think moves usefully in both those directions.

The existing VACUUM cost constants look like this:
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20

These could be adjusted to instead be ratios like the query planner ones
(seq_page_cost, random_page_cost, etc.), referenced off a value of 1.0 for
page miss ~= a read is expected:

vacuum_cost_page_hit = 0.1
vacuum_cost_page_miss = 1.0
vacuum_cost_page_dirty = 2.0

Now add in the new setting, which is explicitly said to be the read value:

vacuum_cost_read_limit = 8000 # maximum page miss read rate in kilobytes/second

And I can shuffle the numbers around internally such that things still work
exactly the same, at the default parameters. And then anyone who spends
time learning how either the query planning or vacuum cost ratio constants
work will find the learning curve to pick up the other set easier.
That may be a little better, but I still don't think it's worth
breaking backward compatibility for. I mean, suppose I don't care
about read rate, but I want to limit my dirty data rate to 1MB/s.
What parameters should I set?
It might be a bit more straightforward yet if things were renamed so it was
more obvious that page miss~=read, but I haven't seen a good way to do that
yet. Renaming the reference cost value to vacuum_cost_page_read has two
problems. It makes the backward compatibility issues larger, and it's not
quite true. The way I think this should be explained, they really aren't
the same; that's why I used ~= above. A page miss is not guaranteed to be a
read, it just is expected to be one in the worst case. The read rate that
vacuum page misses introduce will not be exactly the same as
vacuum_cost_read_limit--but it will be below that limit, which is all it
claims to be.
Maybe, but I still think having the read rate limit the dirty rate or
vice versa is *really* weird.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/20/2012 10:37 AM, Robert Haas wrote:
On Thu, Jan 19, 2012 at 11:29 PM, Greg Smith<greg@2ndquadrant.com> wrote:
vacuum_cost_page_hit = 0.1
vacuum_cost_page_miss = 1.0
vacuum_cost_page_dirty = 2.0

Now add in the new setting, which is explicitly said to be the read value:

vacuum_cost_read_limit = 8000 # maximum page miss read rate in kilobytes/second

That may be a little better, but I still don't think it's worth
breaking backward compatibility for. I mean, suppose I don't care
about read rate, but I want to limit my dirty data rate to 1MB/s.
What parameters should I set?
vacuum_cost_page_dirty = 8.0
The resulting maximum rates will then be:
hit = 80MB/s
miss = 8MB/s
dirty = 1MB/s
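Those per-type maximums fall straight out of dividing the read limit by each cost ratio; a quick sketch of the arithmetic (mine, not the patch's):

```python
# Illustrative arithmetic only: with the read limit referenced at a
# page-miss cost of 1.0, each page type's maximum rate is the read
# limit divided by that type's cost ratio.
READ_LIMIT_KBPS = 8000  # vacuum_cost_read_limit

costs = {"hit": 0.1, "miss": 1.0, "dirty": 8.0}
for kind, ratio in costs.items():
    print(kind, READ_LIMIT_KBPS / ratio / 1000, "MB/s")
# hit 80.0 MB/s
# miss 8.0 MB/s
# dirty 1.0 MB/s
```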
The question you should ask yourself next is "how do I limit my dirty data
rate to 1MB/s in 9.1?" Working that out by hand is a good exercise, to show
just how much less complicated this proposal is than the current state of
things. Show me how it's possible to do that in a way we can expect new DBAs
to follow, and then the idea of keeping strong backwards compatibility here
would have some weight. I see sticking too closely to the current scheme as
more like bug-level compatibility; the current form is fundamentally broken
for most people, simply by being too difficult to use.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, Jan 20, 2012 at 12:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 01/20/2012 10:37 AM, Robert Haas wrote:
On Thu, Jan 19, 2012 at 11:29 PM, Greg Smith<greg@2ndquadrant.com> wrote:
vacuum_cost_page_hit = 0.1
vacuum_cost_page_miss = 1.0
vacuum_cost_page_dirty = 2.0

Now add in the new setting, which is explicitly said to be the read value:

vacuum_cost_read_limit = 8000 # maximum page miss read rate in kilobytes/second

That may be a little better, but I still don't think it's worth
breaking backward compatibility for. I mean, suppose I don't care
about read rate, but I want to limit my dirty data rate to 1MB/s.
What parameters should I set?

vacuum_cost_page_dirty = 8.0
The resulting maximum rates will then be:
hit = 80MB/s
miss = 8MB/s
dirty = 1MB/s
That is, of course, not quite what I asked for. In fact it's likely
that my actual rate will be less than 1MB/s, because there will
probably be a miss for every dirty page. It will probably be about 8/9ths
of a MB/s.
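Robert's 8/9ths figure checks out; a sketch of that arithmetic (my own, under the assumption that every dirtied page is also a miss):

```python
# If every dirtied page is also a miss, each page costs
# miss + dirty = 1.0 + 8.0 = 9.0 units rather than 8.0, so only 8/9
# of the nominal dirty budget is actually available.
read_limit_kbps = 8000
miss, dirty = 1.0, 8.0
# With miss = 1.0, the per-second cost budget equals the read limit
# in pages: 8000 kB/s / 8 kB per page = 1000 units/s.
pages_per_sec = (read_limit_kbps / 8) / (miss + dirty)
dirty_rate_mbps = pages_per_sec * 8 / 1000
print(dirty_rate_mbps)  # ~0.889 MB/s, i.e. 8/9 of 1 MB/s
```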
The question you should ask yourself next is "how do I limit my dirty data
rate to 1MB/s in 9.1?" Working that out by hand is a good exercise, to show
just how much less complicated this proposal is than the current state of
things.
OK, sure. Our block size is 8kB, so we need every 128 blocks to
involve 1000 ms of delay. Obviously there are many combinations of
parameters that will make that work, but here's one: delay for 125
milliseconds after each 16 blocks:
vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 0
vacuum_cost_page_dirty = 1
vacuum_cost_limit = 16
autovacuum_vacuum_cost_delay = 125ms
Maybe that strikes you as worse than what you're proposing; it strikes
me as better. Either way I think it's not going to be a good day for
people who are bad at math. :-(
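For the curious, the arithmetic behind those settings can be verified (my own check of the numbers above, not code from any patch):

```python
# Verifying the 9.1-style parameters: a cost limit of 16 with dirty
# cost 1 means vacuum sleeps after every 16 dirtied pages, and a 125ms
# delay means eight such sleeps per second.
block_kb = 8
cost_limit, page_dirty, delay_ms = 16, 1, 125

pages_per_nap = cost_limit / page_dirty  # 16 pages per delay interval
naps_per_sec = 1000 / delay_ms           # 8 intervals per second
rate_kbps = pages_per_nap * naps_per_sec * block_kb
print(rate_kbps)  # 1024.0 kB/s, about 1MB/s
```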
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jan 19, 2012, at 4:23 PM, Greg Smith wrote:
On 1/18/12 4:18 PM, Jim Nasby wrote:
What about doing away with all the arbitrary numbers completely, and just state data rate limits for hit/miss/dirty?
Since many workloads will have a mix of all three, it still seems like there's some need for weighing these individually, even if they each got their own rates. If someone says read=8MB/s and write=4MB/s (the current effective defaults), I doubt they would be happy with seeing 12MB/s happen.
BTW, this is a case where it would be damn handy to know if the miss was really a miss or not... in the case where we're already rate limiting vacuum, could we afford the cost of gettimeofday() to see if a miss actually did have to come from disk?
We certainly might if it's a system where timing information is reasonably cheap, and measuring that exact area will be easy if the timing test contrib module submitted into this CF gets committed. I could see using that to re-classify some misses as hits if the read returns fast enough.
There's not an obvious way to draw that line though. The "fast=hit" vs. "slow=miss" transition happens at very different place on SSD vs. regular disks, as the simplest example. I don't see any way to wander down this path that doesn't end up introducing multiple new GUCs, which is the opposite of what I'd hoped to do--which was at worst to keep the same number, but reduce how many were likely to be touched.
Your two comments together made me realize something... at the end of the day people don't care about MB/s. They care about impact to other read and write activity in the database.
What would be interesting is if we could monitor how long all *foreground* IO requests took. If they start exceeding some number, that means the system is at or near full capacity, and we'd like background stuff to slow down.
Dealing with SSDs vs real media would be a bit challenging... though, I think it would only be an issue if the two were randomly mixed together. Kept separately I would expect them to have distinct behavior patterns that could be measured and identified.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Jim Nasby wrote:
Your two comments together made me realize something... at the end of the day people don't care about MB/s. They care about impact to other read and write activity in the database.
What would be interesting is if we could monitor how long all *foreground* IO requests took. If they start exceeding some number, that means the system is at or near full capacity, and we'd like background stuff to slow down.
My hope for 9.2 was to get VACUUM moved over into some human-readable
units. Having the whole thing work only via these abstract cost units
is driving most of my customers with larger databases crazy. The patch
I suggested was the easiest refactoring I thought moved in the right
direction. While it may not be the perfect thing to care about, the
very positive reaction I've gotten to the already landed patch to log in
MB/s has suggested to me people are a lot more comfortable with that
than the cost limit numbers.
For 9.3, this whole mess needs to become integrated with a full-system
monitoring approach, to really solve this well. pg_stat_bgwriter knows
how many writes are coming from the various parts of the system, the
total amount of write I/O. Given that, I can turn VACUUM completely
dynamic based on what else is happening in many common situations. The
sort of end goal I was thinking about was being able to say something like
"let VACUUM use up to 4MB/s on writes, but subtract off the average
write level of everything else". Now it's a background process running
only when there's capacity to spare for it. You could turn it up a lot
higher, if you knew it was only going to run at that level when the
system wasn't as busy. That's one reason I started by suggesting a
write-based limit; it fit into that longer-range plan better. Maybe
that idea is junk and focusing on actual read I/O is the real problem
with VACUUM for most people. I can tell you once I get more data out of
systems that are logging in MB/s.
If instead or in addition we get some better field data on systems that
can afford to time a lot more things, and then start building feedback
limiters based on how long all sorts of operations take to occur, that's
a whole different parallel approach for auto-tuning this. I haven't
thought about that as much simply because it only just became clear
recently when the timing data is cheap to collect. I need to get a lot
more production server data about that overhead to work with here too.
Dealing with SSDs vs real media would be a bit challenging... though, I think it would only be an issue if the two were randomly mixed together. Kept separately I would expect them to have distinct behavior patterns that could be measured and identified.
This might just turn into another one of those things where we will
eventually need to have some more information on a per-tablespace
basis. I envision allowing the server to collect more timing data as
being something you can turn on for a bit, let it populate statistics
about just what fast or slow means for each tablespace. Then you can
keep those results around to guide future decisions even after timing is
turned off. Maybe toggle it back on one day a month to make sure the
numbers are still sane, if it's too expensive to time things every day.
On Sat, Jan 21, 2012 at 5:54 PM, Jim Nasby <jim@nasby.net> wrote:
What would be interesting is if we could monitor how long all *foreground* IO requests took. If they start exceeding some number, that means the system is at or near full capacity, and we'd like background stuff to slow down.
There's something to that...
On occasion, we've had a bit of "hue and cry" when sysadmins noticed
that nearly 100% of bandwidth was being chewed up by VACUUM.
A closer look showed that there wasn't a *true* problem, as there
wasn't anything else material looking for I/O. Letting VACUUM have
all the bandwidth it wants when the system isn't otherwise busy is a
pretty fine idea.
At such times, limiting VACUUM is counterproductive; it would be
better to let it go through and clean up as many tables in as short a
time as possible.
However, as soon as the number of processes waiting for I/O starts to
grow, you want VACUUM to back off.
That's an entirely more "dynamic" kind of dynamic than the suggested
parameterization indicates. There's enough there that I'm not yet sure
how to measure it, so it may be necessary to start with the more
static indicator of "maximum acceptable bandwidth usage."
As long as the parameters can be changed without requiring a
postmaster restart, it should be possible to do something more
adaptive if and when measurements emerge.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
On 19/01/12 17:39, Greg Smith wrote:
On 1/19/12 1:10 PM, Robert Haas wrote:
I have to say that I find that intensely counterintuitive. The
current settings are not entirely easy to tune correctly, but at least
they're easy to explain.

If there's anyone out there who has run a larger PostgreSQL database
and not at some point been extremely frustrated with how the current
VACUUM settings are controlled, please speak up and say I'm wrong
about this. I thought it was well understood the UI was near unusably
bad, it just wasn't obvious what to do about it.
We are frustrated, but mostly our frustration is not about the
somewhat inscrutable knobs but about the inscrutable meters, or
lack thereof.
Postgres (auto or manual for that matter) vacuuming and analyzing
is essentially a performance tuning problem without a good way to
measure the current performance, the fact that the knobs to turn
are confusing as well is secondary.
What I think is missing is a clear way to know if you are vacuuming
(and analyzing) enough, and how much you are paying for that.
At the moment we are basically changing the knobs blindly, based on
some back-of-the-envelope calculations and hearsay. Then sometimes,
months later, we find out that oops, we haven't been analyzing enough,
and that's why the planner is now picking a "bad" plan
on that particular table.
What I want is that page
http://www.postgresql.org/docs/8.4/static/routine-vacuuming.html
to start with "Here is how you know if you are vacuuming enough..."
In an ideal world one would like some meter in a statistics table
or similar that returns a percentage: 100% means just enough, 50%
means you have to double your effort, 150% means 50% too much
(i.e. wasted)... But I could do with a boolean as well. A complicated
extension, or the recommendation to install 3 different extensions,
would be better than what is there right now, but only very barely.
Of course a meter wouldn't tell you whether you would still keep up
if traffic doubled; for that you need a complicated calculation (or
you just keep looking at the meter and adjusting).
But at the moment there is no such meter (at least I don't know
of it) and that is the actual problem.
My 2cents,
Bene