Cost limited statements RFC
I'm working on a new project here that I wanted to announce, just to
keep from duplicating effort in this area. I've started to add a cost
limit delay for regular statements. The idea is that you set a new
statement_cost_delay setting before running something, and it will
restrict total resources the same way autovacuum does. I'll be happy
with it when it's good enough to throttle I/O on SELECT and CREATE INDEX
CONCURRENTLY.
Modifying the buffer manager to account for statement-based cost
accumulation isn't difficult. The tricky part here is finding the right
spot to put the delay at. In the vacuum case, it's easy to insert a
call to check for a delay after every block of I/O. It should be
possible to find a single or small number of spots to put a delay check
in the executor. But I expect that every utility command may need to be
modified individually to find a useful delay point. This is starting to
remind me of the SEPostgres refactoring, because all of the per-command
uniqueness ends up requiring a lot of work to modify in a unified way.
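To make the idea concrete, here is a rough sketch of what such a delay point might compute, modeled on vacuum's cost accounting. Every name here (statement_delay_point, the statement_cost_* variables) is hypothetical; the actual patch may look quite different:

```c
#include <assert.h>

/* Hypothetical per-backend state, mirroring vacuum's VacuumCostBalance,
 * VacuumCostLimit, and VacuumCostDelay. */
static int statement_cost_balance = 0;
static int statement_cost_limit = 200;  /* assumed default, as for vacuum */
static int statement_cost_delay = 20;   /* ms; 0 disables throttling */

/*
 * Charge the cost of some just-completed work and report how many
 * milliseconds the backend should now sleep (0 = keep going).  A real
 * delay point would also pg_usleep() and CHECK_FOR_INTERRUPTS() here;
 * this sketch only computes the sleep so the logic is easy to follow.
 */
static int
statement_delay_point(int cost_incurred)
{
    int msec;

    if (statement_cost_delay <= 0)
        return 0;               /* throttling disabled */
    statement_cost_balance += cost_incurred;
    if (statement_cost_balance < statement_cost_limit)
        return 0;
    /* Sleep in proportion to the overshoot, then reset, as vacuum does */
    msec = statement_cost_delay * statement_cost_balance / statement_cost_limit;
    statement_cost_balance = 0;
    return msec;
}
```

The hard part isn't this function; it's deciding where in the executor and in each utility command to call it.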
The main unintended consequences issue I've found so far is when a cost
delayed statement holds a heavy lock. Autovacuum has some protection
against letting processes with an exclusive lock on a table go to sleep.
It won't be easy to do that with arbitrary statements. There's a
certain amount of allowing the user to shoot themselves in the foot here
that will be time consuming (if not impossible) to eliminate. The
person who runs an exclusive CLUSTER that's limited by
statement_cost_delay may suffer from holding the lock too long. But
that might be their intention with setting the value. Hard to idiot
proof this without eliminating useful options too.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, May 23, 2013 at 8:27 PM, Greg Smith <greg@2ndquadrant.com> wrote:
The main unintended consequences issue I've found so far is when a cost
delayed statement holds a heavy lock. Autovacuum has some protection
against letting processes with an exclusive lock on a table go to sleep. It
won't be easy to do that with arbitrary statements. There's a certain
amount of allowing the user to shoot themselves in the foot here that will
be time consuming (if not impossible) to eliminate. The person who runs an
exclusive CLUSTER that's limited by statement_cost_delay may suffer from
holding the lock too long. But that might be their intention with setting
the value. Hard to idiot proof this without eliminating useful options too.
Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.
That way, if there's a "delay deadlock", the delay gets out of the way.
On 5/23/13 7:34 PM, Claudio Freire wrote:
Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.
The test run by commit_delay is way too heavy to run after every block
is processed. That code is only hit when there's a commit, which
already assumes a lot of overhead is going on--the disk flush to WAL--so
burning some processing/lock acquisition time isn't a big deal. The
spot where statement delay is going is so performance sensitive that
everything it touches needs to be local to the backend.
For finding cost delayed statements that are causing trouble because
they are holding locks, the only place I've thought of that runs
infrequently and is poking at the right data is the deadlock detector.
Turning that into a more general mechanism for finding priority
inversion issues is an interesting idea. It's a bit down the road from
what I'm staring at now though.
On Thu, May 23, 2013 at 8:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 5/23/13 7:34 PM, Claudio Freire wrote:
Why not make the delay conditional on the amount of concurrency, kinda
like the commit_delay? Although in this case, it should only count
unwaiting connections.
The test run by commit_delay is way too heavy to run after every block is
processed. That code is only hit when there's a commit, which already
assumes a lot of overhead is going on--the disk flush to WAL--so burning
some processing/lock acquisition time isn't a big deal. The spot where
statement delay is going is so performance sensitive that everything it
touches needs to be local to the backend.
Besides the obvious option of making a lighter check (doesn't have
to be 100% precise), wouldn't this check be done when it would
otherwise sleep? Is it so heavy still in that context?
On 5/23/13 7:56 PM, Claudio Freire wrote:
Besides the obvious option of making a lighter check (doesn't have
to be 100% precise), wouldn't this check be done when it would
otherwise sleep? Is it so heavy still in that context?
A commit to typical 7200RPM disk is about 10ms, while
autovacuum_vacuum_cost_delay is 20ms. If the statement cost limit logic
were no more complicated than commit_delay, it would be feasible to do
something similar each time a statement was being put to sleep.
I suspect that the cheapest useful thing will be more expensive than
commit_delay's test. That's a guess though. I'll have to think about
this more when I circle back toward usability. Thanks for the
implementation idea.
On Thu, May 23, 2013 at 7:27 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I'm working on a new project here that I wanted to announce, just to keep
from duplicating effort in this area. I've started to add a cost limit
delay for regular statements. The idea is that you set a new
statement_cost_delay setting before running something, and it will restrict
total resources the same way autovacuum does. I'll be happy with it when
it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.
Cool. We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>. This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
"blended" measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.
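As a sketch of what a kB/s-style interface might boil down to underneath (the helper name and shape are invented here, not anything from an actual patch): each check compares the I/O actually done against the time it "should" have taken at the configured rate.

```c
#include <assert.h>

/*
 * Hypothetical helper behind a kB/s-style limit: given how many kB of
 * I/O were done since the last check and how many ms have elapsed,
 * return the ms of sleep needed so the average rate stays at or under
 * rate_limit_kbps.  A non-positive limit means "unlimited".
 */
static int
rate_limit_sleep_ms(int kb_done, int elapsed_ms, int rate_limit_kbps)
{
    int budget_ms;

    if (rate_limit_kbps <= 0)
        return 0;
    /* How long this much I/O "should" have taken at the configured rate */
    budget_ms = kb_done * 1000 / rate_limit_kbps;
    return (budget_ms > elapsed_ms) ? budget_ms - elapsed_ms : 0;
}
```

For example, 1024 kB done in 100ms under a 4096 kB/s limit has a 250ms budget, so the backend would sleep for the remaining 150ms.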
Modifying the buffer manager to account for statement-based cost
accumulation isn't difficult. The tricky part here is finding the right
spot to put the delay at. In the vacuum case, it's easy to insert a call to
check for a delay after every block of I/O. It should be possible to find a
single or small number of spots to put a delay check in the executor. But I
expect that every utility command may need to be modified individually to
find a useful delay point. This is starting to remind me of the SEPostgres
refactoring, because all of the per-command uniqueness ends up requiring a
lot of work to modify in a unified way.
I haven't looked at this in detail, but I would hope it's not that
bad. For one thing, many DDL commands don't do any significant I/O in
the first place and so can probably be disregarded. Those that do are
mostly things that rewrite the table and things that build indexes. I
doubt there are more than 3 or 4 code paths to patch.
The main unintended consequences issue I've found so far is when a cost
delayed statement holds a heavy lock. Autovacuum has some protection
against letting processes with an exclusive lock on a table go to sleep. It
won't be easy to do that with arbitrary statements. There's a certain
amount of allowing the user to shoot themselves in the foot here that will
be time consuming (if not impossible) to eliminate. The person who runs an
exclusive CLUSTER that's limited by statement_cost_delay may suffer from
holding the lock too long. But that might be their intention with setting
the value. Hard to idiot proof this without eliminating useful options too.
Well, we *could* have a system where, if someone blocks waiting for a
lock held by a rate-limited process, the rate limits are raised or
abolished. But I'm pretty sure that's a bad idea. I think that the
people who want rate limits want them because allowing too much write
(or maybe read?) activity hoses the performance of the entire system,
and that's not going to be any less true if there are multiple jobs
piling up. Let's say someone has a giant COPY into a huge table, and
CLUSTER blocks behind it, waiting for AccessExclusiveLock. Well...
making the COPY run faster so that we can hurry up and start
CLUSTER-ing seems pretty clearly wrong. We want the COPY to run
slower, and we want the CLUSTER to run slower, too. If we don't want
that, then, as you say, we shouldn't set the GUC in the first place.
Long story short, I'm inclined to define this as expected behavior.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/24/13 8:21 AM, Robert Haas wrote:
On Thu, May 23, 2013 at 7:27 PM, Greg Smith<greg@2ndquadrant.com> wrote:
I'm working on a new project here that I wanted to announce, just to keep
from duplicating effort in this area. I've started to add a cost limit
delay for regular statements. The idea is that you set a new
statement_cost_delay setting before running something, and it will restrict
total resources the same way autovacuum does. I'll be happy with it when
it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.
Cool. We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>. This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
"blended" measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.
Doesn't that hit the old issue of not knowing if a read came from FS cache or disk? I realize that the current cost_delay mechanism suffers from that too, but since the API is lower level that restriction is much more apparent.
Instead of KB/s, could we look at how much time one process is spending waiting on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait time for every request, at least on the most popular OSes?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 5/24/13 10:36 AM, Jim Nasby wrote:
Instead of KB/s, could we look at how much time one process is spending
waiting on IO vs the rest of the cluster? Is it reasonable for us to
measure IO wait time for every request, at least on the most popular OSes?
It's not just an OS specific issue. The overhead of collecting timing
data varies massively based on your hardware, which is why there's the
pg_test_timing tool now to help quantify that.
I have a design I'm working on that exposes the system load to the
database usefully. That's what I think people really want if the goal
is to be adaptive based on what else is going on. My idea is to use
what "uptime" collects as a starting useful set of numbers to quantify
what's going on. If you have both a short term load measurement and a
longer term one like uptime provides, you can quantify both the overall
load and whether it's rising or falling. I want to swipe some ideas on
how moving averages are used to determine trend in stock trading
systems:
http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html
Dynamic load-sensitive statement limits and autovacuum are completely
feasible on UNIX-like systems. The work to insert a cost delay point
needs to get done before building more complicated logic on top of it
though, so I'm starting with this part.
On 5/24/13 9:21 AM, Robert Haas wrote:
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>.
I already made and lost the argument for doing vacuum in KB/s units, so
I wasn't planning on putting that in the way of this one. I still think
it's possible to switch to real world units and simplify all of those
parameters. Maybe I'll get the energy to fight this battle again for
9.4. I do have a lot more tuning data from production deployments to
use as evidence now.
I don't think the UI end changes the bulk of the implementation work
though. The time consuming part of this development is inserting all of
the cost delay hooks and validating they work. Exactly what parameters
and logic fires when they are called can easily be refactored later.
On Fri, May 24, 2013 at 10:36 AM, Jim Nasby <jim@nasby.net> wrote:
Doesn't that hit the old issue of not knowing if a read came from FS cache
or disk? I realize that the current cost_delay mechanism suffers from that
too, but since the API is lower level that restriction is much more
apparent.
Sure, but I think it's still useful despite that limitation.
Instead of KB/s, could we look at how much time one process is spending
waiting on IO vs the rest of the cluster? Is it reasonable for us to measure
IO wait time for every request, at least on the most popular OSes?
I doubt that's going to be very meaningful. The backend that dirties
the buffer is fairly likely to be different from the backend that
writes it out.
On Fri, May 24, 2013 at 11:51 AM, Greg Smith <greg@2ndquadrant.com> wrote:
On 5/24/13 9:21 AM, Robert Haas wrote:
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>.
I already made and lost the argument for doing vacuum in KB/s units, so I
wasn't planning on putting that in the way of this one.
I think the problem is that making that change would force people to
relearn something that was already long established, and it was far from
clear that the improvement, though real, was big enough to justify forcing
people to do that. That objection would not apply to a new feature, as
there would be nothing to re-learn. The other objection was that (at that
time) we had some hope that the entire workings would be redone for 9.3,
and it seemed unfriendly to re-name things in 9.2 without much change in
functionality, and then redo them completely in 9.3.
Cheers,
Jeff
On Thu, Jun 6, 2013 at 3:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, May 24, 2013 at 11:51 AM, Greg Smith <greg@2ndquadrant.com> wrote:
On 5/24/13 9:21 AM, Robert Haas wrote:
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>.
I already made and lost the argument for doing vacuum in KB/s units, so I
wasn't planning on putting that in the way of this one.
I think the problem is that making that change would force people to relearn
something that was already long established, and it was far from clear that
the improvement, though real, was big enough to justify forcing people to do
that. That objection would not apply to a new feature, as there would be
nothing to re-learn. The other objection was that (at that time) we had
some hope that the entire workings would be redone for 9.3, and it seemed
unfriendly to re-name things in 9.2 without much change in functionality,
and then redo them completely in 9.3.
Right. Also, IIRC, the limits didn't really mean what they purported
to mean. You set either a read or a dirty rate in KB/s, but what was
really limited was the combination of the two, and the relative
importance of the two factors was based on other settings in a
severely non-obvious way.
If we can see our way clear to ripping out the autovacuum costing
stuff and replacing them with a read rate limit and a dirty rate
limit, I'd be in favor of that. The current system limits the linear
combination of those with user-specified coefficients, which is more
powerful but less intuitive. If we need that, we'll have to keep it
the way it is, but I'm hoping we don't.
On 2013-06-06 12:34:01 -0700, Jeff Janes wrote:
On Fri, May 24, 2013 at 11:51 AM, Greg Smith <greg@2ndquadrant.com> wrote:
On 5/24/13 9:21 AM, Robert Haas wrote:
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>.
I already made and lost the argument for doing vacuum in KB/s units, so I
wasn't planning on putting that in the way of this one.
I think the problem is that making that change would force people to
relearn something that was already long established, and it was far from
clear that the improvement, though real, was big enough to justify forcing
people to do that.
I don't find that argument very convincing. Since you basically can
translate the current variables into something like the above variables
with some squinting we sure could have come up with some way to keep the
old definition and automatically set the new GUCs and the other way
round. guc.c should even have enough information to prohibit setting
both in the config file...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6/6/13 4:02 PM, Robert Haas wrote:
If we can see our way clear to ripping out the autovacuum costing
stuff and replacing them with a read rate limit and a dirty rate
limit, I'd be in favor of that. The current system limits the linear
combination of those with user-specified coefficients, which is more
powerful but less intuitive.
There is also an implied memory bandwidth limit via the costing for a
hit, which was the constraint keeping me from just going this way last
time this came up. It essentially limits vacuum to 78MB/s of scanning
memory even when there's no disk I/O involved. I wasn't sure if that
was still important; you can also control it now via these coefficients,
and most of the useful disk rate refactorings simplify a lot if that's
gone. The rest of this message is some evidence that it's not worth
keeping though, which leads into a much cleaner plan than I tried to
pitch before.
I can now tell you that a busy server with decent memory can easily chug
through 7.8GB/s of activity against shared_buffers, making the existing
78MB/s limit a pretty tight one. 7.8GB/s of memory access is 1M
buffers/second as measured by pg_stat_database.blks_read. I've attached
a sample showing the highest rate I've seen as evidence of how fast
servers can really go now, from a mainstream 24 Intel cores in 2 sockets
system. Nice hardware, but by no means exotic stuff. And I can hit
500K buffers/s = 4GB/s of memory even with my laptop.
I have also subjected some busy sites to a field test here since the
original discussion, to try and nail down if this is really necessary.
So far I haven't gotten any objections, and I've seen one serious
improvement, after setting vacuum_cost_page_hit to 0. The much improved
server is the one I'm showing here. When a page hit doesn't cost
anything, the new limiter on how fast vacuum can churn through a well
cached relation usually becomes the CPU speed of a single core.
Nowadays, you can peg any single core like that and still not disrupt
the whole server.
If the page hit limit goes away, the user with a single core server who is
used to having autovacuum only pillage shared_buffers at 78MB/s might
complain if it becomes unbounded. I'm not scared of that impacting
any sort of mainstream hardware from the last few years though. I think
you'd have to be targeting PostgreSQL on embedded or weak mobile chips
to even notice the vacuum page hit rate here in 2013. And even if your
database is all in shared_buffers so it's possible to chug through it
non-stop, you're way more likely to suffer from an excess dirty page
write rate than this.
Buying that it's OK to scrap the hit limit leads toward a simple to code
implementation of read/write rate limits implemented like this:
-vacuum_cost_page_* are removed as external GUCs. Maybe the internal
accounting for them stays the same for now, just to keep the number of
changes happening at once easier.
-vacuum_cost_delay becomes an internal parameter fixed at 20ms. That's
worked out OK in the field, there's not a lot of value to a higher
setting, and lower settings are impractical due to the effective 10ms
lower limit on sleeping that some systems have.
-vacuum_cost_limit goes away as an external GUC, and instead the actual
cost limit becomes an internal value computed from the other parameters.
At the default values the value that pops out will still be close to
200. Not messing with that will keep all of the autovacuum worker cost
splitting logic functional.
-New vacuum_read_limit and vacuum_write_limit are added as a kB value
for the per second maximum rate. -1 means unlimited. The pair replaces
changing the cost delay as the parameters that turns cost limiting on or
off.
That's 5 GUCs with complicated setting logic removed, replaced by 2
simple knobs, plus some churn in the autovacuum_* versions. Backwards
compatibility for tuned systems will be shot. My position is that
anyone smart enough to have navigated the existing mess of these
settings and done something useful with them will happily take having
their custom tuning go away, if it's in return for the simplification.
At this point I feel exactly the same way I did about the parameters
removed by the BGW auto-tuning stuff that went away in 8.3: I heard from
zero people missing the old knobs. Another year of experiments and
feedback has convinced me nobody is setting this usefully in the field
who wouldn't prefer the new interface.
On Thu, Jun 6, 2013 at 7:36 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I have also subjected some busy sites to a field test here since the
original discussion, to try and nail down if this is really necessary. So
far I haven't gotten any objections, and I've seen one serious improvement,
after setting vacuum_cost_page_hit to 0. The much improved server is the
one I'm showing here. When a page hit doesn't cost anything, the new
limiter on how fast vacuum can churn through a well cached relation usually
becomes the CPU speed of a single core. Nowadays, you can peg any single
core like that and still not disrupt the whole server.
Check. I have no trouble believing that limit is hurting us more than
it's helping us.
If the page hit limit goes away, the user with a single core server who is
used to having autovacuum only pillage shared_buffers at 78MB/s might
complain if it becomes unbounded.
Except that it shouldn't become unbounded, because of the ring-buffer
stuff. Vacuum can pillage the OS cache, but the degree to which a
scan of a single relation can pillage shared_buffers should be sharply
limited.
Buying that it's OK to scrap the hit limit leads toward a simple to code
implementation of read/write rate limits implemented like this:
-vacuum_cost_page_* are removed as external GUCs. Maybe the internal
accounting for them stays the same for now, just to keep the number of
changes happening at once easier.
-vacuum_cost_delay becomes an internal parameter fixed at 20ms. That's
worked out OK in the field, there's not a lot of value to a higher setting,
and lower settings are impractical due to the effective 10ms lower limit on
sleeping that some systems have.
-vacuum_cost_limit goes away as an external GUC, and instead the actual cost
limit becomes an internal value computed from the other parameters. At the
default values the value that pops out will still be close to 200. Not
messing with that will keep all of the autovacuum worker cost splitting
logic functional.
I think you're missing my point here, which is is that we shouldn't
have any such things as a "cost limit". We should limit reads and
writes *completely separately*. IMHO, there should be a limit on
reading, and a limit on dirtying data, and those two limits should not
be tied to any common underlying "cost limit". If they are, they will
not actually enforce precisely the set limit, but some other composite
limit which will just be weird.
IOW, we'll need new logic to sleep when we exceed either the limit on
read-rate OR when we exceed the limit on dirty-rate. The existing
smushed-together "cost limit" should just go away entirely.
If you want, I can mock up what I have in mind. I am pretty sure it
won't be very hard.
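As one possible shape of that mock-up (the struct and every name below are invented for illustration, not Robert's actual code): the two budgets are tracked and checked completely independently, with no blended cost anywhere.

```c
#include <assert.h>

/* Track reads and dirtied data in separate buckets. */
typedef struct IoBudget
{
    int read_kb;            /* kB read so far this interval */
    int dirty_kb;           /* kB dirtied so far this interval */
    int read_limit_kb;      /* per-interval budget; 0 = no limit */
    int dirty_limit_kb;     /* per-interval budget; 0 = no limit */
} IoBudget;

/* Sleep when EITHER independent limit is exhausted -- no composite cost */
static int
need_sleep(const IoBudget *b)
{
    if (b->read_limit_kb > 0 && b->read_kb >= b->read_limit_kb)
        return 1;
    if (b->dirty_limit_kb > 0 && b->dirty_kb >= b->dirty_limit_kb)
        return 1;
    return 0;
}
```

Each configured limit is then enforced exactly as set, rather than as one coefficient-weighted combination.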
On 6/7/13 10:14 AM, Robert Haas wrote:
If the page hit limit goes away, the user with a single core server who is
used to having autovacuum only pillage shared_buffers at 78MB/s might
complain if it becomes unbounded.
Except that it shouldn't become unbounded, because of the ring-buffer
stuff. Vacuum can pillage the OS cache, but the degree to which a
scan of a single relation can pillage shared_buffers should be sharply
limited.
I wasn't talking about disruption of the data that's in the buffer
cache. The only time the scenario I was describing plays out is when
the data is already in shared_buffers. The concern is damage done to
the CPU's data cache by this activity. Right now you can't even reach
100MB/s of damage to your CPU caches in an autovacuum process. Ripping
out the page hit cost will eliminate that cap. Autovacuum could
introduce gigabytes per second of memory -> L1 cache transfers. That's
what all my details about memory bandwidth were trying to put into
context. I don't think it really matters much because the new bottleneck
will be the processing speed of a single core, and that's still a decent
cap to most people now.
I think you're missing my point here, which is is that we shouldn't
have any such things as a "cost limit". We should limit reads and
writes *completely separately*. IMHO, there should be a limit on
reading, and a limit on dirtying data, and those two limits should not
be tied to any common underlying "cost limit". If they are, they will
not actually enforce precisely the set limit, but some other composite
limit which will just be weird.
I see the distinction you're making now, don't need a mock up to follow
you. The main challenge of moving this way is that read and write rates
never end up being completely disconnected from one another. A read
will only cost some fraction of what a write does, but they shouldn't be
completely independent.
Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes,
I may not be happy with the server doing 9MB/s read + 5MB/s write = 14MB/s
of I/O in an implementation where they float independently. It's
certainly possible to disconnect the two like that, and people will be
able to work something out anyway. I personally would prefer not to
lose some ability to specify how expensive read and write operations
should be considered in relation to one another.
Related aside: shared_buffers is becoming a decreasing fraction of
total RAM each release, because it's stuck with this rough 8GB limit
right now. As the OS cache becomes a larger multiple of the
shared_buffers size, the expense of the average read is dropping. Reads
are getting more likely to be in the OS cache but not shared_buffers,
which makes the average cost of any one read shrink. But writes are as
expensive as ever.
Real-world tunings I'm doing now reflecting that, typically in servers
with >128GB of RAM, have gone this far in that direction:
vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20
That's 4MB/s of writes, 40MB/s of reads, or some blended mix that
considers writes 10X as expensive as reads. The blend is a feature.
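For anyone checking the arithmetic, those throughput figures fall straight out of the cost budget. A minimal sketch, assuming the stock vacuum_cost_limit = 200 and vacuum_cost_delay = 20ms (neither is stated in the post, so treat both as assumptions) and 8kB pages:

```python
# Back-of-the-envelope check of the rates quoted above. Assumes stock
# vacuum_cost_limit = 200 and vacuum_cost_delay = 20ms (not stated in
# the post) and 8kB pages.

BLOCK_KB = 8
cost_limit = 200        # cost units allowed before sleeping
cost_delay_ms = 20      # sleep length once the budget is spent

budget_per_sec = cost_limit * (1000 / cost_delay_ms)  # 10,000 units/s

def max_rate_mb(page_cost):
    """Maximum sustained MB/s for pages charged page_cost units each."""
    if page_cost == 0:
        return float('inf')  # uncharged, e.g. page_hit = 0 above
    pages_per_sec = budget_per_sec / page_cost
    return pages_per_sec * BLOCK_KB / 1024

print(max_rate_mb(2))   # page_miss = 2: reads top out near 39 MB/s
print(max_rate_mb(20))  # page_dirty = 20: dirtying tops out near 3.9 MB/s
```

A blended workload lands anywhere between those two rates, because reads and dirties draw from the same 10,000 unit/s budget; that shared budget is exactly the 10X write-to-read weighting described above.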
The logic here is starting to remind me of how the random_page_cost
default has been justified. Real-world random reads are actually close
to 50X as expensive as sequential ones. But the average read from the
executor's perspective is effectively discounted by OS cache hits, so
4.0 is still working OK. In large memory servers, random reads keep
getting cheaper via better OS cache hit odds, and it's increasingly
becoming something important to tune for.
Some of this mess would go away if we could crack the shared_buffers
scaling issues for 9.4. There's finally enough dedicated hardware
around to see the issue and work on it, but I haven't gotten a clear
picture of any reproducible test workload that gets slower with large
buffer cache sizes. If anyone has a public test case that gets slower
when shared_buffers goes from 8GB to 16GB, please let me know; I've got
two systems set up where I could chase that down now.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 7, 2013 at 11:35 AM, Greg Smith <greg@2ndquadrant.com> wrote:
I wasn't talking about disruption of the data that's in the buffer cache.
The only time the scenario I was describing plays out is when the data is
already in shared_buffers. The concern is damage done to the CPU's data
cache by this activity. Right now you can't even reach 100MB/s of damage to
your CPU caches in an autovacuum process. Ripping out the page hit cost
will eliminate that cap. Autovacuum could introduce gigabytes per second of
memory -> L1 cache transfers. That's what all my details about memory
bandwidth were trying to put into context. I don't think it really matters
much because the new bottleneck will be the processing speed of a single
core, and that's still a decent cap to most people now.
OK, I see. No objection here; not sure how others feel.
I think you're missing my point here, which is that we shouldn't
have any such thing as a "cost limit". We should limit reads and
writes *completely separately*. IMHO, there should be a limit on
reading, and a limit on dirtying data, and those two limits should not
be tied to any common underlying "cost limit". If they are, they will
not actually enforce precisely the set limit, but some other composite
limit which will just be weird.

I see the distinction you're making now; I don't need a mock-up to follow you.
The main challenge of moving this way is that read and write rates never end
up being completely disconnected from one another. A read will only cost
some fraction of what a write does, but they shouldn't be completely
independent.

Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes, I
may not be happy with the server doing 9MB/s read + 5MB/s write=14MB/s of
I/O in an implementation where they float independently. It's certainly
possible to disconnect the two like that, and people will be able to work
something out anyway. I personally would prefer not to lose some ability to
specify how expensive read and write operations should be considered in
relation to one another.
OK. I was hoping that wasn't a distinction that we needed to
preserve, but if it is, it is.
The trouble, though, is that I think it makes it hard to structure the
GUCs in terms of units that are meaningful to the user. One could
have something like io_rate_limit (measured in MB/s),
io_read_multiplier = 1.0, io_dirty_multiplier = 1.0, and I think that
would be reasonably clear. By default io_rate_limit would govern the
sum of read activity and dirtying activity, but you could overweight
or underweight either of those two things by adjusting the multiplier.
That's not a huge improvement in clarity, though, especially if the
default values aren't anywhere close to 1.0.
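To make the composite behavior concrete, here is a minimal sketch of how that scheme might charge work against the cap. These GUC names come from the proposal above and do not exist in PostgreSQL:

```python
# Hypothetical sketch of the proposed scheme: one MB/s cap governing a
# weighted sum of read and dirty traffic. io_rate_limit and the two
# multipliers are names from the proposal, not real PostgreSQL GUCs.

io_rate_limit = 10.0        # MB/s cap on the weighted total
io_read_multiplier = 1.0    # weight applied to read traffic
io_dirty_multiplier = 1.0   # weight applied to dirtied pages

def blended_rate(read_mb_s, dirty_mb_s):
    """Weighted I/O rate charged against io_rate_limit."""
    return (read_mb_s * io_read_multiplier
            + dirty_mb_s * io_dirty_multiplier)

def over_limit(read_mb_s, dirty_mb_s):
    """True once the blended rate exceeds the configured cap."""
    return blended_rate(read_mb_s, dirty_mb_s) > io_rate_limit

# With both multipliers at 1.0, 9 MB/s of reads + 5 MB/s of dirties
# (14 MB/s blended) would be throttled under a 10 MB/s cap.
```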
If the limits aren't independent, I really *don't* think it's OK to
name them as if they are. That just seems like a POLA violation.
Related aside: shared_buffers is becoming a decreasing fraction of total
RAM each release, because it's stuck with this rough 8GB limit right now.
As the OS cache becomes a larger multiple of the shared_buffers size, the
expense of the average read is dropping. Reads are getting more likely to
be in the OS cache but not shared_buffers, which makes the average cost of
any one read shrink. But writes are as expensive as ever.

Real-world tunings I'm doing now reflect that; on servers with >128GB of
RAM, they've gone this far in that direction:
vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20

That's 4MB/s of writes, 40MB/s of reads, or some blended mix that considers
writes 10X as expensive as reads. The blend is a feature.
Fair enough, but note that limiting the two things independently, to
4MB/s and 40MB/s, would not be significantly different. If the
workload is all reads or all writes, it won't be different at all.
The biggest difference would be when many or all writes also require
reads, in which case the write rate would drop from 4MB/s to perhaps as
low as 3.6MB/s. That's not a big difference.
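That 3.6MB/s figure checks out under the blended accounting, where a write that also needs a read is charged both page_miss and page_dirty. A quick sketch, again assuming a 10,000 unit/s budget (vacuum_cost_limit = 200, vacuum_cost_delay = 20ms, both assumptions) and 8kB pages:

```python
# Verifying the write-rate drop above: under blended accounting, a write
# that also requires a read costs page_miss + page_dirty = 2 + 20 = 22
# units instead of 20. The budget assumes stock vacuum_cost_limit = 200
# and vacuum_cost_delay = 20ms, which the thread doesn't state.

budget_per_sec = 200 * (1000 / 20)   # 10,000 cost units per second

write_only = budget_per_sec / 20 * 8 / 1024           # ~3.9 MB/s
write_plus_read = budget_per_sec / (2 + 20) * 8 / 1024  # ~3.55 MB/s

print(write_only, write_plus_read)
```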
In general, the benefits of the current system are greatest when the
costs of reads and writes are similar. If reads and writes have equal
cost, it's clearly very important to have a blended cost. But the
more the cost of writes dominates the costs of reads, the less it
really matters. It sounds like we're already well on the way to a
situation where only the write cost really matters most of the time -
except for large scans that read a lot of data without changing it,
when only the read cost will matter.
I'm not really questioning your conclusion that we need to keep the
blended limit. I just want to make sure we're keeping it for a good
reason, because I think it increases the user-perceived complexity
here quite a bit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/7/13 12:42 PM, Robert Haas wrote:
GUCs in terms of units that are meaningful to the user. One could
have something like io_rate_limit (measured in MB/s),
io_read_multiplier = 1.0, io_dirty_multiplier = 1.0, and I think that
would be reasonably clear.
There's one other way to frame this:
io_read_limit = 7.8MB/s # Maximum read rate
io_dirty_multiplier = 2.0 # How expensive writes are considered
relative to reads
That still gives all of the behavior I'd like to preserve, as well as
not changing the default I/O pattern. I don't think it's too
complicated to ask people to grapple with that pair.
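A sketch of how that pair would meter work, with the caveat that io_read_limit and io_dirty_multiplier are names from this proposal, not real settings:

```python
# Sketch of the proposed pair: a single read-denominated budget, with
# dirtied data charged at a multiple. Both names come from the proposal
# above and are not real PostgreSQL GUCs.

io_read_limit = 7.8         # MB/s of read-equivalent budget
io_dirty_multiplier = 2.0   # each dirtied MB counts as 2 MB of reads

def charged_rate(read_mb_s, dirty_mb_s):
    """Read-equivalent MB/s charged against io_read_limit."""
    return read_mb_s + dirty_mb_s * io_dirty_multiplier

def over_limit(read_mb_s, dirty_mb_s):
    """True once the read-equivalent rate exceeds the budget."""
    return charged_rate(read_mb_s, dirty_mb_s) > io_read_limit

# 3.8 MB/s of reads plus 2.0 MB/s of dirtying fills the 7.8 budget, so
# the blend between reads and writes is still expressible with two knobs.
```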
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, Jun 7, 2013 at 12:55 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 6/7/13 12:42 PM, Robert Haas wrote:
GUCs in terms of units that are meaningful to the user. One could
have something like io_rate_limit (measured in MB/s),
io_read_multiplier = 1.0, io_dirty_multiplier = 1.0, and I think that
would be reasonably clear.

There's one other way to frame this:
io_read_limit = 7.8MB/s # Maximum read rate
io_dirty_multiplier = 2.0  # How expensive writes are considered relative
to reads

That still gives all of the behavior I'd like to preserve, as well as not
changing the default I/O pattern. I don't think it's too complicated to ask
people to grapple with that pair.
That's unsatisfying to me because the io_read_limit is not really an
io_read_limit at all. It is some kind of combined limit, but the name
doesn't indicate that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 6, 2013 at 2:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-06-06 12:34:01 -0700, Jeff Janes wrote:
On Fri, May 24, 2013 at 11:51 AM, Greg Smith <greg@2ndquadrant.com>
wrote:
On 5/24/13 9:21 AM, Robert Haas wrote:
But I wonder if we wouldn't be better off coming up with a little more
user-friendly API. Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = <amount of data you can dirty per
second, in kB> and read_rate_limit = <amount of data you can read into
shared buffers per second, in kB>.

I already made and lost the argument for doing vacuum in KB/s units, so I
wasn't planning on putting that in the way of this one.
I think the problem is that making that change would force people to
relearn something that was already long established, and it was far from
clear that the improvement, though real, was big enough to justify forcing
people to do that.
I don't find that argument very convincing. Since you basically can
translate the current variables into something like the above variables
with some squinting we sure could have come up with some way to keep the
old definition and automatically set the new GUCs and the other way
round.
That may be, but it was not what the patch that was submitted did. And I
don't think the author or the reviewers were eager to put in the effort to
make that change, which would surely be quite a bit more work than the
original patch was in the first place. Also, I'm not sure that such
complexity would even be welcomed. It sounds like an ongoing maintenance
cost, and I'm sure the word "baroque" would get thrown around.
Anyway, I don't think that resistance to making user visible changes to old
features should inhibit us from incorporating lessons from them into new
features.
guc.c should even have enough information to prohibit setting
both in the config file...
Is there precedent/infrastructure for things like that? I could see uses
for mutually exclusive sets of configuration variables, but I wouldn't
even know where to start in implementing such.
Cheers,
Jeff