Checkpoint throttling issues

Started by Andres Freundabout 10 years ago6 messages

andres@anarazel.de

about 10 years ago

Hi,

working on the checkpoint sorting/flushing patch I noticed a number of
preexisting issues:

1) The progress passed to CheckpointWriteDelay() will often be wrong -
it's calculated as num_written / num_to_write, but num_written is only
incremented if the buffer hasn't since independently been written
out. That's bad because it mean's we'll think we're further and
further behind if there's independent writeout activity.

Simple enough to fix, we gotta split num_written into num_written
(for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm a slightly worried about
backpatching a fix because it can have a rather noticeable
behavioural impact.

2) CheckpointWriteDelay()'s handling of config file changes et al is
pretty weird: The config is reloaded during a checkpoing iff it's not
an immediate checkpoint and we're on schedule. I see very little
justification for having the got_SIGHUP block inside that if block.

3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad
idea.

On a system with low write activity (< 10 writes sec) this will lead
to checkpoints finishing too early as there'll always be around ~10
writes a second. On slow IO, say a sdcard, that's bad.

On system with a very high write throughput (say 1k+ buffers/sec) the
unconditional 100ms sleep while on schedule will mean we're far
behind after the sleep. This leads to rather bursty IO, and makes it
more likely to run into dirty buffer limits and such. Just reducing
the sleep from 100ms to 10ms leads to significant latency
improvements. On pgbench rate limited to 5k tps:

before:
number of transactions skipped: 70352 (1.408 %)
number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
latency average: 5.266 ms
latency stddev: 11.723 ms
rate limit schedule lag: avg 3.138 (max 0.000) ms

after:
number of transactions skipped: 41696 (0.834 %)
number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
latency average: 4.929 ms
latency stddev: 8.929 ms
rate limit schedule lag: avg 2.835 (max 0.000) ms

I think the sleep time should be computed adaptively based on the
number of buffers remaining and the remaining time. There's probably
better formulations, but that seems like an easy enough improvement
and considerably better than now.

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fabien COELHO

coelho@cri.ensmp.fr

about 10 years ago

In reply to: Andres Freund (#1)

Re: Checkpoint throttling issues

preexisting issues:

1) The progress passed to CheckpointWriteDelay() will often be wrong -
it's calculated as num_written / num_to_write, but num_written is only
incremented if the buffer hasn't since independently been written
out. That's bad because it mean's we'll think we're further and
further behind if there's independent writeout activity.

Simple enough to fix, we gotta split num_written into num_written
(for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm a slightly worried about
backpatching a fix because it can have a rather noticeable
behavioural impact.

I noticed this discrepancy, but decided agains doing anything about it,
because it meant more arguing and explaning. In the pgbench run I used the
case does not seem to arise often because most I/O is performed through
the checkpointer, so it does not matter much.

Another discrepancy is that there are two progress computations, one
against time and one against wal buffer, and when one is used and why it
is better than the other is not too clear to me.

2) CheckpointWriteDelay()'s handling of config file changes et al is
pretty weird: The config is reloaded during a checkpoing iff it's not
an immediate checkpoint and we're on schedule. I see very little
justification for having the got_SIGHUP block inside that if block.

No opinion.

3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad
idea.

On a system with low write activity (< 10 writes sec) this will lead
to checkpoints finishing too early as there'll always be around ~10
writes a second. On slow IO, say a sdcard, that's bad.

On system with a very high write throughput (say 1k+ buffers/sec) the
unconditional 100ms sleep while on schedule will mean we're far
behind after the sleep. This leads to rather bursty IO, and makes it
more likely to run into dirty buffer limits and such. Just reducing
the sleep from 100ms to 10ms leads to significant latency
improvements. On pgbench rate limited to 5k tps:

before:
number of transactions skipped: 70352 (1.408 %)
number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
latency average: 5.266 ms
latency stddev: 11.723 ms
rate limit schedule lag: avg 3.138 (max 0.000) ms

after:
number of transactions skipped: 41696 (0.834 %)
number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
latency average: 4.929 ms
latency stddev: 8.929 ms
rate limit schedule lag: avg 2.835 (max 0.000) ms

Do these figures include sorting & flushing ?

I think the sleep time should be computed adaptively based on the
number of buffers remaining and the remaining time. There's probably
better formulations, but that seems like an easy enough improvement
and considerably better than now.

One this run. If there are few I/Os then the 1 0ms will result in more
random/discontinuous I/Os, because there would be only a few writes in
each bucket, so I'm not sure whether this would be an unconditionnal
improvement. I would be interested to have figures *with* sorting and
flushing on and several kind of run.

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

No opinion!

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alvaro Herrera

alvherre@2ndquadrant.com

about 10 years ago

In reply to: Fabien COELHO (#2)

Re: Checkpoint throttling issues

Fabien COELHO wrote:

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

No opinion!

My guess here, without looking, is that this was based on the idea of
"oops, we're late here for the checkpoint, let's do as less work as
possible to avoid getting even later", and thus skip some "unnecessary"
tasks. I would vote for fixing this (and the SIGHUP issue too). I'm
not sure how bad a behavioral change this is; IMO these fixes should be
backpatched all the way back.

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Amit Kapila

amit.kapila16@gmail.com

about 10 years ago

In reply to: Alvaro Herrera (#3)

Re: Checkpoint throttling issues

On Tue, Oct 20, 2015 at 1:05 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Fabien COELHO wrote:

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

No opinion!

My guess here, without looking, is that this was based on the idea of
"oops, we're late here for the checkpoint, let's do as less work as
possible to avoid getting even later", and thus skip some "unnecessary"
tasks.

I also think thats what could be the reason for writing the code that way,
it seems to me that processing config file when not on schedule has
an advantage such that if user has modified checkpoint related parameters
(like checkpoint_completion_target), that could improve the checkpoint
behaviour. Now there could be some negative impact of this change as
well if user changes full_page_writes parameter, as to process that, it
needs to write a WAL record which could be costly when the checkpoint is
not on schedule, but I think the odds of that are less, so there doesn't
seem to be any harm in changing this behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Andres Freund (#1)

Re: Checkpoint throttling issues

On Mon, Oct 19, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:

1) The progress passed to CheckpointWriteDelay() will often be wrong -
it's calculated as num_written / num_to_write, but num_written is only
incremented if the buffer hasn't since independently been written
out. That's bad because it mean's we'll think we're further and
further behind if there's independent writeout activity.

Simple enough to fix, we gotta split num_written into num_written
(for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm a slightly worried about
backpatching a fix because it can have a rather noticeable
behavioural impact.

I think this is an algorithmic improvement, not a bug fix. Actually,
I don't really think any of these things are bugs, properly considered
- they all look pretty intentional to me, even if we no longer agree
with the reasoning. Maybe some of them could be back-patched anyway,
but at any rate I definitely wouldn't backpatch this or #3, because
even though changing this is probably better on the average, it's hard
to be sure that it won't be worse for somebody. In the back-branches,
I think stability takes priority over improvements.

I think the sleep time should be computed adaptively based on the
number of buffers remaining and the remaining time. There's probably
better formulations, but that seems like an easy enough improvement
and considerably better than now.

One thing to keep in mind here is that somebody did work a few years
ago to reduce the number of wake-ups per second that PostgreSQL
generates when idle. Now obviously getting the checkpointing behavior
correct is more important, and obviously also the system is not idle
if we're checkpointing, but it's something to keep in mind. I like
the idea of an adaptive sleep time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 10 years ago

In reply to: Robert Haas (#5)

Re: Checkpoint throttling issues

On 2015-10-22 09:52:25 -0400, Robert Haas wrote:

On Mon, Oct 19, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:

1) The progress passed to CheckpointWriteDelay() will often be wrong -
it's calculated as num_written / num_to_write, but num_written is only
incremented if the buffer hasn't since independently been written
out. That's bad because it mean's we'll think we're further and
further behind if there's independent writeout activity.

Simple enough to fix, we gotta split num_written into num_written
(for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm a slightly worried about
backpatching a fix because it can have a rather noticeable
behavioural impact.

I think this is an algorithmic improvement, not a bug fix.

progress, passed to CheckpointWriteDelay, is defined as:
* 'progress' is an estimate of how much of the work has been done, as a
* fraction between 0.0 meaning none, and 1.0 meaning all done.

passing a potentially wildly inaccurate value (like close to 0, because
most buffers have since been displaced) seems more like a bug than an
algorithmic improvement... I can't see how that could have been
intentional.

But I'm now wild on backpatching these, so it's probably a moot
distinction.

I think the sleep time should be computed adaptively based on the
number of buffers remaining and the remaining time. There's probably
better formulations, but that seems like an easy enough improvement
and considerably better than now.

One thing to keep in mind here is that somebody did work a few years
ago to reduce the number of wake-ups per second that PostgreSQL
generates when idle.

As far as I can see checkpointing hasn't been touched in the course of
that work. For power usage it's probably not bad to complete checkpoints
faster than necessary and then sleep for longer :/. So it might be good
to have an upper limit to the sleep time.

I like the idea of an adaptive sleep time.

Good. I do too ;)

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers