postgresql latency & bgwriter not doing its job
Hello pgdevs,
I've been playing with pg for some time now to try to reduce the maximum
latency of simple requests, to have a responsive server under small to
medium load.
On an old computer with a software RAID5 HDD attached, pgbench
simple update script run for some time (scale 100, fillfactor 95)
pgbench -M prepared -N -c 2 -T 500 -P 1 ...
gives 300 tps. However this performance is really +1000 tps for a few
seconds followed by 16 seconds at about 0 tps for the checkpoint induced
IO storm. The server is totally unresponsive 75% of the time. That's
bandwidth optimization for you. Hmmm... why not.
Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
max):
pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
The same thing more or less happens in a delayed fashion... You get 50 tps
for some time, followed by sections of 15 seconds at 0 tps for the
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).
It is not satisfying: pg should be able to handle that load easily.
The culprit I found is "bgwriter", which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, "pg_stat_bgwriter" says:
buffers_checkpoint = 19046
buffers_clean = 2995
This suggests that bgwriter took it upon itself to write only 6 pages per
second, where at least 50 would have been needed for the load, and could
have been handled by the hardware without any problem.
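For what it is worth, the 6 pages per second figure falls straight out of the counters above; a trivial sanity check:

```python
# Derive bgwriter's effective write rate from the counters quoted above.
buffers_clean = 2995   # pages written by bgwriter over the run
duration_s = 500       # length of the throttled test
load_pages_s = 50      # roughly one page dirtied per transaction at 50 tps

bgwriter_rate = buffers_clean / duration_s
print(f"bgwriter: {bgwriter_rate:.1f} pages/s, load: {load_pages_s} pages/s")
```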
I have not found any means to force bgwriter to send writes when it can.
(Well, I have: create a process which sends "CHECKPOINT" every 0.2
seconds... it works more or less, but this is not my point:-)
The bgwriter control parameters allow controlling the maximum number of
pages (bgwriter_lru_maxpages) written per round (bgwriter_delay), and a
multiplier (bgwriter_lru_multiplier) which drives a heuristic estimating
how many pages will soon be needed, so as to make them available. This may
be nice in some settings, but it is not adapted to the write-oriented OLTP
load tested with pgbench.
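The heuristic can be pictured with a small model (a simplification of what BgBufferSync in bufmgr.c does, with hypothetical inputs; not the actual code):

```python
# Simplified model of how the bgwriter decides how many buffers to clean
# each round (one round per bgwriter_delay milliseconds).
def pages_to_clean(recent_alloc_avg, reusable_ahead,
                   lru_multiplier=2.0, lru_maxpages=100):
    """recent_alloc_avg: smoothed rate of new-buffer allocations per round;
    reusable_ahead: clean usagecount==0 buffers already ahead of the
    clock-sweep point."""
    upcoming_alloc_est = recent_alloc_avg * lru_multiplier
    needed = upcoming_alloc_est - reusable_ahead
    return max(0, min(int(needed), lru_maxpages))

# With a fits-in-memory update load, few new buffers are allocated, so the
# estimate stays near zero and almost nothing gets written:
print(pages_to_clean(recent_alloc_avg=3, reusable_ahead=10))  # 0
```

This is why the knobs only bound the write rate from above: there is no floor, so a load that dirties pages without allocating new ones leaves the bgwriter idle.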
The problem is that with the update load on a database that fits in memory
there is not much need for "new" pages, even though pages are being
dirtied (about 50 per second), so the heuristic decides
not to write much. The net result of all this cleverness is that when the
checkpoint arrives, several thousand pages have to be written and the
server is offline for some time.
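The scale is easy to estimate (rough figures from this thread):

```python
# Rough accumulation estimate for this load between two checkpoints.
dirty_rate = 50   # pages dirtied per second (one per transaction)
interval_s = 120  # seconds between xlog-triggered checkpoints here
page_kb = 8

pages = dirty_rate * interval_s
print(pages, "pages,", pages * page_kb // 1024, "MiB")  # 6000 pages, 46 MiB
```

The ~6000 pages is consistent with the 5700-6250 buffers per checkpoint reported in the logs further down the thread.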
ISTM that bgwriter lacks at least some "min pages" setting, so that it
could be induced to write this many pages per round when it can. That
would be a start.
A better feature would be that it adapts itself to take advantage of the
available IOPS, depending on the load induced by other tasks (vacuum,
queries...), in a preventive manner, so as to avoid delaying what can be
done right now under a small load, and thus avoid later IO storms. This
would suggest that some setting would provide the expected IOPS capability
of the underlying hardware, as some already suggest the expected available
memory.
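As a hypothetical sketch of that idea (none of these knobs exist in pg; the names and numbers are made up to illustrate the proposal):

```python
# Hypothetical: give the background writer an IOPS budget and let it spend
# whatever other tasks leave unused, draining the dirty backlog early.
def preventive_writes(dirty_backlog, hw_iops_capacity, other_io_load):
    """Return how many dirty pages to write this second."""
    spare = max(0, hw_iops_capacity - other_io_load)
    return min(dirty_backlog, spare)

# Under light load most capacity is spare, so the backlog drains steadily
# instead of piling up for the next checkpoint:
print(preventive_writes(dirty_backlog=6000, hw_iops_capacity=200,
                        other_io_load=50))  # 150
```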
Note that this preventive approach could also improve the "bandwidth"
measure: currently when pgbench is running at maximum speed before the
checkpoint storm, nothing is written to disk but WAL, although it could
probably also write some dirty pages. When the checkpoint arrives, fewer
pages would need to be written, so the storm would be shorter.
Any thoughts on this latency issue? Am I missing something?
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/25/2014 01:23 PM, Fabien COELHO wrote:
Hello pgdevs,
I've been playing with pg for some time now to try to reduce the maximum
latency of simple requests, to have a responsive server under small to
medium load.
On an old computer with a software RAID5 HDD attached, pgbench simple
update script run for some time (scale 100, fillfactor 95)
pgbench -M prepared -N -c 2 -T 500 -P 1 ...
gives 300 tps. However this performance is really +1000 tps for a few
seconds followed by 16 seconds at about 0 tps for the checkpoint induced
IO storm. The server is totally unresponsive 75% of the time. That's
bandwidth optimization for you. Hmmm... why not.
So I think that you're confusing the roles of bgwriter vs. spread
checkpoint. What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+). In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.
Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hi,
On 2014-08-25 22:23:40 +0200, Fabien COELHO wrote:
seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
storm. The server is totally unresponsive 75% of the time. That's bandwidth
optimization for you. Hmmm... why not.
Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
max):
pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
The same thing more or less happens in a delayed fashion... You get 50 tps
for some time, followed by sections of 15 seconds at 0 tps for the
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).
That's ext4 I guess? Did you check whether xfs yields a, err, more
predictable performance?
It is not satisfying, pg should be able to handle that load easily.
The culprit I found is "bgwriter", which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink.
While I agree that the current bgwriter implementation is far from good,
note that this isn't the bgwriter's job. Its goal is to keep backends
from having to write out buffers themselves, i.e. to ensure there are
clean victim buffers when shared_buffers < working set.
Note that it would *not* be a good idea to make the bgwriter write out
everything, as much as possible - that'd turn sequential write io into
random write io.
Greetings,
Andres Freund
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
The culprit I found is "bgwriter", which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, "pg_stat_bgwriter" says:
Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?
You don't want your steady state stats polluted by the bulk load.
buffers_checkpoint = 19046
buffers_clean = 2995
Out of curiosity, what does buffers_backend show?
In any event, this almost certainly is a red herring. Whichever of the
three ways is being used to write out the buffers, it is the checkpointer
that is responsible for fsyncing them, and that is where your drama is
almost certainly occurring. Writing out with one path rather than
another isn't going to change things, unless you change the fsync.
Also, are you familiar with checkpoint_completion_target, and what is it
set to?
Cheers,
Jeff
On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello pgdevs,
I've been playing with pg for some time now to try to reduce the maximum
latency of simple requests, to have a responsive server under small to
medium load.
On an old computer with a software RAID5 HDD attached, pgbench simple
update script run for some time (scale 100, fillfactor 95)
pgbench -M prepared -N -c 2 -T 500 -P 1 ...
gives 300 tps. However this performance is really +1000 tps for a few
seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
storm. The server is totally unresponsive 75% of the time. That's bandwidth
optimization for you. Hmmm... why not.
Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
max):
pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
The same thing more or less happens in a delayed fashion... You get 50
tps for some time, followed by sections of 15 seconds at 0 tps for the
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).
I think another thing to know here is why exactly the checkpoint
storm is causing tps to drop so steeply. One reason could be
that backends might need to write more WAL due to full_page_writes,
another could be contention around the buffer content_lock.
To dig more into the reason, the same tests can be tried
with full_page_writes = off and/or
synchronous_commit = off to see if WAL writes are causing
tps to go down.
Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes, as suggested by Jeff, to see
if that can mitigate the problem.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Josh,
So I think that you're confusing the roles of bgwriter vs. spread
checkpoint. What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+). In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.
I'm using ext4 on debian wheezy with postgresql 9.4b2.
I agree that the OS may be able to help, but this aspect does not
currently work for me at all out of the box. The "all of RAM" is really a
few thousand 8 kB pages written randomly, a few dozen MB.
Also, if pg needs advanced OS tweaking to handle a small load, ISTM that
it fails at simplicity:-(
As for checkpoint spreading, raising checkpoint_completion_target to 0.9
degrades the situation (20% of transactions are more than 200 ms late
instead of 10%, and bgwriter wrote less than 1 page per second, on a 500 s
run). So maybe there is a bug here somewhere.
Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.
Hmmm... This must be balanced with the risk of being offline. Not all
people are interested in throughput at the price of latency, so there
could be settings that help latency, even at the price of reducing
throughput (average tps). After that, it is the administrator choice to
set pg for higher throughput or lower latency.
Note that writing some "least recently used" page multiple times does not
seem to be an issue at all for me under small/medium load, especially as
the system has nothing else to do: if you have nothing else to do, there
is no cost in writing a page, even if you may have to write it again some
time later, and it helps prevent dirty pages from accumulating. So it
seems to me that pg can help; it is not only/merely an OS issue.
--
Fabien.
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I have not found any mean to force bgwriter to send writes when it can.
(Well, I have: create a process which sends "CHECKPOINT" every 0.2
seconds... it works more or less, but this is not my point:-)
There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes. It is currently
fixed, but it should be trivial to turn it into an experimental GUC that
you could use to test your hypothesis.
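The way that setting translates into per-round activity can be sketched roughly (an assumption-laden model of BgBufferSync's minimum scan rate, not the real code; note it bounds buffers *scanned*, not written):

```python
# Rough model: scan_whole_pool_milliseconds forces the bgwriter to sweep
# at least this many buffers per round, so the whole pool is covered in
# the given time. Only dirty usagecount==0 buffers met along the way are
# actually written.
def min_scan_per_round(nbuffers, bgwriter_delay_ms=200,
                       scan_whole_pool_ms=120000):
    return nbuffers * bgwriter_delay_ms // scan_whole_pool_ms

# e.g. 16384 shared buffers (128 MB) at the default 2-minute sweep:
print(min_scan_per_round(16384))  # 27 buffers scanned per round
```

This may explain why shrinking the constant need not produce many more writes: scanning more buffers only helps if the buffers passed over are both dirty and at usagecount 0.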
Cheers,
Jeff
[oops, wrong from, resent...]
Hello Jeff,
The culprit I found is "bgwriter", which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, "pg_stat_bgwriter" says:
Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?
Yes, I did.
You don't want your steady state stats polluted by the bulk load.
Sure!
buffers_checkpoint = 19046
buffers_clean = 2995
Out of curiosity, what does buffers_backend show?
buffers_backend = 157
In any event, this almost certainly is a red herring.
Possibly. It is pretty easy to reproduce, though.
Whichever of the three ways is being used to write out the buffers, it
is the checkpointer that is responsible for fsyncing them, and that is
where your drama is almost certainly occurring. Writing out with one
path rather than a different isn't going to change things, unless you
change the fsync.
Well, I agree partially. ISTM that the OS does not need to wait for fsync
to start writing pages if it is receiving one minute of buffer writes at
50 writes per second; I would have thought that some scheduler would
start handling the flow before fsync... So I thought that if bgwriter were
to write the buffers it would help, but maybe there is a better way.
Also, are you familiar with checkpoint_completion_target, and what is it
set to?
The default 0.5. Moving to 0.9 seems to worsen the situation.
--
Fabien.
Hello Amit,
I think another thing to know here is why exactly checkpoint
storm is causing tps to drop so steeply.
Yep. Actually it is not strictly 0, but a "few" tps that I rounded to 0.
progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms
Transactions are 4.8 seconds behind schedule at this point.
One reason could be that backends might need to write more WAL due
to full_page_writes, another could be contention around the buffer
content_lock. To dig more into the reason, the same tests can be tried
with full_page_writes = off and/or synchronous_commit = off to see
if WAL writes are causing tps to go down.
Given the small flow of updates, I do not think that there should be that
much write contention between WAL & checkpoint.
I tried with "full_page_writes = off" for 500 seconds: same overall
behavior, 8.5% of transactions are stuck (instead of 10%). However, in
detail, pg_stat_bgwriter is quite different:
buffers_checkpoint = 13906
buffers_clean = 20748
buffers_backend = 472
That seems to suggest that bgwriter did some work for once, but that it
did not change the result much in the end. This would imply that my
suggestion to make bgwriter write more would not fix the problem on its own.
With "synchronous_commit = off", the situation is much improved, with only
0.3% of transactions stuck. Not a surprise. However, I would not recommend
that as a solution:-)
Currently, the only way I was able to "solve" the issue while still
writing to disk is to send "CHECKPOINT" every 0.2s, as if I had set
"checkpoint_timeout = 0.2s" (although this is not currently allowed).
Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint_writes as suggested by Jeff as well to see
if that can mitigate the problem.
I had already tried, and retried after Jeff's suggestion. This does not
seem to mitigate anything; on the contrary.
--
Fabien.
Hello again,
I have not found any mean to force bgwriter to send writes when it can.
(Well, I have: create a process which sends "CHECKPOINT" every 0.2
seconds... it works more or less, but this is not my point:-)
There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes. It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.
I recompiled with the variable coldly set to 1000 instead of 120000. The
situation is slightly degraded (15% of transactions were late by more than
200 ms). However it seems that bgwriter did not write many more pages:
buffers_checkpoint = 26065
buffers_clean = 5263
buffers_backend = 367
Or I may have a problem interpreting pg_stat_bgwriter.
It seems that changing this value is not enough to persuade bgwriter to
write more pages. Or I may have done something wrong, but I do not know
what.
--
Fabien.
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
As for checkpoint spreading, raising checkpoint_completion_target to 0.9
degrades the situation (20% of transactions are more than 200 ms late
instead of 10%, bgwriter wrote less than 1 page per second, on a 500s run).
So maybe there is a bug here somewhere.
What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?
Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by "time" and not by "xlog".
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello Andres,
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).
That's ext4 I guess?
Yes!
Did you check whether xfs yields a, err, more predictable performance?
No. I cannot test that easily without reinstalling the box. I did some
quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in
the very same conditions. Maybe I could try again.
[...] Note that it would *not* be a good idea to make the bgwriter write
out everything, as much as possible - that'd turn sequential write io
into random write io.
Hmmm. I'm not sure it would necessarily be the case; it depends on how
bgwriter would choose the pages to write. If they are chosen randomly then
indeed that could be bad. If there is a big sequential write, should not
the backend do the write directly anyway? ISTM that currently checkpoint
is mostly random writes anyway, at least with the OLTP write load of
pgbench. I'm just trying to be able to start them earlier so that they can
be completed quickly.
So although bgwriter is not the solution, ISTM that pg has no reason to
wait for minutes before starting to write dirty pages, if it has nothing
else to do. If the OS does some retention later and cannot spread the
load, as Josh suggests, this could also be a problem, but currently the OS
seems not to have much to write (but WAL) till the checkpoint.
--
Fabien.
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
Did you check whether xfs yields a, err, more predictable performance?
No. I cannot test that easily without reinstalling the box. I did some quick
tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
same conditions. Maybe I could try again.
After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem, and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with prioritization of write vs read IO that's apparently FS dependent.
[...] Note that it would *not* be a good idea to make the bgwriter write
out everything, as much as possible - that'd turn sequential write io into
random write io.
Hmmm. I'm not sure it would necessarily be the case; it depends on how
bgwriter would choose the pages to write. If they are chosen randomly then
indeed that could be bad.
They essentially have to be random to fulfill the bgwriter's role of
reducing the likelihood of a backend having to write out a buffer itself.
Consider how the clock sweep algorithm (not that I am happy with it)
works. When looking for a new victim buffer, all backends scan the buffer
cache in one continuous cycle. If they find a buffer with usagecount==0
they'll use that one and throw away its contents. Otherwise they reduce
the usagecount by 1 and move on. What the bgwriter *tries* to do is to
write out buffers with usagecount==0 that are dirty and will soon be
visited in the clock cycle, to avoid having the backends do that.
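The sweep described above can be sketched as (a toy model, not the bufmgr.c implementation):

```python
# Toy clock-sweep victim search: decay usagecounts until a usagecount==0
# buffer is found; that buffer is recycled (and must be written first by
# the backend if it is still dirty).
def find_victim(buffers, clock_hand):
    """buffers: list of dicts with 'usagecount' and 'dirty' keys."""
    n = len(buffers)
    while True:
        buf = buffers[clock_hand % n]
        if buf["usagecount"] == 0:
            return clock_hand % n       # evict this buffer
        buf["usagecount"] -= 1          # decay and move on
        clock_hand += 1

pool = [{"usagecount": 2, "dirty": True},
        {"usagecount": 0, "dirty": False},
        {"usagecount": 1, "dirty": True}]
print(find_victim(pool, 0))  # 1
```

The bgwriter's cleaning pass effectively runs just ahead of this hand, writing dirty usagecount==0 buffers so the eviction above never stalls on a write.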
If there is a big sequential write, should not the
backend do the write directly anyway? ISTM that currently checkpoint is
mostly random writes anyway, at least with the OLTP write load of pgbench.
I'm just trying to be able to start them earlier so that they can be
completed quickly.
If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.
So although bgwriter is not the solution, ISTM that pg has no reason to wait
for minutes before starting to write dirty pages, if it has nothing else to
do.
That precisely *IS* a spread checkpoint.
If the OS does some retention later and cannot spread the load, as Josh
suggest, this could also be a problem, but currently the OS seems not to
have much to write (but WAL) till the checkpoint.
The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out eagerly enough out of the OS's page
cache. Then, when the final phase of the checkpoint comes, where
relation files need to be fsynced, some filesystems essentially stall
while trying to write out lots and lots of dirty buffers.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?
They simply are the defaults:
checkpoint_segments = 3
checkpoint_timeout = 5min
wal_buffers = -1
I did some tests with checkpoint_segments = 1: the problem is just more
frequent but shorter. I also reduced wal_segsize down to 1 MB, which also
made it even more frequent but much shorter, so the overall result was an
improvement, with 5% to 3% of transactions lost instead of 10-14%, if I
recall correctly. I have found no solution on this path.
Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by "time" and not by "xlog".
I do 500 second tests, so there could be at most 2 timeout-triggered
checkpoints. Given the write load it takes about 2 minutes to fill the 3
16 MB segments (8 kB * 50 tps (there is one page modified per transaction)
* 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are
consistent with the logs (not sure which proves which, though:-):
LOG: received SIGHUP, reloading configuration files
LOG: parameter "log_checkpoints" changed to "on"
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s,
total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s,
total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s,
total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s,
total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s,
total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s
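The xlog-trigger arithmetic above can be cross-checked (assuming, as a rough approximation, ~8 kB of WAL per transaction, dominated by one full page image):

```python
# Cross-check of the fill-rate estimate against the checkpoint log above.
wal_per_txn = 8 * 1024          # bytes; roughly one full page image
tps = 50
segments = 3                    # default checkpoint_segments
segment_bytes = 16 * 1024 * 1024

seconds_to_fill = segments * segment_bytes / (wal_per_txn * tps)
print(round(seconds_to_fill), "s between xlog checkpoints")  # 123 s
```

That matches checkpoints starting roughly every two minutes in the log output.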
Note that my current effective solution is to do as if
"checkpoint_timeout = 0.2s": it works fine if I do my own spreading.
--
Fabien.
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?
They simply are the defaults:
checkpoint_segments = 3
checkpoint_timeout = 5min
wal_buffers = -1
I did some tests with checkpoint_segments = 1: the problem is just more frequent
but shorter. I also reduced wal_segsize down to 1MB, which also made it even
more frequent but much shorter, so the overall result was an improvement
with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
I have found no solution on this path.
Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance you likely will have to make checkpoint_timeout
*longer* and increase checkpoint_segments until *all* checkpoints are
started because of "time".
There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.
b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reductions of *more* than a factor of 4.
c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.
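Reason (b) can be illustrated with a toy model (an assumption-laden sketch, not measured data: the first touch of each page after a checkpoint logs a full 8 kB image, later touches ~100 bytes):

```python
# Toy model of WAL volume vs. checkpoint frequency with full_page_writes.
def wal_volume(txns, distinct_pages, checkpoints,
               fpw_bytes=8192, rec_bytes=100):
    # Each checkpoint resets the "already logged a full image" state,
    # so more checkpoints mean more full page images.
    full_images = min(txns, distinct_pages * checkpoints)
    return full_images * fpw_bytes + (txns - full_images) * rec_bytes

frequent = wal_volume(txns=25000, distinct_pages=5000, checkpoints=5)
rare = wal_volume(txns=25000, distinct_pages=5000, checkpoints=1)
print(f"{frequent / rare:.1f}x more WAL with frequent checkpoints")
```

With these made-up numbers the ratio comes out close to the factor-of-4 reduction mentioned above.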
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance
I do not seek "high performance" per se, I seek "lower maximum latency".
I think that the current settings and parameters are designed for high
throughput, but do not allow controlling the latency even with a small
load.
you likely will have to make checkpoint_timeout *longer* and increase
checkpoint_segments until *all* checkpoints are started because of
"time".
Well, as I want to test a *small* load in a *reasonable* time, I did
not enlarge the number of segments; otherwise it would take ages.
If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
0.9" so that the checkpoints are triggered by the timeout,
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
...
The result is basically the same (well, 18% of transactions lost, but the
results do not seem stable from one run to the next), only there are
more checkpoints.
I fail to understand how multiplying both the segments and the time would
solve the latency problem. If I set 30 segments then it takes 20 minutes
to fill them, and if I put the timeout at 15 min then I'll have to wait 15
minutes to test.
There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.
Why would they be written by the kernel if bgwriter has not sent them??
b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reduction of *more* than a factor of 4.
Sure, I understand that, but ISTM that this test does not exercise this
issue: the load is small, so the full page writes do not matter much.
c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.
I think that what you suggest would only delay the issue, not solve it.
I'll try to run a long test.
--
Fabien.
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoints_segments setting? To
achieve high performance...
I do not seek "high performance" per se, I seek "lower maximum latency".
So?
I think that the current settings and parameters are designed for high
throughput, but do not allow to control the latency even with a small load.
The way you're setting them is tuned for 'basically no write
activity'.
you likely will have to make checkpoint_timeout *longer* and increase
checkpoint_segments until *all* checkpoints are started because of "time".
Well, as I want to test a *small* load in a *reasonable* time, I did not
enlarge the number of segments, otherwise it would take ages.
Well, that way you're testing something basically meaningless. That's
not helpful either.
If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
0.9" so that the checkpoints are triggered by the timeout,LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
...
The result is basically the same (well, 18% of transactions lost, though the
results do not seem stable from one run to the next), only there are more
checkpoints.
With these settings you're fsyncing the entire data directory once a
minute, nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.
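As a side note, on Linux the writeback delay mentioned here is governed by the
vm.dirty_* sysctls: lowering them makes the kernel start flushing dirty pages
sooner, so less accumulates before the checkpoint's fsync()s. The values below
are purely illustrative, not recommendations:

```
# /etc/sysctl.conf -- illustrative values only
vm.dirty_expire_centisecs = 1000      # consider dirty pages for writeback after 10 s (default 30 s)
vm.dirty_writeback_centisecs = 100    # wake the flusher threads every 1 s (default 5 s)
vm.dirty_background_bytes = 67108864  # start background writeback beyond 64 MB of dirty data
```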
I fail to understand how multiplying both the segments and time would solve
the latency problem. If I set 30 segments then it takes 20 minutes to fill
them, and if I put the timeout at 15min then I'll have to wait 15 minutes to
test.
a) The kernel's writeback logic only kicks in with delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has an
impact on latency as well as throughput. c) The time it takes for
segments to be filled is mostly irrelevant. The phase that's very likely
causing trouble for you is the fsyncs issued at the end of a
checkpoint.
There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.
Why would they be written by the kernel if bgwriter has not sent them??
I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need checkpointing are written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.
The latency problem is almost certainly created by the fsync()s
mentioned above. When they're executed, the kernel starts flushing out a
lot of dirty buffers at once, creating very deep IO queues which make
it take long to process synchronous additions (WAL flushes, reads) to
that queue.
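For what it's worth, the pacing described above can be sketched as follows;
this is a simplified model in the spirit of PostgreSQL's
IsCheckpointOnSchedule() logic, not the real code, and the names are made up:

```python
# Toy model of spread-checkpoint pacing: the checkpointer compares its
# write progress against the fraction of its time budget already used,
# and only hurries when it falls behind schedule.

def on_schedule(buffers_written, buffers_total, elapsed_s,
                checkpoint_timeout_s, completion_target):
    """True if the fraction of buffers written is at least the fraction
    of the write-phase budget (timeout * target) already elapsed."""
    write_progress = buffers_written / buffers_total
    time_progress = elapsed_s / (checkpoint_timeout_s * completion_target)
    return write_progress >= time_progress

# timeout = 300 s, target = 0.5: the write phase should take ~150 s.
print(on_schedule(500, 1000, 75, 300, 0.5))   # half written at half budget: True
print(on_schedule(100, 1000, 75, 300, 0.5))   # behind schedule: False
```

Crucially, only the write phase is paced this way; the fsync()s at the end are
issued without any pacing.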
c) If checkpoint's are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.
I think that what you suggest would only delay the issue, not solve it.
The amount of dirty data that needs to be flushed is essentially bounded.
If the stalls are of roughly the same magnitude (say, within a factor of
two), but the smaller one happens once a minute and the larger once an
hour, the once-an-hour one will obviously give better latency for many,
many more transactions.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 26, 2014 at 12:53 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Given the small flow of updates, I do not think that there should be
reason to get that big a write contention between WAL & checkpoint.
I tried with "full_page_writes = off" for 500 seconds: same overall
behavior, 8.5% of transactions are stuck (instead of 10%). However in
detail pg_stat_bgwriter is quite different:
buffers_checkpoint = 13906
buffers_clean = 20748
buffers_backend = 472
That seems to suggest that bgwriter did some stuff for once, but it did not
change the result much in the end. This would imply that my suggestion
to make bgwriter write more would not fix the problem alone.
I think the reason could be that in most cases bgwriter
passes the sync responsibility to the checkpointer rather than
doing it itself, which causes an IO storm during checkpoint;
another thing is that it will not even proceed to write
a dirty buffer unless the refcount and usage_count of the
buffer are zero. I see some merit in your point, which
is to make bgwriter more useful than its current form.
I can see 3 top-level points to think about, where improvement
in any of them might improve the current situation:
a. Scanning of the buffer pool to find dirty buffers that can
be flushed.
b. Deciding on the criteria for flushing a buffer.
c. Sync of buffers.
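Point b. is what keeps bgwriter idle under this load; a toy model of the
flush condition (the field names are illustrative, not the real structs):

```python
# A dirty buffer is only flushed by bgwriter when it is unpinned
# (refcount == 0) and its usage_count has decayed to zero -- so
# frequently re-used buffers are skipped and left for the checkpoint.

from dataclasses import dataclass

@dataclass
class Buffer:
    dirty: bool
    refcount: int     # pin count: number of backends using the buffer
    usage_count: int  # clock-sweep popularity counter

def bgwriter_can_flush(buf: Buffer) -> bool:
    return buf.dirty and buf.refcount == 0 and buf.usage_count == 0

pool = [
    Buffer(dirty=True,  refcount=0, usage_count=0),  # flushable
    Buffer(dirty=True,  refcount=0, usage_count=3),  # popular: skipped
    Buffer(dirty=True,  refcount=1, usage_count=0),  # pinned: skipped
    Buffer(dirty=False, refcount=0, usage_count=0),  # clean: nothing to do
]
print(sum(bgwriter_can_flush(b) for b in pool))  # only 1 of the 4 is written
```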
With "synchronous_commit = off", the situation is much improved, with
only 0.3% of transactions stuck. Not a surprise. However, I would not
recommand that as a solution:-)
Yeah, actually it was just to test what the actual problem is.
I will also not recommend such a solution; however it at least
gives us an indication that due to IO during checkpoint, backends
are not even able to flush the comparatively small WAL data.
How about keeping WAL (pg_xlog) on a separate filesystem,
maybe via creating a symlink or, if possible, with the -X option of initdb?
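A sketch of both variants of this suggestion; the paths are made up, and for
the symlink variant the server must be stopped first:

```shell
# Variant 1: at cluster creation time, put WAL on its own filesystem.
initdb -D /srv/pg/data -X /mnt/waldisk/pg_xlog

# Variant 2: move an existing cluster's pg_xlog and symlink it back.
pg_ctl -D /srv/pg/data stop
mv /srv/pg/data/pg_xlog /mnt/waldisk/pg_xlog
ln -s /mnt/waldisk/pg_xlog /srv/pg/data/pg_xlog
pg_ctl -D /srv/pg/data start
```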
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Andres,
[...]
I think you're misunderstanding how spread checkpoints work.
Yep, definitely:-) On the other hand, I thought I was seeking something
"simple", namely correct latency under small load, which I would expect out
of the box.
What you describe is reasonable, and is more or less what I was hoping
for, although I thought that bgwriter was involved from the start and
checkpoint would only do what is needed in the end. My mistake.
When the checkpointer process starts a spread checkpoint it first writes
all buffers to the kernel in a paced manner.
That pace is determined by checkpoint_completion_target and
checkpoint_timeout.
This pacing does not seem to work, even at a slow pace.
If you have a stall of roughly the same magnitude (say a factor
of two different), the smaller once a minute, the larger once an
hour. Obviously the once-an-hour one will have a better latency in many,
many more transactions.
I do not believe that delaying writes to disk as much as possible is a
viable strategy for handling a small load. However, to show my good will, I
have tried to follow your advice: I've launched a 5000 seconds test with 50
segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
than 1/10 of the maximum throughput.
There are only two time-triggered checkpoints:
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 48725 buffers (47.6%);
1 transaction log file(s) added, 0 removed, 0 recycled;
write=1619.750 s, sync=27.675 s, total=1647.932 s;
sync files=14, longest=27.593 s, average=1.976 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 22533 buffers (22.0%);
0 transaction log file(s) added, 0 removed, 23 recycled;
write=826.919 s, sync=9.989 s, total=837.023 s;
sync files=8, longest=6.742 s, average=1.248 s
For the first one, 48725 buffers is 380 MB. 1800 * 0.9 = 1620 seconds to
complete, so that means about 30 buffer writes per second... should be ok.
However the sync costs 27 seconds nevertheless, and the server was more or
less offline for about 30 seconds flat. For the second one, 180 MB to
write, 10 seconds offline. For some reason the target time is reduced. I
have also tried the "deadline" IO scheduler, which makes more sense than
the default "cfq", but the result was similar. Not sure how software RAID
interacts with IO scheduling, though.
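The back-of-envelope numbers above check out, assuming the default 8 kB
block size:

```python
# Sanity-check the first checkpoint: 48725 buffers over a write phase of
# checkpoint_timeout * checkpoint_completion_target = 1800 s * 0.9.
BLOCK_KB = 8                    # default PostgreSQL block size
buffers = 48725
write_budget_s = 30 * 60 * 0.9  # 1620 s

mb_written = buffers * BLOCK_KB / 1024
print(round(mb_written))                 # ~381 MB
print(round(buffers / write_budget_s))   # ~30 buffer writes per second
```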
Overall result: over the 5000s test, I have lost (i.e. more than 200ms
behind schedule) more than 2.5% of transactions (1/40). Due to the
unfinished cycle, the long term average is probably about 3%. Although it
is better than 10%, it is not good. I would expect/hope for something
pretty close to 0, even with ext4 on Linux, for a dedicated host which has
nothing else to do but handle two dozen transactions per second.
Current conclusion: I have not found any way to improve the situation to
"good" with parameters from the configuration. Currently a small load
results in periodic offline time, which can be delayed but not avoided. The
delaying tactic results in less frequent but longer downtime; I prefer very
short, frequent downtime instead.
I really think that something is amiss. Maybe pg does not handle pacing as
it should.
For the record, a 25 tps bench with a "small" config (default 3 segments,
5 min timeout, 0.5 completion target) and with this running in parallel:
while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql
results in "losing" only 0.01% of transactions (12 transactions out of
125893 were behind by more than 200 ms over 5000 seconds). Although you may
think it stupid, from my point of view it shows that it is possible to
coerce pg to behave.
With respect to the current status:
(1) the ability to put checkpoint_timeout to values smaller than 30s could
help, although obviously there would be other consequences. But the
ability to avoid periodic offline time looks like a desirable objective.
(2) I still think that a parameter to force bgwriter to write more stuff
could help, but this is not tested.
(3) Any other effective idea to configure for responsiveness is welcome!
If someone wants to repeat these tests, it is easy and only takes a few
minutes:
sh> createdb test
sh> pgbench -i -s 100 -F 95 test
sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out
Note: the -L option to limit latency is a submitted patch. Without it,
unresponsiveness shows up as ever-increasing lag times.
--
Fabien.