checkpointer continuous flushing
Hello pg-devs,
This patch is a simplified and generalized version of Andres Freund's
August 2014 patch for flushing while writing during checkpoints, with some
documentation and configuration warnings added.
For the initial patch, see:
/messages/by-id/20140827091922.GD21544@awork2.anarazel.de
For the whole thread:
/messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto
The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.
Flushes are issued along with pg's throttled writes instead of waiting for the
checkpointer's final "fsync", which induces occasional stalls. From
"pgbench -P 1 ...", such stalls look like this:
progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043 # ok
progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828 # stalled
progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346 # ...
progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964 # restart
progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326 # ok
I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.
There are two implementations: the first, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below ran
on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...
The Linux-specific "sync_file_range" approach was suggested, among other ideas,
by Theodore Ts'o on Robert Haas's blog in March 2014:
http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html
Two GUC variables control whether the feature is activated for writes of
dirty pages issued by the checkpointer and the bgwriter. Given that the
settings may improve or degrade performance, having GUCs seems justified. In
particular, the stalling issue disappears with SSDs.
The effect is significant on a series of tests shown below with scale-10
pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw
RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and
checkpoint_timeout=30s, unless stated otherwise.
Note: I know that this checkpoint_timeout is too small for a normal
config, but the point is to test how checkpoints behave, so the test
triggers as many checkpoints as possible, hence the minimum timeout
setting. I have also done some tests with larger timeouts.
(1) THROTTLED PGBENCH
The objective of the patch is to reduce the latency of transactions
under a moderate load. This first series of tests focuses on that point with
the help of pgbench -R (rate) and -L (skip/count late transactions).
The measure counts transactions which were skipped or beyond the expected
latency limit while targeting a transaction rate.
* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during
100 seconds, and latency limit is 100 ms), over 256 runs, 7 hours per case:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 6.5 %
off | on | 6.1 %
on | off | 0.4 %
on | on | 0.4 %
* Same as above (100 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10min:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 1.3 %
off | on | 1.5 %
on | off | 0.6 %
on | on | 0.6 %
* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 8.0 %
off | on | 8.0 %
on | off | 0.4 %
on | on | 0.4 %
* Same as above (150 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10min:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 1.7 %
off | on | 1.9 %
on | off | 0.7 %
on | on | 0.6 %
Turning "checkpoint_flush_to_disk = on" significantly reduces the number
of late transactions. These late transactions are not uniformly distributed,
but rather cluster around times when pg is stalled, i.e. more or less
unresponsive.
bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because the shared_buffers size is much larger than the
database, so the bgwriter is seldom active.
(2) FULL SPEED PGBENCH
This is not the target use case, but it seems necessary to assess the
impact of these options on tps figures and their variability.
* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.
flush | performance on ...
cp | bgw | 512 100-seconds runs | 1s intervals (over 51200 seconds)
off | off | 691 +- 36 tps | 691 +- 236 tps
off | on | 677 +- 29 tps | 677 +- 230 tps
on | off | 655 +- 23 tps | 655 +- 130 tps
on | on | 657 +- 22 tps | 657 +- 130 tps
On this first test, setting checkpoint_flush_to_disk reduces the performance by
5%, but the per-second standard deviation is nearly halved, i.e. the
performance is more stable over the runs, although lower.
The effect of bgwriter_flush_to_disk is inconclusive.
* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with
checkpoint_timeout=10min and shared_buffers=256MB (at least 6 checkpoints
during the run, probably more because segments are filled more often than
every 10 minutes):
flush | performance ... (stddev over per-second tps)
cp | bgw |
off | off | 877 +- 179 tps
off | on | 880 +- 183 tps
on | off | 896 +- 131 tps
on | on | 888 +- 132 tps
On this second test, setting checkpoint_flush_to_disk seems to slightly
improve performance (maybe 2%?) and significantly reduces variability,
so it looks like a good move.
* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)
flush | performance on ...
cp | bgw | 32 100-seconds runs | 1s intervals (over 3200 seconds)
off | off | 1970 +- 60 tps | 1970 +- 783 tps
off | on | 1928 +- 61 tps | 1928 +- 813 tps
on | off | 1578 +- 45 tps | 1578 +- 631 tps
on | on | 1594 +- 47 tps | 1594 +- 618 tps
On this test, both the average and the standard deviation are reduced by 20%.
This does not look like a win.
CONCLUSION
This approach is simple and significantly improves pg fsync behavior under
moderate load, where the database stays mostly responsive. Under full load,
the situation may be improved or degraded; it depends.
OTHER OPTIONS
Another idea suggested by Theodore Ts'o seems impractical: playing with the
Linux io-scheduler priority (ioprio_set) looks relevant only with the
"cfq" scheduler on actual hard disks; it does not work with other
schedulers, especially "deadline" which seems more advisable for Pg, nor
with hardware RAID, which is a common setting.
Also, Theodore Ts'o suggested using "sync_file_range" to check whether
the writes have reached the disk, and possibly delaying the actual
fsync/checkpoint conclusion if not... I have not tried that; the
implementation is not as trivial, and I'm not sure what to do when the
completion target is approaching, but it could be an interesting
option to investigate. Preliminary tests with a sleep added between the
writes and the final fsync did not yield very good results.
I've also played with numerous other options (changing checkpointer
throttling parameters, reducing checkpoint timeout to 1 second, playing
around with various kernel settings), but that did not seem to be very
effective for the problem at hand.
I also attached a test script I used, which can be adapted if someone wants
to collect some performance data. I also have some basic scripts to
extract and compute stats; ask if needed.
--
Fabien.
Hi Fabien,
On 2015-06-01 PM 08:40, Fabien COELHO wrote:
Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.
bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because pg shared_buffers size is much larger than the database,
so the bgwriter is seldom active.
Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello Amit,
Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?
It is currently "*_flush_to_disk". In Andres Freund's version the name is
"sync_on_checkpoint_flush", but I did not find it very clear. Using
"*_flush_on_write" instead, as you suggest, would be fine as well; it
emphasizes when/how it occurs instead of the final "destination",
why not...
About words: a checkpoint "write"s pages, but this really means passing the
pages to the memory manager, which will think about it... "flush" seems to
suggest a more effective write, but really it may mean the same: the page
is just passed to the OS. So "write/flush" is really "to OS" and not "to
disk". I like the data to be on "disk" in the end, and as soon as
possible, hence the choice to emphasize that point.
Now I would really be okay with anything that people find simple to
understand, so any opinion is welcome!
--
Fabien.
Hi,
It's nice to see the topic being picked up.
If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental. Did you
consider combining both approaches?
Greetings,
Andres Freund
On Mon, Jun 1, 2015 at 5:10 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello pg-devs,
This patch is a simplified and generalized version of Andres Freund's
August 2014 patch for flushing while writing during checkpoints, with some
documentation and configuration warnings added.
For the initial patch, see:
/messages/by-id/20140827091922.GD21544@awork2.anarazel.de
For the whole thread:
/messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto
The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 					buf->tag.forkNum,
 					buf->tag.blockNum,
 					bufToWrite,
-					false);
+					false,
+					flush_to_disk);
Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?
I remember some time back there was some discussion regarding
sorting writes during checkpoint; one idea could be to try this idea
along with that patch. I just saw that Andres has also given the same
suggestion, which indicates that it is important to look at both
things together.
Also, another related point here is that I think currently even fsync
requests are not in the order of the files as they are stored on disk,
so that also might cause random I/O?
Yet another idea could be to allow the BGWriter to also fsync the dirty
buffers; that may have the side effect of not being able to clear the dirty
pages at the speed required by the system, but I think if that happens one
can think of having multiple BGWriter tasks.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Andres,
If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental. Did you
consider combining both approaches?
Yes, I thought that it was a more complex patch with uncertain/less clear
benefits, and as this simpler version was already effective enough as it
was, I decided to start with that and try to have reasonable proof of
benefits so that it could get through.
--
Fabien.
Hello Amit,
[...]
The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.
Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?
Yep, probably. Under "moderate load" this is not an issue. The
io-scheduler and other hd firmware will probably reorder writes anyway.
Also, if several data items are updated together, they are likely to
already be neighbours in memory as well as on disk.
I remember sometime back there was some discusion regarding
sorting writes during checkpoint, one idea could be try to
check this idea along with that patch. I just saw that Andres has
also given same suggestion which indicates that it is important
to see both the things together.
I would rather separate them, unless this is a blocker. This version seems
already quite effective and very light. ISTM that adding a sort phase
would mean reworking significantly how the checkpointer processes pages.
Also here another related point is that I think currently even fsync
requests are not in order of the files as they are stored on disk so
that also might cause random I/O?
I think that currently the fsync is on the file handle, so what happens
depends on how fsync is implemented by the system.
Yet another idea could be to allow BGWriter to also fsync the dirty
buffers,
ISTM that it is done with this patch with "bgwriter_flush_to_disk=on".
that may have side impact of not able to clear the dirty pages at speed
required by system, but I think if that happens one can think of having
multiple BGwriter tasks.
--
Fabien.
On 2015-06-02 15:15:39 +0200, Fabien COELHO wrote:
Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?
Yep, probably. Under "moderate load" this is not an issue. The io-scheduler
and other hd firmware will probably reorder writes anyway.
They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.
Also, if several
data are updated together, probably they are likely to be already neighbours
in memory as well as on disk.
No, that's not how it'll happen outside of simplistic cases where you
start with an empty shared_buffers. Shared buffers are maintained by a
simplified LRU, so how often individual blocks are touched will define
the buffer replacement.
I remember sometime back there was some discusion regarding
sorting writes during checkpoint, one idea could be try to
check this idea along with that patch. I just saw that Andres has
also given same suggestion which indicates that it is important
to see both the things together.
I would rather separate them, unless this is a blocker.
I think it is a blocker.
This version seems
already quite effective and very light. ISTM that adding a sort phase would
mean reworking significantly how the checkpointer processes pages.
Meh. The patch for that wasn't that big.
The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.
Greetings,
Andres Freund
Hello Andres,
I would rather separate them, unless this is a blocker.
I think it is a blocker.
Hmmm. This is an argument...
This version seems already quite effective and very light. ISTM that
adding a sort phase would mean reworking significantly how the
checkpointer processes pages.
Meh. The patch for that wasn't that big.
Hmmm. I think it should be implemented as Tom suggested, that is per
chunks of shared buffers, in order to avoid allocating a "large" amount
of memory.
The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.
ISTM that the two aspects are orthogonal, which would suggest two GUCs
anyway.
--
Fabien.
On 2015-06-02 15:42:14 +0200, Fabien COELHO wrote:
This version seems already quite effective and very light. ISTM that
adding a sort phase would mean reworking significantly how the
checkpointer processes pages.
Meh. The patch for that wasn't that big.
Hmmm. I think it should be implemented as Tom suggested, that is per chunks
of shared buffers, in order to avoid allocating a "large" memory.
I don't necessarily agree. But that's really just a minor implementation
detail. The actual problem is sorting & fsyncing in a way that deals
efficiently with tablespaces, i.e. doesn't write to tablespaces
one-by-one. Not impossible, but it requires some thought.
The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.
ISTM that the two aspects are orthogonal, which would suggests two gucs
anyway.
They're pretty closely linked in their performance impact. IMO this
feature, if done correctly, should result in better performance in 95+%
of the workloads and be enabled by default. And that'll not be possible
without actually writing mostly sequentially.
It's also not just the sequential writes making this important; it's
also that it allows doing the final fsync() of each individual segment
as soon as its last buffer has been written out. That's important
because the file will then have accumulated fewer writes issued
independently of the checkpoint (i.e. backends writing out dirty
buffers), writes which would make the final fsync more expensive.
It might be that we want two different GUCs, but I don't think we can
release without both features.
Greetings,
Andres Freund
Hmmm. I think it should be implemented as Tom suggested, that is per chunks
of shared buffers, in order to avoid allocating a "large" memory.
I don't necessarily agree. But that's really just a minor implementation
detail.
Probably.
The actual problem is sorting & fsyncing in a way that deals efficiently
with tablespaces, i.e. doesn't write to tablespaces one-by-one.
Not impossible, but it requires some thought.
Hmmm... I would have neglected this point in a first approximation,
but I agree that not interleaving tablespaces could indeed lose some
performance.
ISTM that the two aspects are orthogonal, which would suggests two gucs
anyway.
They're pretty closely linked from their performance impact.
Sure.
IMO this feature, if done correctly, should result in better performance
in 95+% of the workloads
To demonstrate that would require time...
and be enabled by default.
I did not have such an ambition with the submitted patch:-)
And that'll not be possible without actually writing mostly
sequentially.
It's also not just the sequential writes making this important, it's
also that it allows to do the final fsync() of the individual segments
as soon as their last buffer has been written out.
Hmmm... I'm not sure this would have a large impact. The writes are
throttled as much as possible, so the fsync will catch plenty of other
writes anyway, if there are any.
--
Fabien.
On 2015-06-02 17:01:50 +0200, Fabien COELHO wrote:
The actual problem is sorting & fsyncing in a way that deals efficiently
with tablespaces, i.e. doesn't write to tablespaces one-by-one.
Not impossible, but it requires some thought.
Hmmm... I would have neglected this point in a first approximation,
but I agree that not interleaving tablespaces could indeed loose some
performance.
I think it'll be a hard to diagnose performance regression. So we'll
have to fix it. That argument actually was the blocker in previous
attempts...
IMO this feature, if done correctly, should result in better performance
in 95+% of the workloads
To demonstrate that would require time...
Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluating performance for those.
and be enabled by default.
I did not had such an ambition with the submitted patch:-)
I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when completely unnecessarily.
And that'll not be possible without actually writing mostly sequentially.
It's also not just the sequential writes making this important, it's also
that it allows to do the final fsync() of the individual segments as soon
as their last buffer has been written out.
Hmmm... I'm not sure this would have a large impact. The writes are
throttled as much as possible, so fsync will catch plenty other writes
anyway, if there are some.
That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.
Greetings,
Andres Freund
IMO this feature, if done correctly, should result in better performance
in 95+% of the workloads
To demonstrate that would require time...
Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluating performance for those.
I did spend time (well, a machine spent time, really) to collect some
convincing data for the simple version without sorting, to demonstrate
that it brings a clear value, which seems not to be enough...
I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when completely unnecessarily.
Hmmm.
My opinion is that throughput is given too much attention in general, but
if both can be kept/improved, this would be easier to sell, obviously.
It's also not just the sequential writes making this important, it's also
that it allows to do the final fsync() of the individual segments as soon
as their last buffer has been written out.
Hmmm... I'm not sure this would have a large impact. The writes are
throttled as much as possible, so fsync will catch plenty other writes
anyway, if there are some.
That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.
Statistically, I think that it would reduce the number of unrelated writes
taken in a fsync by about half: the last table to be written on a
tablespace, at the end of the checkpoint, will have accumulated
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
time, while the first table will have avoided most of them.
--
Fabien.
On 2015-06-02 18:59:05 +0200, Fabien COELHO wrote:
IMO this feature, if done correctly, should result in better performance
in 95+% of the workloads
To demonstrate that would require time...
Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluating performance for those.
I did spend time (well, a machine spent time, really) to collect some
convincing data for the simple version without sorting, to demonstrate
that it brings a clear value, which seems not to be enough...
"which seems not to be enough" - man. It's trivial to make things
faster/better/whatever if you don't care about regressions in other
parts. And if we'd add a guc for each of these cases we'd end up with
thousands of them.
My opinion is that throughput is given too much attention in general, but if
both can be kept/improved, this would be easier to sell, obviously.
Your priorities are not everyone's. That's life.
That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.
Statistically, I think that it would reduce the number of unrelated writes
taken in a fsync by about half: the last table to be written on a
tablespace, at the end of the checkpoint, will have accumulated
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
time, while the first table will have avoided most of them.
That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.
Hi,
On 2015-06-02 PM 07:19, Fabien COELHO wrote:
Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?
It is currently "*_flush_to_disk". In Andres Freund version the name is
"sync_on_checkpoint_flush", but I did not found it very clear. Using
"*_flush_on_write" instead as your suggest, would be fine as well, it
emphasizes the "when/how" it occurs instead of the final "destination", why
not...
About words: checkpoint "write"s pages, but this really mean passing the pages
to the memory manager, which will think about it... "flush" seems to suggest a
more effective write, but really it may mean the same, the page is just passed
to the OS. So "write/flush" is really "to OS" and not "to disk". I like the
data to be on "disk" in the end, and as soon as possible, hence the choice to
emphasize that point.
Now I would really be okay with anything that people find simple to
understand, so any opinion is welcome!
It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by the one-big-sync-at-the-end-of-writes by
doing the sync in step with the writes (which do abide by
checkpoint_completion_target). Given that impression, it seems
*_sync_on_write may even do the job.
Again, this is a minor issue.
By the way, I tend to agree with others here that a good balance needs to be
found, such that this sync-blocks-one-at-a-time-in-random-order approach
does not hurt generalized workloads too much, although it does seem to help
with the latency problem you set out to solve.
Thanks,
Amit
On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
[...]
The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.
Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?
Yep, probably. Under "moderate load" this is not an issue. The
io-scheduler and other hd firmware will probably reorder writes anyway.
Also, if several data are updated together, probably they are likely to be
already neighbours in memory as well as on disk.
I remember sometime back there was some discusion regarding
sorting writes during checkpoint, one idea could be try to
check this idea along with that patch. I just saw that Andres has
also given same suggestion which indicates that it is important
to see both the things together.
I would rather separate them, unless this is a blocker. This version
seems already quite effective and very light. ISTM that adding a sort phase
would mean reworking significantly how the checkpointer processes pages.
I agree with you that if we have to add a sort phase there is additional
work, and that work could be significant depending on the design we
choose. However, without it, this patch can have an impact on many kinds
of workloads: even in your own mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
has shown a 20% degradation, which is quite significant, and that test
also seems representative of the workload many real-world users will run.
Now one can say that for such workloads one should turn the new knob off,
but in reality it could be difficult to predict whether the load will
always be moderate. I think users might be able to predict that at the
table level, but in spite of that I don't think having any such knob
gives us a ticket to flush the buffers in random order.
Also here another related point is that I think currently even fsync
requests are not in order of the files as they are stored on disk so
that also might cause random I/O?
I think that currently the fsync is on the file handler, so what happens
depends on how fsync is implemented by the system.
That can also lead to random I/O if the fsyncs for the different files are
not issued in the order in which the files are actually stored on disk.
Yet another idea could be to allow BGWriter to also fsync the dirty
buffers,
ISTM That it is done with this patch with "bgwriter_flush_to_disk=on".
I think the patch just issues an async operation, not an actual flush. The
reason I have suggested this is that in your tests, when checkpoint_timeout
is small, there seems to be a good gain in performance; that means that if
we keep flushing dirty buffers at regular intervals the system performs
well, and the BGWriter is the process where that can be done conveniently,
apart from checkpoints. One might think that the same could be achieved by
using a shorter checkpoint_timeout interval, so why do these incremental
flushes in the bgwriter; but in reality a checkpoint is responsible for
other things besides dirty buffers, so we can't leave everything until a
checkpoint happens.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
That might be the case in a database with a single small table; i.e.
where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant
part of the writes issued independently from checkpointing will be
outside the processing of the individual segment.
Statistically, I think that it would reduce the number of unrelated writes
taken in a fsync by about half: the last table to be written on a
tablespace, at the end of the checkpoint, will have accumulated
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
time, while the first table will have avoided most of them.That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.
I meant my argument with "continuous flushing" activated, so there is no
up-to-30-seconds delay induced by the memory manager. Hmmm, maybe I did not
understand your argument.
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello Amit,
It is currently "*_flush_to_disk". In Andres Freund's version the name is
"sync_on_checkpoint_flush", but I did not find it very clear. Using
"*_flush_on_write" instead, as you suggest, would be fine as well; it
emphasizes the "when/how" it occurs instead of the final "destination", so
why not...

[...]
It seems 'sync' gets closer to what I really wanted 'flush' to mean. If
I understand this and the previous discussion(s) correctly, the patch
tries to alleviate the problems caused by one-big-sync-at-the-end-of-writes
by doing the sync in step with writes (which do abide by
the checkpoint_completion_target). Given that impression, it seems
*_sync_on_write may even do the job.
I disagree with this one, because the sync is only *initiated*, not
completed. For this reason I think that "flush" is the better word. I
understand "sync" as "committed to disk". For the data to actually be
synced, the call would need the "wait after" option, which is a partial
"fsync", but that would be terrible for performance, as all checkpointed
pages would be written one by one, without any opportunity for reordering
them.
For what it's worth and for the record, Linux sync_file_range
documentation says "This is an asynchronous flush-to-disk operation" to
describe the corresponding option. This is probably where I took it.
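To make the "initiated, not completed" distinction concrete, here is a
minimal sketch (my own illustration, not code from the patch; the page size
and helper name are made up) of a write followed by a non-waiting
sync_file_range call, which only queues the dirty pages for writeback and
returns immediately:

```c
#define _GNU_SOURCE        /* for sync_file_range(), Linux-specific */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one 8 kB "page" and ask the kernel to start flushing it.
 * SYNC_FILE_RANGE_WRITE only *initiates* writeback: the call returns
 * without waiting for the I/O to complete, which is why "flush" fits
 * better than "sync" here. Durability still requires a later fsync. */
static int write_and_flush(int fd, off_t offset)
{
    char page[8192];
    memset(page, 'x', sizeof(page));

    if (pwrite(fd, page, sizeof(page), offset) != (ssize_t) sizeof(page))
        return -1;

    return sync_file_range(fd, offset, sizeof(page), SYNC_FILE_RANGE_WRITE);
}
```

Adding SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER would give
the "wait after" semantics mentioned above, i.e. the synchronous,
one-page-at-a-time behaviour warned against.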
So two contenders:
*_flush_to_disk
*_flush_on_write
--
Fabien.
I agree with you that if we have to add a sort phase there is additional
work, and that work could be significant depending on the design we choose.
However, without it this patch can have an impact on many kinds of
workloads; even in your mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
has shown a 20% degradation, which is quite significant, and the test also
seems representative of the workload that many real-world users will run.
Yes, I do agree with the 4 clients, but I doubt that many users run their
application at maximum available throughput all the time (like always
driving with the foot to the floor). So for me throttled runs are more
representative of real life.
Now one can say that for such workloads one should turn the new knob off,
but in reality it could be difficult to predict whether the load is always
moderate.

Hmmm. The switch says "I prefer stable (say latency-bounded) performance";
if you run a web site, you probably want that.
Anyway, I'll look at sorting when I have some time.
--
Fabien.
Fabien,
On 2015-06-03 PM 02:53, Fabien COELHO wrote:
It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by one-big-sync-at-the-end-of-writes by doing
the sync in step with writes (which do abide by the
checkpoint_completion_target). Given that impression, it seems
*_sync_on_write may even do the job.

I disagree with this one, because the sync is only *initiated*, not
completed. For this reason I think that "flush" is the better word. I
understand "sync" as "committed to disk". For the data to actually be
synced, the call would need the "wait after" option, which is a partial
"fsync", but that would be terrible for performance, as all checkpointed
pages would be written one by one, without any opportunity for reordering
them.

For what it's worth and for the record, the Linux sync_file_range
documentation says "This is an asynchronous flush-to-disk operation" to
describe the corresponding option. This is probably where I took it.
Ah, okay! I didn't quite think about the async aspect here. But I sure do
hope that the added mechanism turns out to be *less* async than the
kernel's own dirty-cache handling, to achieve the hoped-for gain.
So two contenders:
*_flush_to_disk
*_flush_on_write
Yep!
Regards,
Amit