Partitioned checkpointing
Hi All,
Recently, I found a paper titled "Segmented Fuzzy Checkpointing for
Main Memory Databases," published in 1996 at the ACM Symposium on Applied
Computing, which inspired me to implement a similar mechanism in PostgreSQL.
Since the early evaluation results obtained on a 16-core server were beyond
my expectations, I have decided to submit a patch so that the mechanism can
be discussed by interested community members.
The attached patch is a PoC (or maybe prototype) implementation of
partitioned checkpointing on 9.5alpha2. The term 'partitioned' is used here
instead of 'segmented' because I feel 'segmented' is easily confused with
'xlog segment,' etc. The term 'partitioned' does not have that problem, and
it evokes almost the same concept as 'buffer partition,' so I think it is
suitable.
The background and my motivation: the performance dip due to checkpoints is
a major concern, and thus it is valuable to mitigate this issue. In fact,
many countermeasures have been attempted against it. As far as I know,
those countermeasures so far focus mainly on mitigating the adverse impact
of the disk writes that implement the buffer sync; the recent, highly
regarded 'checkpointer continuous flushing' is a typical example. On the
other hand, I feel that another source of the performance dip has not been
seriously addressed: what I call here the 'full-page-write rush' would be a
major issue. That is, the average size of transaction log (XLOG) records
jumps up sharply immediately after the beginning of each checkpoint,
saturating the WAL write path, including the disk(s) for $PGDATA/pg_xlog
and the WAL buffers.
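For context, the FPW decision itself is simple: a buffer's page must be
logged as a full-page image when its last-modified LSN does not exceed the
current RedoRecPtr. The following is only an illustrative sketch of why
this causes a rush (the names and LSN values are made up; the real test
lives in xloginsert.c):

```python
# Illustrative sketch (not PostgreSQL code): why FPW volume spikes
# right after a checkpoint begins.

def needs_full_page_write(page_lsn, redo_rec_ptr):
    # A page not yet WAL-logged since the checkpoint's REDO point must
    # be written as a full-page image so torn writes can be repaired.
    return page_lsn <= redo_rec_ptr

# Before a checkpoint: RedoRecPtr is old, so most updated pages have
# page_lsn > RedoRecPtr and need only a small delta record.
page_lsns = [4100, 4500, 4900, 5200]   # last-modified LSNs (made up)
before = sum(needs_full_page_write(lsn, 4000) for lsn in page_lsns)

# Right after a checkpoint begins, RedoRecPtr jumps forward, and almost
# every page fails the test on its next update, producing an 8 kB image.
after = sum(needs_full_page_write(lsn, 5000) for lsn in page_lsns)

print(before, after)  # 0 3
```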
In the following, I will describe early evaluation results and mechanism of
the partitioned checkpointing briefly.
1. Performance evaluation
1.1 Experimental setup
The configuration of the server machine was as follows.
CPU: Intel E5-2650 v2 (8 cores/chip) @ 2.60GHz x 2
Memory: 64GB
OS: Linux 2.6.32-504.12.2.el6.x86_64 (CentOS)
Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
raid1 of 2 SSDs for $PGDATA (other than pg_xlog)
PostgreSQL settings
shared_buffers = 28GB
wal_buffers = 64MB
checkpoint_timeout = 10min
max_wal_size = 128GB
min_wal_size = 8GB
checkpoint_completion_target = 0.9
Benchmark
pgbench -M prepared -N -P 1 -T 3600
The scaling factor was 1000. Both the number of clients (-c option) and
threads (-j option) were 120 for the synchronous-commit case and 96 for the
asynchronous-commit (synchronous_commit = off) case. These values were
chosen because they yielded the maximum throughput.
The server was connected via 1 Gb Ethernet to a client machine on which the
pgbench client program ran. Since the client machine was not saturated
during the measurement and thus hardly affected the results, its details
are not described here.
1.2 Early results
The measurement results shown here are latency average, latency stddev, and
throughput (tps), which are the output of the pgbench program.
1.2.1 synchronous_commit = on
(a) 9.5alpha2(original)
latency average: 2.852 ms
latency stddev: 6.010 ms
tps = 42025.789717 (including connections establishing)
tps = 42026.137247 (excluding connections establishing)
(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.815 ms
latency stddev: 2.317 ms
tps = 42575.301137 (including connections establishing)
tps = 42575.677907 (excluding connections establishing)
1.2.2 synchronous_commit = off
(a) 9.5alpha2(original)
latency average: 2.136 ms
latency stddev: 5.422 ms
tps = 44870.897907 (including connections establishing)
tps = 44871.215792 (excluding connections establishing)
(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.085 ms
latency stddev: 1.529 ms
tps = 45974.617724 (including connections establishing)
tps = 45974.973604 (excluding connections establishing)
1.3 Summary
Partitioned checkpointing produced a great improvement (reduction) in
latency stddev and a slight improvement in latency average and tps; there
was no performance degradation. Partitioned checkpointing therefore has a
stabilizing effect on operation. In fact, the throughput variation,
obtained with the -P 1 option, shows that the dips were mitigated in both
magnitude and frequency.
# Since I'm not sure whether it is OK to send an email to this mailing list
with attachments other than a patch, I refrain for now from attaching the
raw results (200K bytes of text per case) and the result graphs in .jpg or
.epsf format illustrating the throughput variations. If it is OK, I'd be
pleased to show the results in those formats.
2. Mechanism
As the name suggests, 'partitioned checkpointing' conducts the buffer sync
not for all buffers at once but for the buffers belonging to one partition
per invocation of the checkpointer. In the following description, the
number of partitions is denoted by N (N is fixed to 16 in the attached
patch).
2.1 Principles of operations
In order to preserve the semantics of traditional checkpointing, the
checkpointer invocation interval is changed to checkpoint_timeout / N. The
checkpointer carries out the buffer sync for buffer partition 0 at the
first invocation, for buffer partition 1 at the second invocation, and so
on. When the turn of buffer partition N-1 comes, i.e. the last round of a
series of buffer syncs, the checkpointer carries out the buffer sync for
that partition plus the other usual checkpoint operations, coded in
CheckPointGuts() in xlog.c.
The principle is, roughly speaking, that 1) checkpointing for buffer
partition 0 corresponds to the beginning of a traditional checkpoint, where
the XLOG location (LSN) is obtained and set to RedoRecPtr, and 2)
checkpointing for buffer partition N-1 corresponds to the end of a
traditional checkpoint, where the WAL files that are no longer needed (up
to the log segment preceding the one specified by the RedoRecPtr value) are
deleted or recycled.
The role of RedoRecPtr as the threshold that determines whether a FPW is
necessary moves to a new N-element array of XLogRecPtr, as the threshold
differs among partitions. The n-th element of the array is updated when the
buffer sync for partition n is carried out.
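A hypothetical sketch of the two rules above (the names, the LSN
arithmetic, and the numbers are illustrative, not taken from the patch):

```python
# Sketch of partitioned checkpointing: the checkpointer wakes every
# checkpoint_timeout / N seconds and syncs one of N partitions; the
# single RedoRecPtr becomes an N-element array of per-partition
# thresholds. All names and numbers are illustrative.

N = 16
CHECKPOINT_TIMEOUT = 600.0                # checkpoint_timeout = 10min
INTERVAL = CHECKPOINT_TIMEOUT / N         # one wakeup every 37.5 s

partition_redo_ptr = [0] * N              # replaces the scalar RedoRecPtr
current_lsn = 0                           # stand-in for the WAL insert LSN

def checkpointer_invocation(round_no):
    global current_lsn
    part = round_no % N
    current_lsn += 1000                   # WAL advances between wakeups
    # Record the new FPW threshold for this partition only, then sync
    # its buffers: b[part], b[part + N], b[part + 2N], ...
    partition_redo_ptr[part] = current_lsn
    if part == N - 1:
        # Last round: also perform the usual end-of-checkpoint work
        # (CheckPointGuts() in xlog.c) and recycle old WAL segments.
        pass

def needs_fpw(buffer_id, page_lsn):
    # The FPW threshold is looked up per partition, so only 1/N of the
    # buffers see their threshold jump at any single invocation.
    return page_lsn <= partition_redo_ptr[buffer_id % N]

checkpointer_invocation(0)                   # syncs partition 0
print(needs_fpw(0, 500), needs_fpw(1, 500))  # True False
```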
2.2 Drawbacks
'Partitioned checkpointing' works effectively in the situation where the
checkpointer is invoked by hitting checkpoint_timeout; the performance dip
is mitigated and the WAL size is unchanged (on average).
On the other hand, when the checkpointer is invoked by a trigger event
other than the timeout, the traditional checkpoint procedure, which syncs
all buffers at once, takes place, resulting in a performance dip. Also, the
WAL size for that checkpoint period (until the next invocation of the
checkpointer) will theoretically increase to 1.5 times that of the usual
case because of the increase in FPWs.
My opinion is that this is not serious, because it is preferable for the
checkpointer to be invoked by the timeout anyway, and thus usual systems
are supposed to be tuned to work under conditions that are favorable for
'partitioned checkpointing.'
3. Conclusion
The 'partitioned checkpointing' mechanism is expected to be effective in
mitigating the performance dip due to checkpoints. In particular, it is
noteworthy that the effect was observed on a server machine that uses SSDs
for $PGDATA, for which seek optimizations are not believed to be effective.
Therefore, this mechanism is worth further investigation, aiming at
implementation in a future PostgreSQL.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Hello Takashi-san,
I suggest that you have a look at the following patch submitted in June:
https://commitfest.postgresql.org/6/260/
And these two threads:
/messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto/
/messages/by-id/alpine.DEB.2.10.1506011320000.28433@sto/
Including many performance figures under different conditions in the second
thread. I'm not sure how it would interact with what you are proposing, but
it is also focused on improving Postgres availability, especially on HDD
systems, by changing how the checkpointer writes buffers, with sorting and
flushing.
Also, you may consider running some tests with the latest version of this
patch on your hardware, to complement Amit's tests?
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I feel that another source of the performance dip has not been seriously
addressed: what I call here the 'full-page-write rush' would be a major
issue. That is, the average size of transaction log (XLOG) records jumps up
sharply immediately after the beginning of each checkpoint, saturating the
WAL write path, including the disk(s) for $PGDATA/pg_xlog and the WAL
buffers.
On this point, you may have a look at this item:
https://commitfest.postgresql.org/5/283/
--
Fabien.
Feel free to do that. 200kB is well below this list's limits. (I'm not
sure how easy it is to open .epsf files in today's systems, but .jpg or
other raster image formats are pretty common.)
Thanks for your comment.
Please find two result graphs (for sync and async commit cases) in .jpg format.
I think it is obvious that performance dips were mitigated in both magnitude and frequency by introducing the partitioned checkpointing.
(In passing) one correction for my previous email.
Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
raid1 of 2 SSDs for $PGDATA (other than pg_xlog)
'raid1' is wrong and 'raid0' is correct.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: Álvaro Herrera
Sent: Friday, September 11, 2015 12:03 AM
To: Horikawa Takashi(堀川 隆)
Subject: Re: [HACKERS] Partitioned checkpointing
Fabien,
Thanks for your comment.
I'll check them and try to examine what is the same and what is different.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Hello Takashi-san,
I wanted to do some tests with this POC patch. For this purpose, it would
be nice to have a GUC which would allow activating this feature or not.
Could you provide a patch with such a GUC? I would suggest making the
number of partitions the GUC, so that choosing 1 would basically reflect
the current behavior.
Some general comments :
I understand that what this patch does is cutting the checkpoint of
buffers in 16 partitions, each addressing 1/16 of buffers, and each with
its own wal-log entry, pacing, fsync and so on.
I'm not sure why it would be much better; although I agree that it may have
some small positive influence on performance, I'm afraid it may also
degrade performance in some conditions. So I think that a better
understanding of why performance improves, and a focus on that, could help
obtain a more systematic gain.
This method interacts with the current proposal to improve checkpointer
behavior by avoiding random I/Os, but it could be combined with it.
I'm wondering whether the benefit you see is linked to the file flushing
behavior induced by fsyncing more often, in which case it is quite close to
the "flushing" part of the current "checkpoint continuous flushing" patch,
and could be redundant with, or less efficient than, what is done there,
especially as tests have shown that the effect of flushing is *much* better
on sorted buffers.
Another proposal around, suggested by Andres Freund I think, is that a
checkpoint could fsync files while checkpointing and not wait for the end
of the checkpoint. I think that it may also be one of the reasons why your
patch brings a benefit, but Andres' approach would be more systematic,
because there would be no need to fsync files several times (basically your
patch issues 16 fsyncs per file). This suggests that the "partitioning"
should be done at a lower level, from within CheckPointBuffers, which
would take care of fsyncing files some time after writing buffers to them
is finished.
--
Fabien.
On 11 September 2015 at 09:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Some general comments :
Thanks for the summary Fabien.
I understand that what this patch does is cutting the checkpoint of
buffers in 16 partitions, each addressing 1/16 of buffers, and each with
its own wal-log entry, pacing, fsync and so on.
I'm not sure why it would be much better; although I agree that it may have
some small positive influence on performance, I'm afraid it may also
degrade performance in some conditions. So I think that a better
understanding of why performance improves, and a focus on that, could help
obtain a more systematic gain.
I think it's a good idea to partition the checkpoint, but not this way.
Splitting with N=16 does nothing to guarantee the partitions are equally
sized, so there would likely be an imbalance that would reduce the
effectiveness of the patch.
This method interacts with the current proposal to improve checkpointer
behavior by avoiding random I/Os, but it could be combined with it.
I'm wondering whether the benefit you see is linked to the file flushing
behavior induced by fsyncing more often, in which case it is quite close to
the "flushing" part of the current "checkpoint continuous flushing" patch,
and could be redundant with, or less efficient than, what is done there,
especially as tests have shown that the effect of flushing is *much* better
on sorted buffers.
Another proposal around, suggested by Andres Freund I think, is that a
checkpoint could fsync files while checkpointing and not wait for the end
of the checkpoint. I think that it may also be one of the reasons why your
patch brings a benefit, but Andres' approach would be more systematic,
because there would be no need to fsync files several times (basically your
patch issues 16 fsyncs per file). This suggests that the "partitioning"
should be done at a lower level, from within CheckPointBuffers, which
would take care of fsyncing files some time after writing buffers to them
is finished.
The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.
The key point is that we spread out the fsyncs across the whole checkpoint
period.
I think we should write out all buffers for a particular file in one pass,
then issue one fsync per file. More than one fsync per file seems a bad
idea.
So we'd need logic like this:
1. Run through shared buffers and analyze the files contained in there
2. Assign files to one of N batches so we can make N roughly equal-sized
mini-checkpoints
3. Make N passes through shared buffers, writing out files assigned to each
batch as we go
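Step 2 could, for instance, be a greedy longest-processing-time assignment
over the per-file dirty-buffer counts gathered in step 1. A sketch under
that assumption (the file names and counts are made up):

```python
import heapq

def assign_batches(dirty_counts, n_batches):
    """Greedily assign each file to the currently smallest batch so the
    N mini-checkpoints end up roughly equal in size.

    dirty_counts: {filename: number of dirty buffers} from pass 1.
    Returns a list of N (total_size, [filenames]) batches.
    """
    batches = [(0, i, []) for i in range(n_batches)]  # (size, id, files)
    heapq.heapify(batches)
    # Largest files first gives the classic LPT approximation.
    for fname, count in sorted(dirty_counts.items(),
                               key=lambda kv: kv[1], reverse=True):
        size, i, files = heapq.heappop(batches)
        files.append(fname)
        heapq.heappush(batches, (size + count, i, files))
    return [(size, files)
            for size, _, files in sorted(batches, key=lambda b: b[1])]

counts = {"rel_a": 900, "rel_b": 500, "rel_c": 400, "rel_d": 200}
batches = assign_batches(counts, 2)
print(batches)  # [(1100, ['rel_a', 'rel_d']), (900, ['rel_b', 'rel_c'])]
```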
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/11/2015 03:56 PM, Simon Riggs wrote:
The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.
The key point is that we spread out the fsyncs across the whole checkpoint
period.
I doubt that's really what we want to do, as it defeats one of the purposes
of spread checkpoints. With spread checkpoints, we write the data to the
page cache and then let the OS actually write the data to the disk. This is
handled by the kernel, which marks the data as expired after some time
(say, 30 seconds) and then flushes it to disk. The goal is to have
everything already written to disk when we call fsync at the beginning of
the next checkpoint, so that the fsyncs are cheap and don't cause I/O
issues.
What you propose (spreading the fsyncs) significantly changes that, because
it reduces the amount of time the OS has for writing the data to disk in
the background to 1/N. That's a significant change, and I'd bet it's for
the worse.
I think we should write out all buffers for a particular file in one pass,
then issue one fsync per file. More than one fsync per file seems a bad
idea.
So we'd need logic like this:
1. Run through shared buffers and analyze the files contained in there
2. Assign files to one of N batches so we can make N roughly equal-sized
mini-checkpoints
3. Make N passes through shared buffers, writing out files assigned to each
batch as we go
What I think might work better is actually keeping the write/fsync phases
we have now, but instead of postponing the fsyncs until the next
checkpoint, spreading them out after the writes. So with target=0.5 we'd do
the writes in the first half, then the fsyncs in the other half. Of course,
we should sort the data like you propose, and issue the fsyncs in the same
order (so that the OS has time to write the data to the devices).
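As an illustration only (the names are made up), the proposed schedule for
one checkpoint interval could look like this: writes paced over the first
`target` fraction, and the fsyncs issued in the same sorted order over the
remainder:

```python
def checkpoint_schedule(files, timeout, target):
    """Pace writes over the first `target` fraction of the interval,
    then issue the per-file fsyncs in the same sorted order over the
    remaining fraction, so the OS still gets background time per file."""
    files = sorted(files)                      # sorted writes, as proposed
    write_window = timeout * target
    plan = []
    for i, f in enumerate(files):
        write_at = write_window * i / len(files)
        fsync_at = write_window + (timeout - write_window) * i / len(files)
        plan.append((f, write_at, fsync_at))
    return plan

# With target = 0.5 and a 600 s checkpoint_timeout, every write lands in
# the first 300 s and every fsync in the last 300 s.
plan = checkpoint_schedule(["b", "a", "d", "c"], timeout=600.0, target=0.5)
print(plan[0])  # ('a', 0.0, 300.0)
```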
I wonder how much the original paper (written in 1996) is effectively
obsoleted by spread checkpoints, but the benchmark results posted by
Horikawa-san suggest there's a possible gain. But perhaps partitioning
the checkpoints is not the best approach?
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Simon,
The idea to do a partial pass through shared buffers and only write a
fraction of dirty buffers, then fsync them is a good one.
Sure.
The key point is that we spread out the fsyncs across the whole checkpoint
period.
Yes, this is really Andres' suggestion, as I understood it.
I think we should be writing out all buffers for a particular file in one
pass, then issue one fsync per file. >1 fsyncs per file seems a bad idea.
This is one of the things done in the "checkpoint continuous flushing"
patch: as buffers are sorted, they are written per file, and in order
within a file, which helps get sequential writes instead of random writes.
See https://commitfest.postgresql.org/6/260/
However, for now the final fsync is not called; instead Linux is told that
the written buffers must be flushed, which is akin to an "asynchronous
fsync", i.e. it asks the kernel to move the data but does not wait for the
data to actually be written, as a blocking fsync would.
Andres' suggestion, which has some points in common with Takashi-san's
patch, is to also integrate the fsync into the buffer-writing process.
There are some details to think about, because it is probably not a good
idea to issue an fsync right after the corresponding writes; it is better
to wait for some delay before doing so, so the implementation is not
straightforward.
--
Fabien.
Hi,
Partitioned checkpointing has the significant disadvantage that it
increases random write I/O by the number of passes. Which is a bad idea,
*especially* on SSDs.
So we'd need logic like this
1. Run through shared buffers and analyze the files contained in there
2. Assign files to one of N batches so we can make N roughly equal sized
mini-checkpoints
3. Make N passes through shared buffers, writing out files assigned to
each batch as we go
That's essentially what Fabien's sorting patch does by sorting all
writes.
What I think might work better is actually keeping the write/fsync phases we
have now, but instead of postponing the fsyncs until the next checkpoint we
might spread them after the writes. So with target=0.5 we'd do the writes in
the first half, then the fsyncs in the other half. Of course, we should sort
the data like you propose, and issue the fsyncs in the same order (so that
the OS has time to write them to the devices).
I think the approach in Fabien's patch of enforcing that there's not very
much dirty data to flush, by forcing early cache flushes, is better. Having
gigabytes worth of dirty data in the OS page cache can have a massive
negative impact completely independent of fsyncs.
I wonder how much the original paper (written in 1996) is effectively
obsoleted by spread checkpoints, but the benchmark results posted by
Horikawa-san suggest there's a possible gain. But perhaps partitioning the
checkpoints is not the best approach?
I think it's likely that the patch will have only a very small effect if
applied on top of Fabien's patch (which will require some massaging, I'm
sure).
Greetings,
Andres Freund
Hello Fabien,
I wanted to do some tests with this POC patch. For this purpose, it would
be nice to have a GUC which would allow activating this feature or not.
Thanks.
Could you provide a patch with such a GUC? I would suggest making the
number of partitions the GUC, so that choosing 1 would basically reflect
the current behavior.
Sure.
Please find the attached patch.
A GUC parameter named 'checkpoint_partitions' is added.
It can be set to 1, 2, 4, 8, or 16.
The default is 16. (This is trivial at present, I think.)
Also, some definitions are moved from bufmgr.h to xlog.h; that change is
also trivial.
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Hi,
I understand that what this patch does is cutting the checkpoint
of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
its own wal-log entry, pacing, fsync and so on.
Right.
However,
The key point is that we spread out the fsyncs across the whole checkpoint
period.
this is not the key point of 'partitioned checkpointing,' I think. The
original purpose is to mitigate the full-page-write rush that occurs
immediately after the beginning of each checkpoint. The amount of FPW at
each checkpoint is reduced to 1/16 by partitioned checkpointing.
This method interacts with the current proposal to improve the
checkpointer behavior by avoiding random I/Os, but it could be combined.
I agree.
Splitting with N=16 does nothing to guarantee the partitions are equally
sized, so there would likely be an imbalance that would reduce the
effectiveness of the patch.
That may be right.
However, the current method was designed to split the buffers so as to
balance the load as equally as possible; the current patch splits the
buffers as
---
1st round: b[0], b[p], b[2p], … b[(n-1)p]
2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
p-th round: b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
---
where N is the number of buffers, p is the number of partitions, and
n = N / p.
It would be extremely unbalanced if the buffers were divided as follows.
---
1st round: b[0], b[1], b[2], … b[n-1]
2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
…
p-th round: b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
---
I'm afraid that I may be missing the point, but regarding
2. Assign files to one of N batches so we can make N roughly equal sized
mini-checkpoints
splitting buffers along file boundaries makes the FPW-related processing
(in xlog.c and xloginsert.c) intolerably complicated, as 'partitioned
checkpointing' is strongly tied to the decision of whether a given buffer
needs a FPW at the time of inserting an xlog record.
# 'partition id = buffer id % number of partitions' is fairly simple.
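The interleaved split and the modulo mapping can be sketched together (a
standalone illustration, not the patch code):

```python
def partition_rounds(n_buffers, p):
    """Round r syncs b[r], b[r+p], b[r+2p], ... i.e. exactly the
    buffers with buffer_id % p == r, so consecutive buffer ids, which
    are unlikely to be correlated, spread evenly across the p rounds."""
    return [list(range(r, n_buffers, p)) for r in range(p)]

rounds = partition_rounds(16, 4)
print(rounds[0])  # [0, 4, 8, 12]
# Every buffer appears in exactly one round, and the partition of a
# buffer is simply buffer_id % p.
```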
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Hello Andres,
Thank you for the discussion. It's nice for me to discuss this here.
Partitioned checkpointing has the significant disadvantage that it
increases random write I/O by the number of passes. Which is a bad idea,
*especially* on SSDs.
I'm curious about the conclusion that partitioned checkpointing increases
random write I/O by the number of passes.
The order of buffer syncs in the original checkpointing is as follows:
b[0], b[1], … , b[N-1]
where b[i] means buffer[i] and N is the number of buffers.
(To make the story simple, the starting point of the buffer sync,
determined by the statement 'buf_id = StrategySyncStart(NULL, NULL);' in
BufferSync(), is ignored here. If it is important to this discussion,
please point it out.)
While that of partitioned checkpointing is as follows:
1st round: b[0], b[p], b[2p], … b[(n-1)p]
2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
last round: b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
where p is the number of partitions and n = N / p.
I think the important point here is that partitioned checkpointing does not
change (increase) the total number of buffer writes. I wonder why the
sequence b[0], b[1], … , b[N-1] would be less random than b[0], b[p],
b[2p], … b[(n-1)p]. I think there is no relationship between neighboring
buffers, like b[0] and b[1]. Is this wrong?
Also, I believe that random 'page' writes are not harmful for SSDs. (Buffer
sync is carried out in units of 8-kilobyte pages.) What is harmful for SSDs
is the partial write (a write smaller than the page size), because it
increases the write amplification of the SSD, shortening its lifetime. On
the other hand, IIRC, random 'page' writes do not increase the write
amplification; the wear-leveling algorithm of the SSD should handle random
'page' writes effectively.
I think it's likely that the patch will have only a very small effect if
applied ontop of Fabien's patch (which'll require some massaging I'm
sure).
That may be true or not; who knows?
I think only detailed experiments will tell the truth.
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Andres Freund
Sent: Saturday, September 12, 2015 1:30 AM
To: Tomas Vondra
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitioned checkpointingHi,
Partitioned checkpoint have the significant disadvantage that it increases
random write io by the number of passes. Which is a bad idea,
*especially* on SSDs.So we'd need logic like this
1. Run through shared buffers and analyze the files contained in
there 2. Assign files to one of N batches so we can make N roughly
equal sized mini-checkpoints 3. Make N passes through shared buffers,
writing out files assigned to each batch as we goThat's essentially what Fabien's sorting patch does by sorting all writes.
What I think might work better is actually keeping the write/fsync
phases we have now, but instead of postponing the fsyncs until the
next checkpoint we might spread them after the writes. So with
target=0.5 we'd do the writes in the first half, then the fsyncs in
the other half. Of course, we should sort the data like you propose,
and issue the fsyncs in the same order (so that the OS has time to writethem to the devices).
I think the approach in Fabien's patch of enforcing that there's not very
much dirty data to flush by forcing early cache flushes is better. Having
gigabytes worth of dirty data in the OS page cache can have massive
negative
impact completely independent of fsyncs.
I wonder how much the original paper (written in 1996) is effectively
obsoleted by spread checkpoints, but the benchmark results posted by
Horikawa-san suggest there's a possible gain. But perhaps partitioning
the checkpoints is not the best approach?
I think it's likely that the patch will have only a very small effect if
applied on top of Fabien's patch (which'll require some massaging I'm
sure).
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello Fabien,
I wrote:
A guc parameter named 'checkpoint_partitions' is added.
This can be set to 1, 2, 4, 8, or 16.
Default is 16. (This is a trivial detail at present, I think.)
I've noticed that the behavior with 'checkpoint_partitions = 1' is not the
same as that of the original 9.5alpha2.
The attached 'partitioned-checkpointing-v3.patch' fixes the bug, so please
use it.
I'm sorry for that.
PS. The result graphs I sent were obtained using the original 9.5alpha2, and
thus those results are unaffected by this bug.
As to the 'partitioned checkpointing' case, the results shown in that graph
are probably worse than what the bug-fixed version would give.
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Takashi Horikawa
Sent: Saturday, September 12, 2015 11:50 AM
To: Fabien COELHO
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitioned checkpointing
Hello Fabien,
I wanted to do some tests with this POC patch. For this purpose, it would
be nice to have a guc which would allow this feature to be activated or not.
Thanks.
Could you provide a patch with such a guc? I would suggest to have the
number of partitions as a guc, so that choosing 1 would basically reflect
the current behavior.
Sure.
Please find the attached patch.
A guc parameter named 'checkpoint_partitions' is added.
This can be set to 1, 2, 4, 8, or 16.
Default is 16. (This is a trivial detail at present, I think.)
And some definitions are moved from bufmgr.h to xlog.h.
That change should also be trivial.
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
Hi,
I wrote:
The original purpose is to mitigate full-page-write rush that occurs at
immediately after the beginning of each checkpoint.
The amount of FPW at each checkpoint is reduced to 1/16 by the
'Partitioned checkpointing.'
Let me show another set of measurement results that clearly illustrate this point. Please find DBT2-sync.jpg and DBT2-sync-FPWoff.jpg.
At first, I noticed the performance dip due to checkpointing when I conducted some performance measurements using DBT-2, which implements transactions based on the TPC-C specification. As can be seen in DBT2-sync.jpg, the original 9.5alpha2 showed sharp periodic dips in throughput.
The point here is that I identified that those dips were caused by the full-page-write rush that occurs immediately after the beginning of each checkpoint. As shown in DBT2-sync-FPWoff.jpg, those dips were eliminated when the GUC parameter 'full_page_writes' was set to 'off.' This also indicates that the existing mechanism of spreading buffer sync operations over time worked effectively. As the only difference between the original 9.5alpha2 runs in DBT2-sync.jpg and DBT2-sync-FPWoff.jpg was the setting of 'full_page_writes,' those dips can be attributed to full-page writes as a corollary.
The 'Partitioned checkpointing' was implemented to mitigate the dips by spreading full-page writes over time, and it worked exactly as designed (see DBT2-sync.jpg). It also produced a good effect for pgbench, so I posted an article with a Partitioned-checkpointing.patch to this mailing list.
As to pgbench, however, I have found that full-page writes did not cause the performance dips, because the dips also occurred when 'full_page_writes' was set to 'off.' So, honestly, I do not know exactly why 'Partitioned checkpointing' mitigated the dips in the pgbench executions.
However, it is certain that there are some workloads for PostgreSQL, other than pgbench, in which the full-page-write rush causes performance dips and 'Partitioned checkpointing' is effective in eliminating (or mitigating) them; DBT-2 is an example.
It is also worth studying why 'Partitioned checkpointing' is effective in pgbench executions; doing so may lead to devising better methods.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Takashi Horikawa
Sent: Saturday, September 12, 2015 12:50 PM
To: Simon Riggs; Fabien COELHO
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitioned checkpointing
Hi,
I understand that what this patch does is cutting the checkpoint
of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
its own wal-log entry, pacing, fsync and so on.
Right.
However,
The key point is that we spread out the fsyncs across the whole checkpoint
period.
this is not the key point of the 'partitioned checkpointing,' I think.
The original purpose is to mitigate the full-page-write rush that occurs
immediately after the beginning of each checkpoint.
The amount of FPW at each checkpoint is reduced to 1/16 by the
'Partitioned checkpointing.'
This method interacts with the current proposal to improve the
checkpointer behavior by avoiding random I/Os, but it could be combined.
I agree.
Splitting with N=16 does nothing to guarantee the partitions are equally
sized, so there would likely be an imbalance that would reduce the
effectiveness of the patch.
That may be right.
However, the current method was designed to split the buffers so as to
balance the load as equally as possible;
the current patch splits the buffers as
---
1st round: b[0], b[p], b[2p], … b[(n-1)p]
2nd round: b[1], b[p+1], b[2p+1], … b[(n-1)p+1]
…
p-th round: b[p-1], b[p+(p-1)], b[2p+(p-1)], … b[(n-1)p+(p-1)]
---
where N is the number of buffers,
p is the number of partitions, and n = (N / p).
It would be extremely unbalanced if the buffers were divided as follows:
---
1st round: b[0], b[1], b[2], … b[n-1]
2nd round: b[n], b[n+1], b[n+2], … b[2n-1]
…
p-th round: b[(p-1)n], b[(p-1)n+1], b[(p-1)n+2], … b[(p-1)n+(n-1)]
---
I'm afraid that I may be missing the point, but
2.
Assign files to one of N batches so we can make N roughly equal sized
mini-checkpoints
Splitting buffers along file boundaries makes the FPW-related processing
(in xlog.c and xloginsert.c) intolerably complicated, as 'Partitioned
checkpointing' is strongly tied to the decision of whether a given buffer
needs a full-page write at the time of inserting the xlog record.
# 'partition id = buffer id % number of partitions' is fairly simple.
Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Friday, September 11, 2015 10:57 PM
To: Fabien COELHO
Cc: Horikawa Takashi(堀川 隆); pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitioned checkpointing
On 11 September 2015 at 09:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Some general comments :
Thanks for the summary Fabien.
I understand that what this patch does is cutting the checkpoint
of buffers in 16 partitions, each addressing 1/16 of buffers, and each with
its own wal-log entry, pacing, fsync and so on.
I'm not sure why it would be much better, although I agree that
it may have some small positive influence on performance, but I'm afraid
it may also degrade performance in some conditions. So I think that maybe
a better understanding of why there is a better performance and focus on
that could help obtain a more systematic gain.
I think it's a good idea to partition the checkpoint, but not by doing it
this way.
Splitting with N=16 does nothing to guarantee the partitions are equally
sized, so there would likely be an imbalance that would reduce the
effectiveness of the patch.
This method interacts with the current proposal to improve the
checkpointer behavior by avoiding random I/Os, but it could be combined.
I'm wondering whether the benefits you see are linked to the file
flushing behavior induced by fsyncing more often, in which case it is quite
close to the "flushing" part of the current "checkpoint continuous flushing"
patch, and could be redundant/less efficient than what is done there,
especially as tests have shown that the effect of flushing is *much* better
on sorted buffers.
Another proposal around, suggested by Andres Freund I think, is
that checkpoint could fsync files while checkpointing and not wait for the
end of the checkpoint. I think that it may also be one of the reasons why
your patch does bring a benefit, but Andres' approach would be more
systematic, because there would be no need to fsync files several times
(basically your patch issues 16 fsyncs per file). This suggests that the
"partitioning" should be done at a lower level, from within
CheckPointBuffers, which would take care of fsyncing files some time after
writing buffers to them is finished.
The idea to do a partial pass through shared buffers and only write a
fraction of the dirty buffers, then fsync them, is a good one.
The key point is that we spread out the fsyncs across the whole checkpoint
period.
I think we should be writing out all buffers for a particular file in one
pass, then issue one fsync per file. >1 fsync per file seems a bad idea.
So we'd need logic like this:
1. Run through shared buffers and analyze the files contained in there.
2. Assign files to one of N batches so we can make N roughly equal-sized
mini-checkpoints.
3. Make N passes through shared buffers, writing out files assigned to each
batch as we go.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Takashi-san,
I've noticed that the behavior in 'checkpoint_partitions = 1' is not the
same as that of original 9.5alpha2. Attached
'partitioned-checkpointing-v3.patch' fixed the bug, thus please use it.
I've done two sets of runs on an old box (16 GB, 8 cores, RAID1 HDD)
with "pgbench -M prepared -N -j 2 -c 4 ..." and analysed per-second traces
(-P 1) for 4 versions: sort+flush patch fully on, sort+flush patch fully
off (should be equivalent to head), partition patch v3 with 1 partition
(should also be equivalent to head), and partition patch v3 with 16
partitions.
I ran two configurations :
small:
shared_buffers = 2GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
pgbench's scale = 120, time = 4000
medium:
shared_buffers = 4GB
max_wal_size = 5GB
checkpoint_timeout = 30min
checkpoint_completion_target = 0.8
pgbench's scale = 300, time = 7500
* full speed run performance
average tps +- standard deviation (percent of under 10 tps seconds)
small medium
1. flush+sort : 751 +- 415 ( 0.2) 984 +- 500 ( 0.3)
2. no flush/sort : 188 +- 508 (83.6) 252 +- 551 (77.0)
3. 1 partition : 183 +- 518 (85.6) 232 +- 535 (78.3)
4. 16 partitions : 179 +- 462 (81.1) 196 +- 492 (80.9)
Although 2 & 3 should be equivalent, there seems to be lower performance
with 1 partition, but it is pretty close and may not be significant.
The 16-partition case seems to show significantly lower tps performance,
especially for the medium case. The stddev is a little bit better for the
small case, as suggested by the lower off-line figure, but relatively
higher for the medium case (stddev = 2.5 * average).
There is no comparison with flush & sort activated.
* throttled performance (-R 100/200 -L 100)
percent of late transactions - above 100 ms or not even started as the
system is much too behind schedule.
small-100 small-200 medium-100
1. flush+sort : 1.0 1.9 1.9
2. no flush/sort : 31.5 49.8 27.1
3. 1 partition : 32.3 49.0 27.0
4. 16 partitions : 32.9 48.0 31.5
2 & 3 seem pretty equivalent, as expected. The 16 partitions seem to
slightly degrade availability on average. Yet again, no comparison with
flush & sort activated.
From these runs, I would advise against applying the checkpoint
partitioning patch: there is no consistent benefit on the basic hardware
I'm using in this test. I think that makes sense, because fsyncing
random I/O several times instead of once has little impact.
Now, once the I/O is not random, that is with some kind of combined patch,
this is another question. I would rather go with Andres' suggestion to
fsync once per file, when writing to a file is completed, because
partitioning as such would reduce the effectiveness of sorting buffers.
I think that it would be interesting if you could test the sort/flush
patch on the same high-end system that you used for testing your partition
patch.
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
I wrote:
... So, honestly, I do not exactly know
why 'Partitioned checkpointing' mitigated the dips in pgbench executions.
I have conducted some experiments to clarify why checkpointing brings on a performance dip and why partitioned checkpointing mitigated the dip in my experimental setup.
Some merely retraced past knowledge, while the others, I believe, yielded new knowledge.
Please find some result graphs, default.JPG, no-flushd.JPG, dirty64M.JPG, dirty512M.JPG, and file_sync_range.JPG,
which were obtained with varying some kernel parameters to change the behavior of the flush daemon.
At first, it became clear that the performance dip problem is related to the flush daemon, as has conventionally been said.
When the flush daemon activity was disabled (no-flushd.JPG) during the benchmark execution, performance dips occurred at the end of the checkpoint period.
The time variation of the throughput for this case is quite different from that for the default setting (default.JPG).
When using the partitioned checkpointing, I think that the system had the same kind of behavior, but at one-Nth the scale, where N is the number of partitions.
As the checkpoint interval was almost the same as dirty_expire_centisecs (30 seconds), there would be no chance for the flush daemon to work.
Since the flush daemon plays an important role in the performance dip, I changed the kernel parameters dirty_bytes and dirty_background_bytes based on the conventional knowledge.
In my experimental setup, 512MB, which some blogs recommend setting in dirty_bytes, was not small enough.
When 512MB and 256MB were set in dirty_bytes and dirty_background_bytes respectively (dirty512M.JPG), there was no visible improvement over the default setting (default.JPG).
A meaningful improvement was observed when 64MB and 32MB were set in those parameters (dirty64M.JPG).
In a situation in which small values are set in dirty_bytes and dirty_background_bytes, a buffer is likely to be stored on the HDD immediately after the buffer is written to the kernel by the checkpointer.
Thus, I tried a quick hack to make the checkpointer invoke the write system call for a dirty buffer and immediately follow it with a store operation for the buffer, implemented with the sync_file_range() system call.
# For reference, I attach the patch.
As shown in file_sync_range.JPG, this strategy is considered to have been effective.
In conclusion, as far as pgbench execution on Linux is concerned, using sync_file_range() is a promising solution.
That is, the checkpointer invokes sync_file_range() to store a buffer immediately after it writes the buffer to the kernel.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories
-----Original Message-----
From: Horikawa Takashi(堀川 隆)
Sent: Monday, September 14, 2015 6:43 PM
To: Simon Riggs; Fabien COELHO
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Partitioned checkpointing
Attachments:
default.JPG (image/jpeg)
no-flushd.JPG (image/jpeg)
dirty512M.JPG (image/jpeg)
dirty64M.JPG (image/jpeg)
file_sync_range.JPG (image/jpeg)
using-sync_file_range-poc.patch (application/octet-stream)
Hello,
These are interesting runs.
In a situation in which small values are set in dirty_bytes and
dirty_background_bytes, a buffer is likely to be stored on the HDD
immediately after the buffer is written to the kernel by the checkpointer.
Thus, I tried a quick hack to make the checkpointer invoke the write system
call for a dirty buffer and immediately follow it with a store operation
for the buffer, implemented with the sync_file_range() system call. # For
reference, I attach the patch. As shown in file_sync_range.JPG, this
strategy is considered to have been effective.
Indeed. This approach is part of this current patch:
https://commitfest.postgresql.org/6/260/
Basically, what you do is call sync_file_range on each block, and you
tested on a high-end system, probably with a lot of BBU disk cache, which I
guess allows the disk to reorder writes so as to benefit from sequential
write performance.
In conclusion, as far as pgbench execution on Linux is concerned,
using sync_file_range() is a promising solution.
I found that calling sync_file_range for every block could degrade
performance a bit under some conditions, at least on my low-end systems
(just a [raid] disk, no significant disk cache in front of it), so the
above patch aggregates neighboring writes so as to issue fewer
sync_file_range calls.
That is, the checkpointer invokes sync_file_range() to store a buffer
immediately after it writes the buffer in the kernel.
Yep. It is interesting that sync_file_range alone improves stability a lot
on your high-end system, although sorting is mandatory for low-end
systems.
My interpretation, already stated above, is that the hardware does the
sorting on the cached data at the disk level in your system.
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers