Improvement of checkpoint IO scheduler for stable transaction responses
Hi,
I have created a patch that improves the checkpoint IO scheduler to give more
stable transaction response times.
* Problem with checkpoint IO scheduling under heavy transaction load
Under heavy transaction load, I think the PostgreSQL checkpoint scheduler has
two problems, one at the start and one at the end of a checkpoint. The first
problem is an IO burst at the start of each checkpoint round. It is caused by
full-page writes: once the checkpoint begins, the first modification of each
page generates a full-page image in WAL, which produces a burst of WAL IO. The
WAL-based checkpoint scheduler then wrongly judges that the checkpoint is
behind schedule because of this full-page-write WAL volume, even though it is
not actually late, and the resulting write rush hurts transaction response
times. In other words, the WAL-based scheduler is not appropriate at the start
of a checkpoint. The second problem is the fsync freeze at the end of the
checkpoint. Normally checkpoint writes are flushed in the background by the
OS's IO scheduler, but when that does not work well, the fsyncs at the end of
the checkpoint cause an IO freeze and slow transactions. Unexpectedly slow
transactions can trigger monitoring errors in an HA cluster and degrade the
user experience of the application. This is an especially serious problem for
databases on cloud or virtual servers, which have limited IO performance.
However, postgresql.conf offers few parameters to address it. We prefer fast
transaction responses over short checkpoints; in practice checkpoints are
short, and making them somewhat longer is not a problem. One might think that
setting checkpoint_segments and checkpoint_timeout to larger values would
help, but a large checkpoint_segments wastes file cache on WAL that is never
read back, and a large checkpoint_timeout leads to long crash recovery.
* Improvement of the checkpoint IO scheduler
1. Mitigating the full-page-write IO burst at the start of a checkpoint
My idea is very simple: at the start of a checkpoint, the
checkpoint_completion_target schedule is loosened. I add three parameters for
this: 'checkpoint_smooth_target', 'checkpoint_smooth_margin' and
'checkpointer_write_delay'. 'checkpoint_smooth_target' is the point in
checkpoint progress up to which the IO schedule is smoothed.
'checkpoint_smooth_margin' makes the schedule even smoother; it is a heuristic
parameter, but it mitigates the problem effectively.
'checkpointer_write_delay' is the sleep time used by the checkpoint scheduler,
roughly corresponding to 'bgwriter_delay' in PG 9.1 and older.
For more detail, please see the attached patch.
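For illustration, here is a minimal standalone sketch (not the patch itself) of
how IsCheckpointOnSchedule() scales progress with these parameters; the values
used for the settings are only assumptions for the example, not recommendations.

/*
 * Minimal standalone sketch of the smooth-target progress scaling used in
 * the patch's IsCheckpointOnSchedule(). The settings below are assumed
 * example values, not recommended ones.
 */
#include <stdio.h>

static const double completion_target = 0.7;  /* checkpoint_completion_target */
static const double smooth_target     = 0.3;  /* checkpoint_smooth_target (assumed) */
static const double smooth_margin     = 0.1;  /* checkpoint_smooth_margin (assumed) */

/* Scale raw checkpoint progress (0.0 - 1.0) into the scheduling target. */
static double
scale_progress(double progress)
{
    if (progress >= smooth_target)
        return progress * completion_target;   /* normal schedule */

    /*
     * Smooth schedule: the target is looser right after the checkpoint
     * starts and converges to the normal schedule as progress approaches
     * smooth_target, so the full-page-write WAL burst does not make the
     * scheduler think it is behind.
     */
    return progress *
        (((smooth_target - progress) / smooth_target) *
         (smooth_margin + 1.0 - completion_target) + completion_target);
}

int
main(void)
{
    double p;

    for (p = 0.0; p <= 1.0001; p += 0.1)
        printf("progress %.1f -> scheduling target %.3f\n", p, scale_progress(p));
    return 0;
}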
2. Mitigating the fsync freeze at the end of a checkpoint
When the fsync freeze happens, issuing fsyncs back to back is pointless and
stalls transactions. So, if an fsync takes a long time, the IO queue is
flooded and we should yield IO priority to transactions to keep response
times fast. I realize this by inserting a sleep after each fsync whose
execution time was long, as shown in the sketch below. This may seem to
lengthen the checkpoint, but not by much: when an fsync is slow, the IO queue
is already packed with other IO, including the checkpoint writes themselves,
so the sleep merely yields IO priority to the concurrently executing
transactions.
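A minimal standalone sketch of that throttling idea follows (it is not the
actual PostgreSQL code); the threshold and ratio values are assumptions chosen
only for the example.

/*
 * Sketch of the fsync throttling idea: after a slow file fsync, sleep in
 * proportion to how long the fsync took, so that concurrent transactions
 * get the IO bandwidth. Threshold and ratio values are assumptions.
 */
#include <stdio.h>
#include <unistd.h>

static const long   delay_threshold_ms = 100;  /* like checkpointer_fsync_delay_threshold */
static const double delay_ratio        = 0.5;  /* like checkpointer_fsync_delay_ratio */

/* After an fsync that took elapsed_ms, optionally yield IO to backends. */
static void
throttle_after_fsync(long elapsed_ms)
{
    if (delay_threshold_ms >= 0 && elapsed_ms > delay_threshold_ms)
    {
        usleep((useconds_t) (elapsed_ms * delay_ratio) * 1000);
        printf("slept %.0f ms after a %ld ms fsync\n",
               elapsed_ms * delay_ratio, elapsed_ms);
    }
}

int
main(void)
{
    throttle_after_fsync(50);    /* fast fsync: no throttling */
    throttle_after_fsync(800);   /* slow fsync: sleep 400 ms */
    return 0;
}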
I tested my patch with the DBT-2 benchmark; please see the test results below.
The patched server achieves higher throughput and faster responses than plain
PG. Checkpoints take a little longer than with plain PG, but not seriously so.
* Results of DBT-2 with this patch (compared with original PG 9.2.4)
I used the DBT-2 benchmark software from OSDL, together with pg_statsinfo and
pg_stats_reporter.
- Patched PG (patched 9.2.4)
DBT-2 result: http://goo.gl/1PD3l
statsinfo report: http://goo.gl/UlGAO
settings: http://goo.gl/X4Whu
- Original PG (9.2.4)
DBT-2 result: http://goo.gl/XVxtj
statsinfo report: http://goo.gl/UT1Li
settings: http://goo.gl/eofmb
The overall measurement value improves by 4%, 'new-order 90%tile' by 20%,
'new-order average' by 18%, 'new-order deviation' by 24%, and
'new-order maximum' by 27%. In the pg_stats_reporter report I can confirm high
throughput and high WAL IO while the checkpoint is executing. My patch gives
fast transaction responses and transactions that are not blocked by the
checkpoint.
The downside of my patch is longer checkpoints: checkpoint time increased by
about 10% - 20%. However, checkpoints still complete correctly on schedule
within checkpoint_timeout. Please see the checkpoint result
(http://goo.gl/NsbC6).
* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
This is not meant as an advertisement for pg_statsinfo and pg_stats_reporter
:-) They are free software. If you have comments or other ideas about my
patch, please send them to me.
Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
improvement_checkpoint_io-scheduler_v0.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..a66ce36 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,16 +141,21 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/*
* GUC parameters
*/
+int CheckPointerWriteDelay = 200;
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointSmoothTarget = 0.0;
+double CheckPointSmoothMargin = 0.0;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -169,7 +174,6 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +647,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
@@ -715,7 +719,7 @@ CheckpointWriteDelay(int flags, double progress)
* Checkpointer and bgwriter are no longer related so take the Big
* Sleep.
*/
- pg_usleep(100000L);
+ pg_usleep(CheckPointerWriteDelay * 1000L);
}
else if (--absorb_counter <= 0)
{
@@ -742,14 +746,35 @@ IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
struct timeval now;
- double elapsed_xlogs,
+ double original_progress,
+ elapsed_xlogs,
elapsed_time;
Assert(ckpt_active);
- /* Scale progress according to checkpoint_completion_target. */
- progress *= CheckPointCompletionTarget;
-
+ /* This variable is used by smooth checkpoint schedule.*/
+ original_progress = progress * CheckPointCompletionTarget;
+
+ /* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+ if(progress >= CheckPointSmoothTarget)
+ {
+ /* Normal checkpoint schedule. */
+ progress *= CheckPointCompletionTarget;
+ }
+ else
+ {
+ /* Smooth checkpoint schedule.
+ *
+ * At the start of a checkpoint, the IO load average tends to be high
+ * and executing transactions slow down. This schedule reduces that load
+ * and improves IO response. As 'progress' approaches CheckPointSmoothTarget,
+ * the schedule converges to the normal checkpoint schedule. For an even
+ * smoother checkpoint schedule, set CheckPointSmoothTarget higher.
+ */
+ progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+ (CheckPointSmoothMargin + 1 - CheckPointCompletionTarget)
+ + CheckPointCompletionTarget;
+ }
/*
* Check against the cached value first. Only do the more expensive
* calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_xlogs;
return false;
}
+ else if (original_progress < elapsed_xlogs)
+ {
+ ckpt_cached_elapsed = elapsed_xlogs;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
}
/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_time;
return false;
}
+ else if (original_progress < elapsed_time)
+ {
+ ckpt_cached_elapsed = elapsed_time;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
/* It looks like we're on schedule. */
return true;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..e558eb7 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -162,6 +163,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -1171,6 +1174,18 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync took a long time, sleep for 'fsync-time * checkpointer_fsync_delay_ratio'
+ * to give priority to executing transactions.
+ */
+ if( CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold)){
+ pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..f3fa5ab 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,30 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerWriteDelay,
+ 200, 10, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2575,36 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpoint_smooth_target", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("Smooth control IO load between starting checkpoint and this target parameter in progress of checkpoint."),
+ NULL
+ },
+ &CheckPointSmoothTarget,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_smooth_margin", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("More smooth control IO load between starting checkpoint and checkpoint_smooth_target."),
+ NULL
+ },
+ &CheckPointSmoothMargin,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..9c07bd8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -185,7 +185,12 @@
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_smooth_target = 0.0 # smooth checkpoint target, 0.0 - 1.0
+#checkpoint_smooth_margin = 0.0 # smooth checkpoint margin, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_write_delay = 200ms # 10-10000 milliseconds
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..5964b99 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,9 +21,14 @@
/* GUC options */
extern int BgWriterDelay;
+extern int CheckPointerWriteDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointSmoothTarget;
+extern double CheckPointSmoothMargin;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +36,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
On 10 June 2013 11:51, KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp> wrote:
I have created a patch that improves the checkpoint IO scheduler to give more
stable transaction response times.
Looks like good results, with good measurements. Should be an
interesting discussion.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 June 2013 11:51, KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp> wrote:
I have created a patch that improves the checkpoint IO scheduler to give more
stable transaction response times.
Looks like good results, with good measurements. Should be an
interesting discussion.
+1.
I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs. Also, I think this is
probably two separate patches, in the end. But the direction seems
good to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(2013/06/12 23:07), Robert Haas wrote:
On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 June 2013 11:51, KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp> wrote:
I have created a patch that improves the checkpoint IO scheduler to give more
stable transaction response times.
Looks like good results, with good measurements. Should be an
interesting discussion.
+1.
I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs. Also, I think this is
probably two separate patches, in the end. But the direction seems
good to me.
Thank you for the comments!
I have separated my patch into a checkpoint-write part and a checkpoint-fsync
part. As you say, my patch adds a lot of new GUCs. I don't think they can all
be decided automatically; it is difficult for one checkpoint scheduler to suit
every environment, such as virtual servers, public cloud servers, embedded
servers, and so on. For that reason, the default parameter settings behave the
same as before. Setting the parameters is primitive and difficult, but if they
are set correctly they suit many environments without causing unintended
behaviour.
I will think about a version with fewer GUCs. If you have a good idea, please
discuss it here!
Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
Improvement_of_checkpoint_io-scheduler_in_write_v1.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..0c0f215 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,9 +141,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/*
* GUC parameters
*/
+int CheckPointerWriteDelay = 200;
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
double CheckPointCompletionTarget = 0.5;
+double CheckPointSmoothTarget = 0.0;
+double CheckPointSmoothMargin = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
@@ -715,7 +718,7 @@ CheckpointWriteDelay(int flags, double progress)
* Checkpointer and bgwriter are no longer related so take the Big
* Sleep.
*/
- pg_usleep(100000L);
+ pg_usleep(CheckPointerWriteDelay * 1000L);
}
else if (--absorb_counter <= 0)
{
@@ -742,14 +745,36 @@ IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
struct timeval now;
- double elapsed_xlogs,
+ double original_progress,
+ elapsed_xlogs,
elapsed_time;
Assert(ckpt_active);
- /* Scale progress according to checkpoint_completion_target. */
- progress *= CheckPointCompletionTarget;
+ /* This variable is used by smooth checkpoint schedule.*/
+ original_progress = progress * CheckPointCompletionTarget;
+ /* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+ if(progress >= CheckPointSmoothTarget)
+ {
+ /* Normal checkpoint schedule. */
+ progress *= CheckPointCompletionTarget;
+ }
+ else
+ {
+ /*
+ * Smooth checkpoint schedule.
+ *
+ * At the start of a checkpoint, the IO load average tends to be high
+ * and executing transactions slow down. This schedule reduces that load
+ * and improves IO response. As 'progress' approaches CheckPointSmoothTarget,
+ * the schedule converges to the normal checkpoint schedule. For an even
+ * smoother checkpoint schedule, set CheckPointSmoothTarget higher.
+ */
+ progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+ (CheckPointSmoothMargin + 1 - CheckPointCompletionTarget) +
+ CheckPointCompletionTarget;
+ }
/*
* Check against the cached value first. Only do the more expensive
* calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_xlogs;
return false;
}
+ else if (original_progress < elapsed_xlogs)
+ {
+ ckpt_cached_elapsed = elapsed_xlogs;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
}
/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_time;
return false;
}
+ else if (original_progress < elapsed_time)
+ {
+ ckpt_cached_elapsed = elapsed_time;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
/* It looks like we're on schedule. */
return true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..d41dc17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerWriteDelay,
+ 200, 10, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,26 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpoint_smooth_target", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("Smooth control IO load between starting checkpoint and this target parameter in progress of checkpoint."),
+ NULL
+ },
+ &CheckPointSmoothTarget,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_smooth_margin", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("More smooth control IO load between starting checkpoint and checkpoint_smooth_target."),
+ NULL
+ },
+ &CheckPointSmoothMargin,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..b4d83f2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -185,7 +185,10 @@
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_smooth_target = 0.0 # smooth checkpoint target, 0.0 - 1.0
+#checkpoint_smooth_margin = 0.0 # smooth checkpoint margin, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_write_delay = 200ms # 10-10000 milliseconds
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..8a441bc 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,9 +21,12 @@
/* GUC options */
extern int BgWriterDelay;
+extern int CheckPointerWriteDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
extern double CheckPointCompletionTarget;
+extern double CheckPointSmoothTarget;
+extern double CheckPointSmoothMargin;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
Improvement_of_checkpoint_io-scheduler_in_fsynci_v1.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..2b223e9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
*/
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -169,7 +171,6 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +644,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..99dac53 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -162,6 +163,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -1171,6 +1174,20 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync took a long time, sleep for 'fsync-time * checkpointer_fsync_delay_ratio'
+ * to give priority to executing transactions.
+ */
+ if( CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+ {
+ pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..74051cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..a02ba1f 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
On 10.06.2013 13:51, KONDO Mitsumasa wrote:
I have created a patch that improves the checkpoint IO scheduler to give
more stable transaction response times.
* Problem with checkpoint IO scheduling under heavy transaction load
Under heavy transaction load, I think the PostgreSQL checkpoint scheduler
has two problems, one at the start and one at the end of a checkpoint. The
first problem is an IO burst at the start of each checkpoint round. It is
caused by full-page writes: once the checkpoint begins, the first
modification of each page generates a full-page image in WAL, which
produces a burst of WAL IO. The WAL-based checkpoint scheduler then wrongly
judges that the checkpoint is behind schedule because of this
full-page-write WAL volume, even though it is not actually late, and the
resulting write rush hurts transaction response times. In other words, the
WAL-based scheduler is not appropriate at the start of a checkpoint.
Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images. That's an interesting
phenomenon, but did you actually see that causing a problem in your
tests? I couldn't tell from the results you posted what the impact of
that was. Could you repeat the tests separately with the two separate
patches you posted later in this thread?
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.
Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.
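A hypothetical sketch of that relaxed rule; the struct, fields and helper
below are illustration-only stand-ins, not PostgreSQL's actual WAL or
buffer-manager code.

#include <stdio.h>

typedef struct
{
    unsigned long page_lsn;           /* LSN of the last WAL record for this page */
    int           flushed_this_ckpt;  /* already written out by the checkpointer? */
} BufferState;

static int
need_full_page_image(const BufferState *buf, unsigned long redo_ptr, int relaxed)
{
    if (buf->page_lsn > redo_ptr)
        return 0;   /* page already got a full-page image this checkpoint cycle */
    if (relaxed && !buf->flushed_this_ckpt)
        return 0;   /* the checkpointer will write and fsync this buffer anyway */
    return 1;       /* conservative rule: first change after the redo pointer */
}

int
main(void)
{
    BufferState buf = { 90, 0 };   /* last changed before the redo pointer, not yet flushed */

    printf("conservative rule: %d\n", need_full_page_image(&buf, 100, 0));  /* 1 */
    printf("relaxed rule:      %d\n", need_full_page_image(&buf, 100, 1));  /* 0 */
    return 0;
}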
The second problem is the fsync freeze at the end of the checkpoint.
Normally checkpoint writes are flushed in the background by the OS's IO
scheduler, but when that does not work well, the fsyncs at the end of the
checkpoint cause an IO freeze and slow transactions. Unexpectedly slow
transactions can trigger monitoring errors in an HA cluster and degrade
the user experience of the application. This is an especially serious
problem for databases on cloud or virtual servers, which have limited IO
performance. However, postgresql.conf offers few parameters to address it.
We prefer fast transaction responses over short checkpoints; in practice
checkpoints are short, and making them somewhat longer is not a problem.
One might think that setting checkpoint_segments and checkpoint_timeout
to larger values would help, but a large checkpoint_segments wastes file
cache on WAL that is never read back, and a large checkpoint_timeout
leads to long crash recovery.
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.
Apart from the direct performance impact of that patch, sorting the
writes would allow us to interleave the fsyncs with the writes. You
would write out all buffers for relation A, then fsync it, then all
buffers for relation B, then fsync it, and so forth. That would
naturally spread out the fsyncs.
If we don't mind scanning the buffer cache several times, we don't
necessarily even need to sort the writes for that. Just scan the buffer
cache for all buffers belonging to relation A, then fsync it. Then scan
the buffer cache again, for all buffers belonging to relation B, then
fsync that, and so forth.
The downside of my patch is longer checkpoints: checkpoint time increased
by about 10% - 20%. However, checkpoints still complete correctly on
schedule within checkpoint_timeout. Please see the checkpoint result
(http://goo.gl/NsbC6).
For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.
- Heikki
On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:
Another thought is that rather than trying to compensate for that effect in
the checkpoint scheduler, could we avoid the sudden rush of full-page images
in the first place? The current rule for when to write a full page image is
conservative: you don't actually need to write a full page image when you
modify a buffer that's sitting in the buffer cache, if that buffer hasn't
been flushed to disk by the checkpointer yet, because the checkpointer will
write and fsync it later. I'm not sure how much it would smoothen WAL write
I/O, but it would be interesting to try.
Hm. Could you elaborate why that wouldn't open new hazards? I don't see
how that could be safe against crashes in some places. It seems to me
we could end up replaying records like heap_insert or similar onto pages
that are still torn?
A long time ago, Itagaki wrote a patch to sort the checkpoint writes: www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a
large array, which has the risk of running out of memory, which would be bad
when you're trying to checkpoint.
Hm. We could allocate the array early on since the number of buffers
doesn't change. Sure that would be pessimistic, but that seems fine.
Alternatively I can very well imagine that it would still be beneficial
to sort the dirty buffers in shared buffers. I.e. scan till we found 50k
dirty pages, sort them and only then write them out.
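A small standalone sketch of that chunked approach, using made-up types rather
than PostgreSQL's real buffer tags: collect a chunk of dirty-page tags, sort
the chunk so writes hit each file sequentially, write it out, then repeat.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int relation_id; int block; } BufferTag;

/* Order by relation, then by block, so each file is written sequentially. */
static int
tag_cmp(const void *a, const void *b)
{
    const BufferTag *x = a;
    const BufferTag *y = b;

    if (x->relation_id != y->relation_id)
        return x->relation_id - y->relation_id;
    return x->block - y->block;
}

int
main(void)
{
    /* one "chunk" of dirty pages found by scanning shared buffers */
    BufferTag chunk[] = { {2, 9}, {1, 3}, {2, 1}, {1, 8} };
    int n = 4, i;

    qsort(chunk, n, sizeof(BufferTag), tag_cmp);
    for (i = 0; i < n; i++)
        printf("write rel %d block %d\n", chunk[i].relation_id, chunk[i].block);
    return 0;
}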
Apart from the direct performance impact of that patch, sorting the writes
would allow us to interleave the fsyncs with the writes. You would write out
all buffers for relation A, then fsync it, then all buffers for relation B,
then fsync it, and so forth. That would naturally spread out the
fsyncs.
I personally think that optionally trying to force the pages to be
written out earlier (say, with sync_file_range) to make the actual
fsync() later on cheaper is likely to be better overall.
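A minimal, Linux-only sketch of that approach with sync_file_range(); the file
path is only a placeholder and error handling is kept minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("base/12345/16384", O_WRONLY);   /* placeholder data file */

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* ... checkpoint writes to the first 8MB of the file would happen here ... */

    /* kick off asynchronous writeback of that range without waiting for it */
    if (sync_file_range(fd, 0, 8 * 1024 * 1024, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    /* much later, at the end of the checkpoint, the real flush has little left to do */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}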
If we don't mind scanning the buffer cache several times, we don't
necessarily even need to sort the writes for that. Just scan the buffer
cache for all buffers belonging to relation A, then fsync it. Then scan the
buffer cache again, for all buffers belonging to relation B, then fsync
that, and so forth.
That would end up with quite a lot of scans on a reasonably sized
machine, not to mention those that have a million+ relations. That
doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff
we did for 9.3 to make it cheaper to drop a bunch of relations at once
by only scanning shared_buffers once.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thank you for the comments and for reviewing my patch!
(2013/06/16 23:27), Heikki Linnakangas wrote:
On 10.06.2013 13:51, KONDO Mitsumasa wrote:
I have created a patch that improves the checkpoint IO scheduler to give
more stable transaction response times.
* Problem with checkpoint IO scheduling under heavy transaction load
Under heavy transaction load, I think the PostgreSQL checkpoint scheduler
has two problems, one at the start and one at the end of a checkpoint. The
first problem is an IO burst at the start of each checkpoint round. It is
caused by full-page writes: once the checkpoint begins, the first
modification of each page generates a full-page image in WAL, which
produces a burst of WAL IO. The WAL-based checkpoint scheduler then wrongly
judges that the checkpoint is behind schedule because of this
full-page-write WAL volume, even though it is not actually late, and the
resulting write rush hurts transaction response times. In other words, the
WAL-based scheduler is not appropriate at the start of a checkpoint.
Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL
activity caused by full page images. That's an interesting phenomenon, but did
you actually see that causing a problem in your tests? I couldn't tell from the
results you posted what the impact of that was. Could you repeat the tests
separately with the two separate patches you posted later in this thread?
OK, I will test with the two separate patches. The results of my patches that
I sent previously show high WAL throughput (write_size_per_sec) and a high
transaction rate during checkpoints. Please see the following HTML reports,
which have anchor links and a 'checkpoint highlight switch' button.
* With my patched PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#wal_statistics
* Plain PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#wal_statistics
In the WAL statistics, I think the high WAL throughput at the start of the
checkpoint indicates that checkpoint IO does not disturb the IO of other
executing transactions.
Rationalizing a bit, I could even argue to myself that it's a *good* thing. At
the beginning of a checkpoint, the OS write cache should be relatively empty, as
the checkpointer hasn't done any writes yet. So it might make sense to write a
burst of pages at the beginning, to partially fill the write cache first, before
starting to throttle. But this is just handwaving - I have no idea what the
effect is in real life.
Yes, I think so. If we want to change the IO throttling, we can change the OS
parameters '/proc/sys/vm/dirty_background_ratio' or '/proc/sys/vm/dirty_ratio'.
But these parameters affect every application on the OS, so they are difficult
to change and hard to set intuitively. I also think database tuning should be
done through database parameters rather than OS parameters; it makes tuning a
server much clearer.
Another thought is that rather than trying to compensate for that effect in the
checkpoint scheduler, could we avoid the sudden rush of full-page images in the
first place? The current rule for when to write a full page image is
conservative: you don't actually need to write a full page image when you modify
a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed
to disk by the checkpointer yet, because the checkpointer will write and fsync it
later. I'm not sure how much it would smoothen WAL write I/O, but it would be
interesting to try.
That would be the ideal approach, but I don't have any concrete idea how to
implement it. It seems very difficult...
The second problem is the fsync freeze at the end of the checkpoint.
Normally checkpoint writes are flushed in the background by the OS's IO
scheduler, but when that does not work well, the fsyncs at the end of the
checkpoint cause an IO freeze and slow transactions. Unexpectedly slow
transactions can trigger monitoring errors in an HA cluster and degrade
the user experience of the application. This is an especially serious
problem for databases on cloud or virtual servers, which have limited IO
performance. However, postgresql.conf offers few parameters to address it.
We prefer fast transaction responses over short checkpoints; in practice
checkpoints are short, and making them somewhat longer is not a problem.
One might think that setting checkpoint_segments and checkpoint_timeout
to larger values would help, but a large checkpoint_segments wastes file
cache on WAL that is never read back, and a large checkpoint_timeout
leads to long crash recovery.
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a large
array, which has the risk of running out of memory, which would be bad when
you're trying to checkpoint.
Yes, we tested Itagaki's patch last year, but our test results were not good.
I think our test server's RAID controller, with a 1GB cache and 8 disks, was
too good for the patch to show a benefit; write IO is probably already
reordered and optimized by a RAID controller with a large cache.
Apart from the direct performance impact of that patch, sorting the writes would
allow us to interleave the fsyncs with the writes. You would write out all
buffers for relation A, then fsync it, then all buffers for relation B, then
fsync it, and so forth. That would naturally spread out the fsyncs.
If we don't mind scanning the buffer cache several times, we don't necessarily
even need to sort the writes for that. Just scan the buffer cache for all buffers
belonging to relation A, then fsync it. Then scan the buffer cache again, for all
buffers belonging to relation B, then fsync that, and so forth.
Yes. But I don't think an *exact* buffer sort is needed; a rough buffer sort
is enough for interleaving the fsyncs with the writes. A rough sort avoids the
computational cost Tom was worried about, and the OS IO scheduler will
optimize the writes just as it would for an exact sort. My image of a rough
buffer sort is clustering, something like k-means: if we know the distribution
of the buffers in advance, we can achieve a rough sort with less computational
complexity.
The downside of my patch is longer checkpoints: checkpoint time increased by
about 10% - 20%. However, checkpoints still complete correctly on schedule
within checkpoint_timeout. Please see the checkpoint result (http://goo.gl/NsbC6).
For a fair comparison, you should increase the checkpoint_completion_target of
the unpatched test, so that the checkpoints run for roughly the same amount of
time with and without the patch. Otherwise the benefit you're seeing could be
just because of a more lazy checkpoint.
To be understood by other contributors, I need a fairer comparison and an
objective analysis. Thanks for your advice, I will try it!
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On Mon, Jun 17, 2013 at 2:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a
large array, which has the risk of running out of memory, which would be bad
when you're trying to checkpoint.
Hm. We could allocate the array early on since the number of buffers
doesn't change. Sure that would be pessimistic, but that seems fine.
Alternatively I can very well imagine that it would still be beneficial
to sort the dirty buffers in shared buffers. I.e. scan till we found 50k
dirty pages, sort them and only then write them out.
Without knowing that Itagaki had done something similar in the past, a couple
of months back I tried exactly the same thing, i.e. sort the shared buffers
in chunks and then write them out at once. But I did not get any
significant performance gain except when the shared buffers are 3/4th (or
some such number) or more than the available RAM. I will see if I can pull
out the patch and the numbers. But if memory serves well, I concluded that
the kernel is already utilising its buffer cache to achieve the same thing
and it does not help beyond a point.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
(2013/06/17 5:48), Andres Freund wrote:
On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:
If we don't mind scanning the buffer cache several times, we don't
necessarily even need to sort the writes for that. Just scan the buffer
cache for all buffers belonging to relation A, then fsync it. Then scan the
buffer cache again, for all buffers belonging to relation B, then fsync
that, and so forth.
That would end up with quite a lot of scans on a reasonably sized
machine, not to mention those that have a million+ relations. That
doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff
we did for 9.3 to make it cheaper to drop a bunch of relations at once
by only scanning shared_buffers once.
As I wrote in my reply to Heikki, I think an exact buffer sort, which is
expensive, is unnecessary. What we need to solve this problem is just enough
sort accuracy that the OS IO scheduler can optimize the writes, and we
normally have two IO-optimizing layers: the OS layer and the RAID controller
layer. I think performance will improve if the sort is just accurate enough
for those layers to optimize. The computational cost needed is a single
sequential scan of the buffer descriptors for a rough buffer sort. I will try
to study this implementation, too.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Hi,
I have collected results for my two separate patches and for original PG.
* Results of DBT-2
              |     TPS    90%tile  Average    Maximum
-------------------------------------------------------
original_0.7  | 3474.62  18.348328    5.739  36.977713
original_1.0  | 3469.03  18.637865    5.842  41.754421
fsync         | 3525.03  13.872711    5.382  28.062947
write         | 3465.96  19.653667    5.804  40.664066
fsync + write | 3564.94  16.31922     5.1    34.530766
- 'original_*' indicates the checkpoint_completion_target setting in plain PG 9.2.4.
- In the patched builds, checkpoint_completion_target is set to 0.7.
- 'write' has the write patch applied, and 'fsync' has the fsync patch applied.
- 'fsync + write' has both patches applied.
* Investigation of the results
- A large checkpoint_completion_target, in both original PG and the write
patch, results in slower transaction latency, because slowly written pages
cause long fsync IO at the end of the checkpoint.
- The fsync patch improves latency around each file fsync. Back-to-back
fsyncs of each file cause slow latency, so it is good for latency that the
fsync stage of the checkpoint sleeps after a slow fsync.
- The fsync + write combination seems to improve TPS. I think the write
patch disturbs transactions doing full-page-write WAL writes less than
original (plain) PG does.
I will send a more detailed investigation and results next week, and I will
also take pgbench results. If you are interested in other parts of the
benchmark results or other postgres parameters, please tell me.
Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 21.06.2013 11:29, KONDO Mitsumasa wrote:
I have collected results for my two separate patches and for original PG.
* Results of DBT-2
              |     TPS    90%tile  Average    Maximum
-------------------------------------------------------
original_0.7  | 3474.62  18.348328    5.739  36.977713
original_1.0  | 3469.03  18.637865    5.842  41.754421
fsync         | 3525.03  13.872711    5.382  28.062947
write         | 3465.96  19.653667    5.804  40.664066
fsync + write | 3564.94  16.31922     5.1    34.530766
- 'original_*' indicates the checkpoint_completion_target setting in plain PG 9.2.4.
- In the patched builds, checkpoint_completion_target is set to 0.7.
- 'write' has the write patch applied, and 'fsync' has the fsync patch applied.
- 'fsync + write' has both patches applied.
* Investigation of the results
- A large checkpoint_completion_target, in both original PG and the write
patch, results in slower transaction latency, because slowly written pages
cause long fsync IO at the end of the checkpoint.
- The fsync patch improves latency around each file fsync. Back-to-back
fsyncs of each file cause slow latency, so it is good for latency that the
fsync stage of the checkpoint sleeps after a slow fsync.
- The fsync + write combination seems to improve TPS. I think the write
patch disturbs transactions doing full-page-write WAL writes less than
original (plain) PG does.
Hmm, so the write patch doesn't do much, but the fsync patch makes the
response times somewhat smoother. I'd suggest that we drop the write
patch for now, and focus on the fsyncs.
What checkpointer_fsync_delay_ratio and
checkpointer_fsync_delay_threshold settings did you use with the fsync
patch? It's disabled by default.
This is the interesting part of the patch:
@@ -1171,6 +1174,20 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync took a long time, sleep for 'fsync-time * checkpointer_fsync_delay_ratio'
+ * to give priority to executing transactions.
+ */
+ if( CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+ {
+ pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+ }
break; /* out of retry loop */
}
I'm not sure it's a good idea to sleep proportionally to the time it
took to complete the previous fsync. If you have a 1GB cache in the RAID
controller, fsyncing the a 1GB segment will fill it up. But since it
fits in cache, it will return immediately. So we proceed fsyncing other
files, until the cache is full and the fsync blocks. But once we fill up
the cache, it's likely that we're hurting concurrent queries. ISTM it
would be better to stay under that threshold, keeping the I/O system
busy, but never fill up the cache completely.
This is just a theory, though. I don't have a good grasp on how the OS
and a typical RAID controller behaves under these conditions.
I'd suggest that we just sleep for a small fixed amount of time between
every fsync, unless we're running behind the checkpoint schedule. And
for a first approximation, let's just assume that the fsync phase is e.g
10% of the whole checkpoint work.
I will send a more detailed investigation and results next week, and I will
also take pgbench results. If you are interested in other parts of the
benchmark results or other postgres parameters, please tell me.
Attached is a quick patch to implement a fixed, 100ms delay between
fsyncs, and the assumption that fsync phase is 10% of the total
checkpoint duration. I suspect 100ms is too small to have much effect,
but that happens to be what we have currently in CheckpointWriteDelay().
Could you test this patch along with yours? If you could also test with
different delays (e.g. 100ms, 500ms and 1000ms) and different ratios
between the write and fsync phases (e.g. 0.5, 0.7, 0.9), that would give an
idea of how sensitive the test case is to those settings.
- Heikki
Attachments:
fsync-delay-1.patch (text/x-diff)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a622a18 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -94,7 +94,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double progress_upto);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1207,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
* remaining flags currently have no effect here.
*/
static void
-BufferSync(int flags)
+BufferSync(int flags, double progress_upto)
{
int buf_id;
int num_to_scan;
@@ -1319,7 +1319,7 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, progress_upto * (double) num_written / num_to_write);
}
}
@@ -1825,10 +1825,10 @@ CheckPointBuffers(int flags)
{
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
- BufferSync(flags);
+ BufferSync(flags, 0.9);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ smgrsync(flags, 0.9);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..7ceec9c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -235,7 +235,7 @@ SetForwardFsyncRequests(void)
/* Perform any pending fsyncs we may have queued up, then drop table */
if (pendingOpsTable)
{
- mdsync();
+ mdsync(CHECKPOINT_IMMEDIATE, 0.0);
hash_destroy(pendingOpsTable);
}
pendingOpsTable = NULL;
@@ -974,7 +974,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
* mdsync() -- Sync previous writes to stable storage.
*/
void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
{
static bool mdsync_in_progress = false;
@@ -990,6 +990,7 @@ mdsync(void)
uint64 elapsed;
uint64 longest = 0;
uint64 total_elapsed = 0;
+ int ntoprocess;
/*
* This is only called during checkpoints, and checkpoints should only
@@ -1052,6 +1053,7 @@ mdsync(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
+ ntoprocess = hash_get_num_entries(pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
ForkNumber forknum;
@@ -1171,6 +1173,11 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * Sleep to throttle our I/O rate.
+ */
+ CheckpointWriteDelay(ckpt_flags, progress_at_begin + (1.0 - progress_at_begin) * (double) processed / ntoprocess);
+
break; /* out of retry loop */
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..ec24007 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
+ void (*smgr_sync) (int ckpt_flags, double progress_at_begin); /* may be NULL */
void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
@@ -708,14 +708,18 @@ smgrpreckpt(void)
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
{
int i;
for (i = 0; i < NSmgr; i++)
{
+ /*
+ * XXX: If we ever have more than one smgr, the remaining progress
+ * should somehow be divided among all smgrs.
+ */
if (smgrsw[i].smgr_sync)
- (*(smgrsw[i].smgr_sync)) ();
+ (*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
}
}
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
extern void mdpostckpt(void);
extern void SetForwardFsyncRequests(void);
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.
Isn't the behavior implemented by the patch a reasonable approximation
of just that? When the fsyncs start to get slow, that's when we start
to sleep. I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that. The only feedback we
have on how bad things are is how long it took the last fsync to
complete, so I actually think that's a much better way to go than any
fixed sleep - which will often be unnecessarily long on a well-behaved
system, and which will often be far too short on one that's having
trouble. I'm inclined to think Kondo-san has got it right.
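As a rough sketch of that feedback loop in C (not the actual patch; the knobs
delay_ratio and threshold_ms stand in for the proposed GUCs and are assumptions of
this example), the idea is simply to time each fsync and sleep in proportion to it
when it turned out slow:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Time one fsync and, if it was slow, yield the I/O system for a while. */
static void fsync_with_adaptive_sleep(int fd, double delay_ratio, double threshold_ms)
{
    struct timespec start, end, ts;
    double ms, sleep_ms;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (fsync(fd) != 0)
        perror("fsync");
    clock_gettime(CLOCK_MONOTONIC, &end);

    ms = elapsed_ms(start, end);
    if (ms > threshold_ms)
    {
        sleep_ms = ms * delay_ratio;
        ts.tv_sec = (time_t) (sleep_ms / 1000.0);
        ts.tv_nsec = (long) ((sleep_ms - ts.tv_sec * 1000.0) * 1e6);
        nanosleep(&ts, NULL);   /* give concurrent I/O a comparable window */
    }
}

A 13 s fsync then buys concurrent backends roughly delay_ratio * 13 s of quieter I/O,
while fast fsyncs below the threshold cost nothing.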
I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small. When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big. But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase. I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments. Or just
assume the fsync phase will take 30 seconds. Or ... something. I'm
not really sure what the right model is here.
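Purely as an illustration of that "dumb" heuristic (the thread does not settle on a
model), the assumed fsync fraction could be computed like this:

static double
assumed_fsync_fraction(int checkpoint_segments)
{
    /*
     * Assume fsync is 10% of the checkpoint, but assume more when there is
     * so little data that the OS will have written back almost nothing
     * during the write phase.
     */
    if (checkpoint_segments < 10)
        return 1.0 / (double) checkpoint_segments;
    return 0.10;
}

With checkpoint_segments = 4 this budgets 25% of the checkpoint for fsyncs; with 300
it budgets 10%.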
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25.06.2013 23:03, Robert Haas wrote:
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.

Isn't the behavior implemented by the patch a reasonable approximation
of just that? When the fsyncs start to get slow, that's when we start
to sleep. I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.
Well, that's the point I was trying to make: you should sleep *before*
the fsyncs get slow.
The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.
Quite possible, I really don't know. I'm inclined to first try the
simplest thing possible, and only make it more complicated if that's not
good enough. Kondo-san's patch wasn't very complicated, but nevertheless
a fixed sleep between every fsync, unless you're behind the schedule, is
even simpler. In particular, it's easier to tie that into the checkpoint
scheduler - I'm not sure how you'd measure progress or determine how
long to sleep unless you assume that every fsync is the same.
I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small. When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big. But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase. I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments. Or just
assume the fsync phase will take 30 seconds.
If checkpoint_segments < 10, there isn't very much dirty data to flush
out. This isn't really a problem in that case - no matter how stupidly we
do the writing and fsyncing, the I/O cache can absorb it. It doesn't
really matter what we do in that case.
- Heikki
Thank you for comments!
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
Hmm, so the write patch doesn't do much, but the fsync patch makes the response
times somewhat smoother. I'd suggest that we drop the write patch for now, and
focus on the fsyncs.
The write patch is effective for TPS! I think the delay of checkpoint writes is
related to the long fsync times and the heavy load in the fsync phase, because the
data goes to the slow disk right in the write phase. Therefore, the combination of
the write patch and the fsync patch suit each other better than
the write patch alone. I think that the amount of WAL written at the beginning of a
checkpoint can indicate the effect of the write patch.
What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold
settings did you use with the fsync patch? It's disabled by default.
I used these parameters.
checkpointer_fsync_delay_ratio = 1
checkpointer_fsync_delay_threshold = 1000ms
As a matter of fact, I used a long sleep time for slow fsyncs.
The other main parameters are here.
checkpoint_completion_target = 0.7
checkpoint_smooth_target = 0.3
checkpoint_smooth_margin = 0.5
checkpointer_write_delay = 200ms
Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the
assumption that fsync phase is 10% of the total checkpoint duration. I suspect 100ms
is too small to have much effect, but that happens to be what we have
currently in
CheckpointWriteDelay(). Could you test this patch along with yours? If you can test
with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between
the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive the
test case is to those settings.
It seems an interesting algorithm! I will test it with the same settings and study
the essence of your patch.
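For reference, one reading of the scheme described above (a sketch, not the attached
patch itself): sleep a fixed 100 ms between fsyncs, but only while the checkpoint is
still ahead of its schedule, assuming the fsync phase is the last 10% of the
checkpoint. The helpers passed in are assumptions of this example.

#include <stdbool.h>
#include <unistd.h>

#define FSYNC_FIXED_DELAY_MS 100

static void
sync_files_with_fixed_delay(const int *fds, int nfiles,
                            int (*sync_one)(int fd),
                            bool (*on_schedule)(double progress))
{
    int i;

    for (i = 0; i < nfiles; i++)
    {
        sync_one(fds[i]);

        /* assume the fsync phase covers progress 0.9 .. 1.0 */
        double progress = 0.9 + 0.1 * (double) (i + 1) / nfiles;

        if (i + 1 < nfiles && on_schedule(progress))
            usleep(FSYNC_FIXED_DELAY_MS * 1000);   /* fixed pause, skipped when behind */
    }
}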
(2013/06/26 5:28), Heikki Linnakangas wrote:
On 25.06.2013 23:03, Robert Haas wrote:
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.

Isn't the behavior implemented by the patch a reasonable approximation
of just that? When the fsyncs start to get slow, that's when we start
to sleep. I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.

Well, that's the point I was trying to make: you should sleep *before* the fsyncs
get slow.
Actually, fsync time changes with the progress of the OS's background disk writes.
We cannot know the progress of the background disk writes before the fsyncs. I think
Robert's argument is right. Please see the following log messages.
* fsync of a file which had already been written to disk
DEBUG: 00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
DEBUG: 00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
DEBUG: 00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
DEBUG: 00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
DEBUG: 00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535
msec
DEBUG: 00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec
* fsync of a file which had not been written to disk very much
DEBUG: 00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759
msec
DEBUG: 00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075
msec
DEBUG: 00000: checkpoint sync: number=56 file=base/16384/16419.10
time=13848.237 msec
DEBUG: 00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836
msec
DEBUG: 00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
DEBUG: 00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec
I think it is wasteful to sleep after every fsync, including the short ones, and
fsync performance also varies with the hardware, such as the RAID card, the kind and
number of disks, and the OS. So it is difficult to set a fixed sleep time. My
proposed method will be more adaptive in these cases.
The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.

Quite possible, I really don't know. I'm inclined to first try the simplest thing
possible, and only make it more complicated if that's not good enough.
Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep between
every fsync, unless you're behind the schedule, is even simpler. In particular,
it's easier to tie that into the checkpoint scheduler - I'm not sure how you'd
measure progress or determine how long to sleep unless you assume that every
fsync is the same.
I think what is important in the fsync phase is finishing in as short a time as
possible without an IO freeze, keeping the checkpoint schedule, and being good for
the executing transactions. I will try to improve the patch from that point of view.
By the way, running the DBT-2 benchmark takes a long time (it may be four hours).
For that reason, I hope you don't mind my late replies very much! :-)
Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 26.06.2013 11:37, KONDO Mitsumasa wrote:
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
Hmm, so the write patch doesn't do much, but the fsync patch makes
the response
times somewhat smoother. I'd suggest that we drop the write patch
for now, and focus on the fsyncs.

The write patch is effective for TPS!
Your test results don't agree with that. You got 3465.96 TPS with the
write patch, and 3474.62 and 3469.03 without it. The fsync+write
combination got slightly more TPS than just the fsync patch, but only by
about 1%, and then the response times were worse.
- Heikki
(2013/06/26 20:15), Heikki Linnakangas wrote:
On 26.06.2013 11:37, KONDO Mitsumasa wrote:
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
Hmm, so the write patch doesn't do much, but the fsync patch makes
the response
times somewhat smoother. I'd suggest that we drop the write patch
for now, and focus on the fsyncs.

The write patch is effective for TPS!
Your test results don't agree with that. You got 3465.96 TPS with the write
patch, and 3474.62 and 3469.03 without it. The fsync+write combination got
slightly more TPS than just the fsync patch, but only by about 1%, and then the
response times were worse.
Please look at the DBT-2 result more carefully. The average latency with fsync+write
improved compared with the fsync patch alone. The 90%tile and maximum latency are
not the whole result but only part of the result in DBT-2, while the average and TPS
cover the whole result. Generally, when TPS becomes higher in a benchmark, the
checkpointer has to write more pages. Therefore the 90%tile and maximum are worse in
this case, and the same is generally true in other benchmark tests.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.

Quite possible, I really don't know. I'm inclined to first try the simplest
thing possible, and only make it more complicated if that's not good enough.
Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
between every fsync, unless you're behind the schedule, is even simpler.
I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well. I have also tried it and the resulting
behavior was unimpressive. It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter. See the timings Kondo-san posted
downthread: 100ms delays aren't going to let the system recover in any
useful way when the fsync can take 13 s for one file. On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see. You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes. Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.
In
particular, it's easier to tie that into the checkpoint scheduler - I'm not
sure how you'd measure progress or determine how long to sleep unless you
assume that every fsync is the same.
I think the thing to do is assume that the fsync phase will take 10%
or so of the total checkpoint time, but then be prepared to let the
checkpoint run a bit longer if the fsyncs end up being slow. As Greg
has pointed out during prior discussions of this, the normal scenario
when things get bad here is that there is no way in hell you're going
to fit the checkpoint into the originally planned time. Once all of
the write caches between PostgreSQL and the spinning rust are full,
the system is in trouble and things are going to suck. The hope is
that we can stop beating the horse while it is merely in intensive
care rather than continuing until the corpse is fully skeletized.
Fixed delays don't work because - to push an already-overdone metaphor
a bit further - we have no idea how much of a beating the horse can
take; we need something adaptive so that we respond to what actually
happens rather than making predictions that will almost certainly be
wrong a large fraction of the time.
To put this another way, when we start the fsync() phase, it often
consumes 100% of the available I/O on the machine, completely starving
every other process that might need any. This is certainly a
deficiency in the Linux I/O scheduler, but as they seem in no hurry to
fix it we'll have to cope with it as best we can. If you do the
fsyncs in fast succession (and 100ms separation might as well be no
separation at all), then the I/O starvation of the entire system
persists through the entire fsync phase. If, on the other hand, you
sleep for the same amount of time the previous fsync took, then on the
average, 50% of the machine's I/O capacity will be available for all
other system activity throughout the fsync phase, rather than 0%.
Now, unfortunately, this is still not that good, because it's often
the case that all of the fsyncs except one are reasonably fast, and
there's one monster one that is very slow. ext3 has a known bad
behavior that dumps all dirty data for the entire *filesystem* when
you fsync, which tends to create these kinds of effects. But even on a
better-behaved filesystem, like ext4, it's fairly common to have one
fsync that is painfully longer than all the others. So even with
this patch, there are still going to be cases where the whole system
becomes unresponsive. I don't see any way to do better without a
better kernel API, or a better I/O scheduler, but that doesn't mean we
shouldn't do at least this much.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(2013/06/28 0:08), Robert Haas wrote:
On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well. I have also tried it and the resulting
behavior was unimpressive. It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter. See the timings Kondo-san posted
downthread: 100ms delays aren't going to let the system recover in any
useful way when the fsync can take 13 s for one file. On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see. You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes. Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.
Yes. In the write phase the checkpointer writes numerous 8KB dirty pages, one per
SyncOneBuffer(), so a tiny (100ms) sleep can work well. But in the fsync phase the
checkpointer syncs whole relation files in each fsync(), so a tiny sleep cannot work
well; it needs a longer sleep time to recover IO performance. Since we do not know
the best sleep time in advance, we had better use the previous fsync time. And if we
want to prevent very long fsync times, we had better change the relation segment
size, whose default maximum is 1GB, to something smaller.
Back to the subject. Here are the test results for our patches. The fsync + write
combination did not get a good result last time, so I retried the benchmark under
the same conditions. It seems to get better performance than the past result.
* Performance result in DBT-2 (WH340)
| TPS 90%tile Average Maximum
---------------+---------------------------------------
original_0.7 | 3474.62 18.348328 5.739 36.977713
original_1.0 | 3469.03 18.637865 5.842 41.754421
fsync | 3525.03 13.872711 5.382 28.062947
write | 3465.96 19.653667 5.804 40.664066
fsync + write | 3586.85 14.459486 4.960 27.266958
Heikki's patch | 3504.3 19.731743 5.761 38.33814
* HTML result in DBT-2
http://pgstatsinfo.projects.pgfoundry.org/RESULT/
In the attached text, I also describe each checkpoint time. The fsync patch seemed
to take longer than without the fsync patch. However, the checkpoint schedule stays
on time within checkpoint_timeout and the allowable time. I think the most important
thing in the fsync phase is not finishing the checkpoint quickly, but definitely and
assuredly writing the pages out at the end of the checkpoint. So my fsync patch is
not working incorrectly.
My write patch still seems to hold a lot of riddles, so I will try to investigate
objective results and the theory behind its effect.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
Hi,
I tested with segsize=0.25GB, which is the maximum size of a relation segment file;
the default is 1GB, set via a configure option (./configure --with-segsize=0.25).
I thought a small segsize would be good for the fsync phase and for the OS's
background disk writes during checkpoints. I got significant improvements in the
DBT-2 result!
* Performance result in DBT-2 (WH340)
| NOTPM 90%tile Average Maximum
-----------------------------+---------------------------------------
original_0.7 (baseline) | 3474.62 18.348328 5.739 36.977713
fsync + write | 3586.85 14.459486 4.960 27.266958
fsync + write + segsize=0.25 | 3661.17 8.28816 4.117 17.23191
Changing segsize together with my checkpoint patches improved on the original by
over 50% in the 90%tile and maximum response times.
However, these tests were not run under the same conditions... I also changed the
SESSION parameter in the DBT-2 driver from 100 to 300. In general, I have heard that
a good SESSION value is 100, and I do not yet understand the optimized DBT-2
parameters very well. So I will retest my patches and the baseline with optimized
parameters in DBT-2. Please wait a while.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
segsize-instant.patch (text/x-diff)
diff --git a/configure b/configure
index 7c662c3..6269cb9 100755
--- a/configure
+++ b/configure
@@ -2879,7 +2879,7 @@ $as_echo "$as_me: error: Invalid block size. Allowed values are 1,2,4,8,16,32."
esac
{ $as_echo "$as_me:$LINENO: result: ${blocksize}kB" >&5
$as_echo "${blocksize}kB" >&6; }
-
+echo ${blocksize}
cat >>confdefs.h <<_ACEOF
#define BLCKSZ ${BLCKSZ}
@@ -2917,14 +2917,15 @@ else
segsize=1
fi
-
# this expression is set up to avoid unnecessary integer overflow
# blocksize is already guaranteed to be a factor of 1024
-RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-test $? -eq 0 || exit 1
+#RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
+RELSEG_SIZE=`echo 1024/$blocksize*$segsize*1024 | bc`
+#test $? -eq} 0 || exit 1
{ $as_echo "$as_me:$LINENO: result: ${segsize}GB" >&5
$as_echo "${segsize}GB" >&6; }
-
+echo ${segsize}
+echo ${RELSEG_SIZE}
cat >>confdefs.h <<_ACEOF
#define RELSEG_SIZE ${RELSEG_SIZE}
On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
I tested with segsize=0.25GB, which is the maximum size of a relation segment file;
the default is 1GB, set via a configure option (./configure --with-segsize=0.25).
I thought a small segsize would be good for the fsync phase and for the OS's
background disk writes during checkpoints. I got significant improvements in the DBT-2 result!
This is interesting. Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory. As
it is, the number of files that exist there today has caused
performance problems for some of our customers. I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.
If the problem is mainly with the number of files in the same
directory, we could consider revising our directory layout. Instead
of:
base/${DBOID}/${RELFILENODE}_{FORK}
We could have:
base/${DBOID}/${FORK}/${RELFILENODE}
That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly. That might be worth doing independently of the issue
you're raising here. For large clusters, you'd even want one more
level to keep the directories from getting too big:
base/${DBOID}/${FORK}/${X}/${RELFILENODE}
...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number. But this would not be as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.
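As a purely hypothetical illustration of such a layout (the thread does not fix the
fan-out scheme; using the low byte of the relfilenode here is only for the example):

#include <stdio.h>

/* Build a path of the form base/${DBOID}/${FORK}/${X}/${RELFILENODE},
 * where ${X} is two hex digits derived from the relfilenode. */
static void
format_fanout_path(char *buf, size_t len,
                   unsigned dboid, const char *fork, unsigned relfilenode)
{
    snprintf(buf, len, "base/%u/%s/%02x/%u",
             dboid, fork, relfilenode & 0xffU, relfilenode);
}

For example, format_fanout_path(buf, sizeof(buf), 16384, "main", 16413) yields
base/16384/main/1d/16413, so each fork directory fans out into at most 256
subdirectories.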
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-07-03 17:18:29 +0900, KONDO Mitsumasa wrote:
Hi,
I tested with segsize=0.25GB, which is the maximum size of a relation segment file;
the default is 1GB, set via a configure option (./configure --with-segsize=0.25).
I thought a small segsize would be good for the fsync phase and for the OS's
background disk writes during checkpoints. I got significant improvements in the DBT-2 result!

* Performance result in DBT-2 (WH340)
| NOTPM 90%tile Average Maximum
-----------------------------+---------------------------------------
original_0.7 (baseline) | 3474.62 18.348328 5.739 36.977713
fsync + write | 3586.85 14.459486 4.960 27.266958
fsync + write + segsize=0.25 | 3661.17 8.28816 4.117 17.23191

Changing segsize together with my checkpoint patches improved on the original by
over 50% in the 90%tile and maximum response times.
Hm. I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files. Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.
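A minimal sketch of that idea, assuming Linux (sync_file_range() is Linux-specific)
and with error handling pared down to the bare minimum:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES ((off_t) 32 * 1024 * 1024)

/* Kick off writeback in 32MB chunks, then make the whole file durable. */
static int
sync_in_chunks(int fd, off_t file_size)
{
    off_t off;

    for (off = 0; off < file_size; off += CHUNK_BYTES)
    {
        off_t nbytes = file_size - off;

        if (nbytes > CHUNK_BYTES)
            nbytes = CHUNK_BYTES;

        /* start asynchronous writeback of this range; does not wait for it */
        if (sync_file_range(fd, off, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
    }

    /* the final fsync() covers data and metadata for everything above */
    return fsync(fd);
}

The point is to keep the kernel's dirty pages draining steadily instead of letting a
single 1GB fsync() dump them all at once.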
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/07/13 01:31, Robert Haas wrote:
On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
I tested with segsize=0.25GB, which is the maximum size of a relation segment file;
the default is 1GB, set via a configure option (./configure --with-segsize=0.25).
I thought a small segsize would be good for the fsync phase and for the OS's
background disk writes during checkpoints. I got significant improvements in the DBT-2 result!

This is interesting. Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory. As
it is, the number of files that exist there today has caused
performance problems for some of our customers. I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.

If the problem is mainly with the number of files in the same
directory, we could consider revising our directory layout. Instead of:

base/${DBOID}/${RELFILENODE}_{FORK}
We could have:
base/${DBOID}/${FORK}/${RELFILENODE}
That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly. That might be worth doing independently of the issue
you're raising here. For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}
...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number. But this would not be as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.
16 bits ==> 4 hex digits
Could you perhaps start with 1 hex digit, and automagically increase it
to 2, 3, .. as needed? There could be a status file at that level, that
would indicate the current number of hex digits, plus a temporary
mapping file when in transition.
Cheers,
Gavin
(2013/07/03 22:31), Robert Haas wrote:
On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
I tested with segsize=0.25GB, which is the maximum size of a relation segment file;
the default is 1GB, set via a configure option (./configure --with-segsize=0.25).
I thought a small segsize would be good for the fsync phase and for the OS's
background disk writes during checkpoints. I got significant improvements in the DBT-2 result!

This is interesting. Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory. As
it is, the number of files that exist there today has caused
performance problems for some of our customers. I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.
Did you change the kernel parameter for the maximum number of FDs per process? In
the default setting, the maximum number of FDs per process is 1024. I think it might
go over the limit with a 500GB-class database. Or this problem might be caused by
_mdfd_getseg() in md.c. In the write phase, dirty buffers do not carry their own FD,
so for each dirty buffer we have to seek to find the right FD and check the file. I
think that is safe for file writing, but it might be too wasteful. I think that if
the BufferTag carried its own FD, checkpoint writing would be more efficient.
If the problem is mainly with the number of files in the same
directory, we could consider revising our directory layout. Instead of:

base/${DBOID}/${RELFILENODE}_{FORK}
We could have:
base/${DBOID}/${FORK}/${RELFILENODE}
That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly. That might be worth doing independently of the issue
you're raising here. For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}
...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number. But this would not be as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.
It seems a good idea! In general, the base directory is not seen by the user, so it
could be arranged more efficiently for performance and adapted for large databases.
(2013/07/03 22:39), Andres Freund wrote:
On 2013-07-03 17:18:29 +0900
Hm. I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.
The sync_file_range system call is interesting. But it is supported only by Linux
kernel 2.6.22 or later. For PostgreSQL, Robert's idea, which does not depend on the
kind of OS, seems a better fit.
Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.
Yes, I will try testing the setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.
I think the best way to write buffers in a checkpoint is to sort them by the
buffer's file and block number, with a small segsize setting and suitable sleep
times for each. That would realize a genuine sorted checkpoint with sequential disk writing!
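A small sketch of the sorting part of that idea, with a made-up DirtyBuf struct
standing in for whatever the checkpointer would actually track:

#include <stdint.h>
#include <stdlib.h>

typedef struct DirtyBuf
{
    uint32_t relfilenode;   /* which relation file the page belongs to */
    uint32_t blocknum;      /* block number within that file */
} DirtyBuf;

/* Order pending writes by file, then by block, so both the writes and the
 * later fsyncs hit each file sequentially. */
static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = (const DirtyBuf *) a;
    const DirtyBuf *y = (const DirtyBuf *) b;

    if (x->relfilenode != y->relfilenode)
        return (x->relfilenode < y->relfilenode) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* usage: qsort(bufs, nbufs, sizeof(DirtyBuf), dirtybuf_cmp); */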
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
On 2013-07-04 21:28:11 +0900, KONDO Mitsumasa wrote:
That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly. That might be worth doing independently of the issue
you're raising here. For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}
...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number. But this would be not as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.

It seems a good idea! In general, the base directory is not seen by the user, so it
could be arranged more efficiently for performance and adapted for large databases.

Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.

Yes, I will try testing the setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.
I don't like going in this direction at all:
1) it breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this and most packagers would carry the old
segsize around forever.
Even if we could get pg_upgrade to split files accordingly link mode
would still be broken.
2) It drastically increases the amount of file handles necessary and by
extension increases the amount of open/close calls. Those aren't all
that cheap. And it increases metadata traffic since mtime/atime are
kept for more files. Also, file creation is rather expensive since it
requires metadata transaction on the filesystem level.
3) It breaks readahead since that usually only works within a single
file. I am pretty sure that this will significantly slow down
uncached sequential reads on larger tables.
(2013/07/03 22:39), Andres Freund wrote:
On 2013-07-03 17:18:29 +0900
Hm. I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.

The sync_file_range system call is interesting. But it is supported only by Linux
kernel 2.6.22 or later. For PostgreSQL, Robert's idea, which does not depend on the
kind of OS, seems a better fit.
Well. But it can be implemented without breaking things... Even if we
don't have sync_file_range() we can cope by simply doing fsync()s more
frequently. For every open file keep track of the amount of buffers
dirtied and every 32MB or so issue an fdatasync()/fsync().
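A minimal sketch of that bookkeeping, with the tracking struct being an assumption of
this example rather than code from the thread:

#include <stdint.h>
#include <unistd.h>

#define SYNC_EVERY_BYTES ((uint64_t) 32 * 1024 * 1024)

typedef struct FileWriteState
{
    int      fd;                   /* open file being written during the checkpoint */
    uint64_t dirtied_since_sync;   /* bytes written since the last flush */
} FileWriteState;

/* Call after each write; flushes in roughly 32MB doses instead of one big fsync. */
static int
note_write_and_maybe_sync(FileWriteState *st, uint64_t bytes_written)
{
    st->dirtied_since_sync += bytes_written;
    if (st->dirtied_since_sync >= SYNC_EVERY_BYTES)
    {
        st->dirtied_since_sync = 0;
        return fdatasync(st->fd);
    }
    return 0;
}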
I think the best way to write buffers in a checkpoint is to sort them by the
buffer's file and block number, with a small segsize setting and suitable sleep
times for each. That would realize a genuine sorted checkpoint with sequential
disk writing!
That would make regular fdatasync()ing even easier.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
I don't like going in this direction at all:
1) it breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this and most packagers would carry the old
segsize around forever.
Even if we could get pg_upgrade to split files accordingly link mode
would still be broken.
TBH, I think *any* rearrangement of the on-disk storage files is going
to be rejected. It seems very unlikely to me that you could demonstrate
a checkpoint performance improvement from that that occurs consistently
across different platforms and filesystems. And as Andres points out,
the pain associated with it is going to be bad enough that a very high
bar will be set on whether you've proven the change is worthwhile.
regards, tom lane
On 07/04/2013 06:05 AM, Andres Freund wrote:
Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.

Yes, I will try testing the setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.
I did testing on this a few years ago, I tried with 2MB segments over
16MB thinking similarly to you. It failed miserably, performance
completely tanked.
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats
(2013/07/05 0:35), Joshua D. Drake wrote:
On 07/04/2013 06:05 AM, Andres Freund wrote:
Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.Yes, I try to test this setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.

I did testing on this a few years ago; I tried 2MB segments over 16MB,
thinking similarly to you. It failed miserably; performance completely tanked.
Just as you say, the test result was miserable... Too small a segsize is bad for
performance. It might be improved by separate directories, but too many FDs with
open() and close() seem to be bad. However, I think this implementation has the
potential to improve IO performance, so we need to keep testing it with several
methods.
* Performance result in DBT-2 (WH340)
| NOTPM 90%tile Average Maximum
--------------------------------+-----------------------------------
original_0.7 (baseline) | 3474.62 18.348328 5.739 36.977713
fsync + write | 3586.85 14.459486 4.960 27.266958
fsync + write + segsize=0.25 | 3661.17 8.28816 4.117 17.23191
fsync + write + segsize=0.03125 | 3309.99 10.851245 6.759 19.500598
(2013/07/04 22:05), Andres Freund wrote:
1) it breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this and most packagers would carry the old
segsize around forever.
Even if we could get pg_upgrade to split files accordingly link mode
would still be broken.
I think that pg_upgrade is one of the contrib modules, not the main implementation
of Postgres, so contrib should not stand in the way of improving the main
implementation. pg_upgrade users might share the same opinion.
2) It drastically increases the amount of file handles necessary and by
extension increases the amount of open/close calls. Those aren't all
that cheap. And it increases metadata traffic since mtime/atime are
kept for more files. Also, file creation is rather expensive since it
requires metadata transaction on the filesystem level.
My test result seemed to show this problem. But my test did not use separate
directories under base/, and I am not sure which way is best. If you have time to
create a patch, please send it to us, and I will try to test it in DBT-2.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
I have created the fsync v2 patch. There is not much time, so I will focus on the
fsync patch in this commitfest, as advised by Heikki. And I am sorry that it is not
good to diverge from the main discussion in this commitfest... Of course, I will
continue to work on the other improvements.
* Changes
- Add a ckpt_flags argument to mdsync() etc., with reference to Heikki's patch. It
makes mdsync() more controllable during a checkpoint.
- A too long sleep in the fsync phase is not good for the checkpoint schedule, so I
set a limited sleep time which is always less than 10 seconds (MAX_FSYNC_SLEEP).
I think a 10 second limit is a suitable value in various situations.
I also considered limiting the sleep time by checkpoint progress;
however, I thought md.c should be simple and remain robust, so I have kept it simple.
- The maximum of checkpointer_fsync_delay_ratio in guc.c is changed from 1 to 2,
because I set the limited sleep time to 10 seconds. We can change it more flexibly
and more safely.
I also considered trimming the parameters of my fsync patch.
* checkpointer_fsync_delay_threshold
In general, I think about 1 second is suitable in various environments. If we want
to adjust the sleep time in the fsync phase, we can change
checkpointer_fsync_delay_ratio.
* checkpointer_fsync_delay_ratio
I do not want to omit this parameter, because it is the only way to regulate the
sleep time in the fsync phase and thus the checkpoint time.
* Benchmark Result(DBT-2)
| NOTPM Average 90%tile Maximum
------------------------+----------------------------------------
original_0.7 (baseline) | 3610.42 4.556 10.9180 23.1326
fsync v1 | 3685.51 4.036 9.2017 17.5594
fsync v2 | 3748.80 3.562 8.1871 17.5101
I am not sure about this result; the fsync v2 patch looks almost too good. Of course
I did not do anything special while running the benchmark.
Please see checkpoint_time.txt, which describes each checkpoint in detail. The fsync
v2 patch seems to make each checkpoint shorter.
* Benchmark Setting
[postgresql.conf]
archive_mode = on
archive_command = '/bin/cp %p /pgdata/pgarch/arc_dbt2/%f'
synchronous_commit = on
max_connections = 300
shared_buffers = 2458MB
work_mem = 1MB
fsync = on
wal_sync_method = fdatasync
full_page_writes = on
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7
segsize=1GB(default)
[patched postgresql.conf (add)]
checkpointer_fsync_delay_ratio = 1
checkpointer_fsync_delay_threshold = 1000ms
[DBT-2 driver settings]
SESSION:250
WH:340
TPW:10
PRETEST_DURATION: 1800
TEST_DURATION: 1800
* Test Server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Add) Set off energy efficient function in BIOS and OS.
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
fsync-patch_v2.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..2b223e9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
*/
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -169,7 +171,6 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +644,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..3f02d0b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1828,7 +1828,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ smgrsync(flags);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..d762511 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -44,6 +45,9 @@
#define FSYNCS_PER_ABSORB 10
#define UNLINKS_PER_ABSORB 10
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
/*
* Special values for the segno arg to RememberFsyncRequest.
*
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
/* Perform any pending fsyncs we may have queued up, then drop table */
if (pendingOpsTable)
{
- mdsync();
+ mdsync(CHECKPOINT_IMMEDIATE);
hash_destroy(pendingOpsTable);
}
pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
* mdsync() -- Sync previous writes to stable storage.
*/
void
-mdsync(void)
+mdsync(int ckpt_flags)
{
static bool mdsync_in_progress = false;
@@ -1171,6 +1177,28 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+ * for giving priority to executing transaction.
+ */
+ if(CheckPointerFsyncDelayThreshold >= 0 &&
+ CheckPointerFsyncDelayRatio > 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ !(ckpt_flags & CHECKPOINT_FORCE) &&
+ !(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+ {
+ double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+ /* Too long sleep is not good for checkpoint scheduler */
+ if(fsync_sleep > MAX_FSYNC_SLEEP)
+ fsync_sleep = MAX_FSYNC_SLEEP;
+ pg_usleep(fsync_sleep * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ fsync_sleep);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..bc07b03 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
+ void (*smgr_sync) (int ckpt_flags); /* may be NULL */
void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
@@ -708,14 +708,18 @@ smgrpreckpt(void)
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
-smgrsync(void)
+smgrsync(int ckpt_flags)
{
int i;
+ /*
+ * XXX: If we ever have more than one smgr, the remaining progress
+ * should somehow be divided among all smgrs.
+ */
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
- (*(smgrsw[i].smgr_sync)) ();
+ (*(smgrsw[i].smgr_sync)) (ckpt_flags);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 2.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..a02ba1f 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..f796ab7 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags);
extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags);
extern void mdpostckpt(void);
extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
Hi,
I created fsync v3, v4, and v5 patches and tested them.
* Changes
- Take the total checkpoint schedule into account in the fsync phase (v3, v4, v5)
- Take the total checkpoint schedule into account in the write phase (v4 only)
- Modify some implementation details from v3 (v5 only)
I use a linear combination to account for the total checkpoint schedule across the
write phase and the fsync phase. The v3 patch considers only the fsync phase, the v4
patch considers both the write phase and the fsync phase, and the v5 patch again
considers only the fsync phase.
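The linear combination amounts to the progress formula already visible in the patch:
the write phase is assumed to reach progress_at_begin (0.9 in the patch), and the
fsync phase fills the remainder as files are processed. Restated as a standalone
function:

/* progress reported to the checkpoint scheduler while the fsync phase runs */
static double
checkpoint_progress(double progress_at_begin, int processed, int num_to_process)
{
    return progress_at_begin +
           (1.0 - progress_at_begin) * (double) processed / num_to_process;
}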
The test results are below. The benchmark settings and the server are the same as in
the previous test. '-*' shows checkpoint_completion_target in each test. All tests
except 'fsync v3_disabled' set 'checkpointer_fsync_delay_ratio=1' and
'checkpointer_fsync_delay_threshold=1000'; 'fsync v3_disabled' sets
'checkpointer_fsync_delay_ratio=0' and 'checkpointer_fsync_delay_threshold= -1'.
The v5 patch is still being tested :-), but it should score about the same as the v3 patch.
* Result
** DBT-2 result
| NOTPM | 90%tile | Average | S.Deviation | Maximum
---------------------+-----------+---------+---------+-------------+--------
fsync v3-0.7 | 3649.02 | 9.703 | 4.226 | 3.853 | 21.754
fsync v3-0.9 | 3694.41 | 9.897 | 3.874 | 4.016 | 20.774
fsync v3-0.7_disabled| 3583.28 | 10.966 | 4.684 | 4.866 | 31.545
fsync v4-0.7 | 3546.38 | 12.734 | 5.062 | 4.798 | 24.468
fsync v4-0.9 | 3670.81 | 9.864 | 4.130 | 3.665 | 19.236
** Average checkpoint duration (sec) (not including loading time)
| write_duration | sync_duration | total | punctual to checkpoint schedule
---------------------+----------------+---------------+--------+--------------------------------
fsync v3-0.7 | 296.6 | 251.8898 | 548.48 | OK
fsync v3-0.9 | 292.086 | 276.4525 | 568.53 | OK
fsync v3-0.7_disabled| 303.5706 | 155.6116 | 459.18 | OK
fsync v4-0.7 | 273.8338 | 355.6224 | 629.45 | OK
fsync v4-0.9 | 329.0522 | 231.77 | 560.82 | OK
** Increase of checkpoint duration (%) (Reference point is 'fsync v3-0.7_disabled'.)
| write_duration | sync_duration | total
---------------------+----------------+---------------+-------
fsync v3-0.7 | 97.7% | 161.9% | 119.4%
fsync v3-0.9 | 96.2% | 177.7% | 123.8%
fsync v3-0.7_disabled| 100.0% | 100.0% | 100.0%
fsync v4-0.7 | 90.2% | 228.5% | 137.1%
fsync v4-0.9 | 108.4% | 148.9% | 122.1%
* Examination
** DBT-2 result
The v3 patch seems to give a good result: response times are about 10%-30% faster
and NOTPM increases about 5% compared with no sleep (fsync v3-0.7_disabled), while
the v4 patch does not give a good result. However, 'fsync v4-0.9' scores the same as
the v3 patch when checkpoint_completion_target is larger. I think that factoring the
checkpoint schedule into both the write phase and the fsync phase makes the IO
schedule harsher, because the write phase IO schedule becomes stricter than a normal
write phase, and that also hurts the fsync phase that follows.
** Average checkpoint duration
All methods are punctual to the checkpoint schedule. With fsync sleep enabled the
fsync time is longer, but the total time is much the same as with no sleep.
'fsync v4-0.7' gets a very bad sync duration and total time, which indicates that
changing checkpoint_completion_target is very delicate. It is better not to change
the write phase scheduling; it should stay the same as before. In the write phase
with normal settings there is plenty of time to keep to the checkpoint schedule.
And I think many users want to stay compatible with the old version.
What do you think about these patches?
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Attachments:
Improvement_of_checkpoint_IO-scheduler_in_fsync_v3.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
*/
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
/* Prototypes for private functions */
static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
* checkpoint, and returns true if the progress we've made this far is greater
* than the elapsed time/segments.
*/
-static bool
+extern bool
IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a09adad 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1828,7 +1828,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ smgrsync(flags, 0.9);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..ee67edf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -44,6 +45,9 @@
#define FSYNCS_PER_ABSORB 10
#define UNLINKS_PER_ABSORB 10
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
/*
* Special values for the segno arg to RememberFsyncRequest.
*
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
/* Perform any pending fsyncs we may have queued up, then drop table */
if (pendingOpsTable)
{
- mdsync();
+ mdsync(CHECKPOINT_IMMEDIATE, 0.0);
hash_destroy(pendingOpsTable);
}
pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
* mdsync() -- Sync previous writes to stable storage.
*/
void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
{
static bool mdsync_in_progress = false;
@@ -984,6 +990,7 @@ mdsync(void)
/* Statistics on sync times */
int processed = 0;
+ int num_to_process;
instr_time sync_start,
sync_end,
sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
+ num_to_process = hash_get_num_entries(pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
ForkNumber forknum;
@@ -1171,6 +1179,28 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+ * for giving priority to executing transaction.
+ */
+ if(CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ !(ckpt_flags & CHECKPOINT_FORCE) &&
+ !(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+ IsCheckpointOnSchedule(progress_at_begin + (1.0 - progress_at_begin) * (double) processed / num_to_process))
+ {
+ double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+ /* Too long sleep is not good for checkpoint scheduler */
+ if(fsync_sleep > MAX_FSYNC_SLEEP)
+ fsync_sleep = MAX_FSYNC_SLEEP;
+ pg_usleep(fsync_sleep * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ fsync_sleep);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..6a5cc0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
+ void (*smgr_sync) (int ckpt_flags, double progress_at_begin); /* may be NULL */
void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
@@ -708,14 +708,18 @@ smgrpreckpt(void)
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
{
int i;
+ /*
+ * XXX: If we ever have more than one smgr, the remaining progress
+ * should somehow be divided among all smgrs.
+ */
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
- (*(smgrsw[i].smgr_sync)) ();
+ (*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 2.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
extern void mdpostckpt(void);
extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
Improvement_of_checkpoint_IO-scheduler_in_fsync_v4.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
*/
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
/* Prototypes for private functions */
static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
* checkpoint, and returns true if the progress we've made this far is greater
* than the elapsed time/segments.
*/
-static bool
+extern bool
IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..9f4177a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
#define DROP_RELS_BSEARCH_THRESHOLD 20
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CHECKPOINT_SCHEDULE_RATIO 0.9
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -94,7 +97,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double ckpt_schedule_ratio);
static int SyncOneBuffer(int buf_id, bool skip_recently_used);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1210,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
* remaining flags currently have no effect here.
*/
static void
-BufferSync(int flags)
+BufferSync(int flags, double ckpt_schedule_ratio)
{
int buf_id;
int num_to_scan;
@@ -1319,7 +1322,7 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, ckpt_schedule_ratio * (double) num_written / num_to_write);
}
}
@@ -1825,10 +1828,10 @@ CheckPointBuffers(int flags)
{
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
- BufferSync(flags);
+ BufferSync(flags, CHECKPOINT_SCHEDULE_RATIO);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ smgrsync(flags, CHECKPOINT_SCHEDULE_RATIO);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -44,6 +45,9 @@
#define FSYNCS_PER_ABSORB 10
#define UNLINKS_PER_ABSORB 10
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
/*
* Special values for the segno arg to RememberFsyncRequest.
*
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
/* Perform any pending fsyncs we may have queued up, then drop table */
if (pendingOpsTable)
{
- mdsync();
+ mdsync(CHECKPOINT_IMMEDIATE, 0.0);
hash_destroy(pendingOpsTable);
}
pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
* mdsync() -- Sync previous writes to stable storage.
*/
void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
{
static bool mdsync_in_progress = false;
@@ -984,6 +990,7 @@ mdsync(void)
/* Statistics on sync times */
int processed = 0;
+ int num_to_process;
instr_time sync_start,
sync_end,
sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
+ num_to_process = hash_get_num_entries(pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
ForkNumber forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+ * for giving priority to executing transaction.
+ */
+ if(CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ !(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+ !(ckpt_flags & CHECKPOINT_FORCE) &&
+ !(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+ IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+ {
+ double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+ /* Too long sleep is not good for checkpoint scheduler */
+ if(fsync_sleep > MAX_FSYNC_SLEEP)
+ fsync_sleep = MAX_FSYNC_SLEEP;
+ pg_usleep(fsync_sleep * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ fsync_sleep);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..e704b52 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
+ void (*smgr_sync) (int ckpt_flags, double progress_at_begin); /* may be NULL */
void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
@@ -708,14 +708,18 @@ smgrpreckpt(void)
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
{
int i;
+ /*
+ * XXX: If we ever have more than one smgr, the remaining progress
+ * should somehow be divided among all smgrs.
+ */
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
- (*(smgrsw[i].smgr_sync)) ();
+ (*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 2.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
extern void mdpostckpt(void);
extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
Improvement_of_checkpoint_IO-scheduler_in_fsync_v5.patch (text/x-diff)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
*/
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
/* Prototypes for private functions */
static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
* checkpoint, and returns true if the progress we've made this far is greater
* than the elapsed time/segments.
*/
-static bool
+extern bool
IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..93a879a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
#define DROP_RELS_BSEARCH_THRESHOLD 20
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CKPT_SCHEDULE_RATIO 0.9
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -1828,7 +1831,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ smgrsync(flags, CKPT_SCHEDULE_RATIO);
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -44,6 +45,9 @@
#define FSYNCS_PER_ABSORB 10
#define UNLINKS_PER_ABSORB 10
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
/*
* Special values for the segno arg to RememberFsyncRequest.
*
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
/* Perform any pending fsyncs we may have queued up, then drop table */
if (pendingOpsTable)
{
- mdsync();
+ mdsync(CHECKPOINT_IMMEDIATE, 0.0);
hash_destroy(pendingOpsTable);
}
pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
* mdsync() -- Sync previous writes to stable storage.
*/
void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
{
static bool mdsync_in_progress = false;
@@ -984,6 +990,7 @@ mdsync(void)
/* Statistics on sync times */
int processed = 0;
+ int num_to_process;
instr_time sync_start,
sync_end,
sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
+ num_to_process = hash_get_num_entries(pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
ForkNumber forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+ * for giving priority to executing transaction.
+ */
+ if(CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ !(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+ !(ckpt_flags & CHECKPOINT_FORCE) &&
+ !(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+ IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+ {
+ double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+ /* Too long sleep is not good for checkpoint scheduler */
+ if(fsync_sleep > MAX_FSYNC_SLEEP)
+ fsync_sleep = MAX_FSYNC_SLEEP;
+ pg_usleep(fsync_sleep * 1000L);
+ if(log_checkpoints)
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ fsync_sleep);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..da68900 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
+ void (*smgr_sync) (int ckpt_flags, double ckpt_schedule_ratio); /* may be NULL */
void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
@@ -708,14 +708,18 @@ smgrpreckpt(void)
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
{
int i;
+ /*
+ * XXX: If we ever have more than one smgr, the remaining progress
+ * should somehow be divided among all smgrs.
+ */
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
- (*(smgrsw[i].smgr_sync)) ();
+ (*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
}
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 2.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..b4b3a9d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 2.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds. -1 is disable.
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
extern void mdpostckpt(void);
extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.
That's exactly right. When a checkpoint finishes the OS write cache is
clean. That means all of the full-page writes aren't even hitting disk
in many cases. They just pile up in the OS dirty memory, often sitting
there all the way until when the next checkpoint fsyncs start. That's
why I never wandered down the road of changing FPW behavior. I have
never seen a benchmark workload hit a write bottleneck until long after
the big burst of FPW pages is over.
I could easily believe that there are low-memory systems where the FPW
write pressure becomes a problem earlier. And slim VMs make sense as
the place this behavior is being seen at.
I'm a big fan of instrumenting the code around a performance change
before touching anything, as a companion patch that might make sense to
commit on its own. In the case of a change to FPW spacing, I'd want to
see some diagnostic output in something like pg_stat_bgwriter that
tracks how many FPW pages are being modified. A
pgstat_bgwriter.full_page_writes counter would be perfect here, and then
graph that data over time as the benchmark runs.
Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.
There I also think the right way to proceed is instrumenting that area
first.
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.
I updated and re-reviewed that in 2011:
/messages/by-id/4D31AE64.3000202@2ndquadrant.com
and commented on why I think the improvement was difficult to reproduce
back then. The improvement didn't follow for me either. It would take
a really amazing bit of data to get me to believe write sorting code is
worthwhile after that. On large systems capable of dirtying enough
blocks to cause a problem, the operating system and RAID controllers are
already sorting blocks. And *that* sorting is also considering
concurrent read requests, which are a lot more important to an efficient
schedule than anything the checkpoint process knows about. The database
doesn't have nearly enough information yet to compete against OS level
sorting.
Bad point of my patch is longer checkpoint. Checkpoint time was increased about
10% - 20%. But it can work correctly on schedule-time in checkpoint_timeout.
Please see checkpoint result (http://goo.gl/NsbC6).

For a fair comparison, you should increase the checkpoint_completion_target of the
unpatched test, so that the checkpoints run for roughly the same amount of time with
and without the patch. Otherwise the benefit you're seeing could be just because of a
lazier checkpoint.
Heikki has nailed the problem with the submitted dbt-2 results here. If
you spread checkpoints out more, you cannot fairly compare the resulting
TPS or latency numbers anymore.
Simple example: 20 minute long test. Server A does a checkpoint every
5 minutes. Server B has modified parameters or server code such that
checkpoints happen every 6 minutes. If you run both to completion, A
will have hit 4 checkpoints that flush the buffer cache, B only 3. Of
course B will seem faster. It didn't do as much work.
pgbench_tools measures the number of checkpoints during the test, as
well as the buffer count statistics. If those numbers are very
different between two tests, I have to throw them out as unfair. A lot
of things that seem promising turn out to have this sort of problem.
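As a concrete illustration of the adjustment Heikki suggests, the unpatched baseline
would change only something like this in postgresql.conf (the 0.85 is a made-up value;
in practice you keep raising it until the baseline's average time between checkpoints
matches the patched runs):

#checkpoint_completion_target = 0.5    # stock default
checkpoint_completion_target = 0.85    # hypothetical stretched value for the unpatched baseline

Only once both configurations show about the same total checkpoint duration and
checkpoint count do the NOTPM and latency numbers become comparable.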
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 6/27/13 11:08 AM, Robert Haas wrote:
I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well.
That's correct, I spent about a year whipping that particular horse and
submitted improvements on it to the community.
/messages/by-id/4D4F9A3D.5070700@2ndquadrant.com
and its updates downthread are good ones to compare this current work
against.
The important thing to realize about just delaying fsync calls is that
it *cannot* increase TPS throughput. Not possible in theory, obviously
doesn't happen in practice. The most efficient way to write things out
is to delay those writes as long as possible. The longer you postpone a
write, the more elevator sorting and write combining you get out of the
OS. This is why operating systems like Linux come tuned for such
delayed writes in the first place. Throughput and latency are linked;
any patch that aims to decrease latency will probably slow throughput.
Accordingly, the current behavior--no delay--is already the best
possible throughput. If you apply a write timing change and it seems to
increase TPS, that's almost certainly because it executed fewer
checkpoint writes. It's not a fair comparison. You have to adjust any
delaying to still hit the same end point on the checkpoint schedule.
That's what my later submissions did, and under that sort of controlled
condition most of the improvements went away.
Now, I still do really believe that better spacing of fsync calls helps
latency in the real world. Far as I know the server that I developed
that patch for originally in 2010 is still running with that change.
The result is not a throughput change though; there is a throughput drop
with a latency improvement. That is the unbreakable trade-off in this
area if all you touch is scheduling.
The reason why I was ignoring this discussion and working on pgbench
throttling until now is that you need to measure latency at a constant
throughput to advance here on this topic, and that's exactly what the
new pgbench feature enables. If we can take the current checkpoint
scheduler and an altered one, run both at exactly the same rate, and one
gives lower latency, now we're onto something. It's possible to do that
with DBT-2 as well, but I wanted something really simple that people
could replicate results with in pgbench.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 7/3/13 9:39 AM, Andres Freund wrote:
I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.
The fsync calls decompose into the queued set of block writes. If
they all need to go out eventually to finish a checkpoint, the most
efficient way from a throughput perspective is to dump them all at once.
I'm not sure sync_file_range targeting checkpoint writes will turn out
any differently than block sorting. Let's say the database tries to get
involved in forcing a particular write order that way. Right now it's
going to be making that ordering decision without the benefit of also
knowing what blocks are being read. That makes it hard to do better
than the OS, which knows a different--and potentially more useful in a
read-heavy environment--set of information about all the pending I/O.
And it would be very expensive to made all the backends start sharing
information about what they read to ever pull that logic into the
database. It's really easy to wander down the path where you assume you
must know more than the OS does, which leads to things like direct I/O.
I am skeptical of that path in general. I really don't want Postgres
to be competing with the innovation rate in Linux kernel I/O if we can
ride it instead.
One idea I was thinking about that overlaps with a sync_file_range
refactoring is simply tracking how many blocks have been written to each
relation. If there was a rule like "fsync any relation that's gotten
more than 100 8K writes", we'd never build up the sort of backlog that
causes the worst latency issues. You really need to start tracking the
file range there, just to fairly account for multiple writes to the same
block. One of the reasons I don't mind all the work I'm planning to put
into block write statistics is that I think that will make it easier to
build this sort of facility too. The original page write and the fsync
call that eventually flushes it out are very disconnected right now, and
file range data seems the right missing piece to connect them well.
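A minimal C sketch of that counting rule (every name here is a hypothetical
illustration, not an existing PostgreSQL function; a real version would live near the
fsync request queue and, as noted above, track file ranges rather than a bare counter):

#define EARLY_FSYNC_WRITE_THRESHOLD 100     /* 8K block writes before forcing an early sync */

typedef struct RelWriteCounter
{
    unsigned int relfilenode;       /* which relation file this counter tracks */
    int          writes_since_sync; /* 8K block writes since it was last synced */
} RelWriteCounter;

/* Hypothetical hook called after each 8K block write to a relation file. */
static void
note_block_write(RelWriteCounter *counter)
{
    counter->writes_since_sync++;
    if (counter->writes_since_sync >= EARLY_FSYNC_WRITE_THRESHOLD)
    {
        /* placeholder: a real version would queue an early fsync request here */
        counter->writes_since_sync = 0;
    }
}

The point of such a rule is only to bound how much dirty data any single relation can
accumulate between syncs, so the final fsync pass never has a huge backlog to flush at
once.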
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 14/07/2013 20:13, Greg Smith wrote:
The most efficient way to write things out is to delay those writes as
long as possible.
That doesn't smell right to me. It might be that delaying allows more
combining and allows the kernel to see more at once and optimise it, but
I think the counter-argument is that it is an efficiency loss to have
either CPU or disk idle waiting on the other. It cannot make sense from
a throughput point of view to have disks doing nothing and then become
overloaded so they are a bottleneck (primarily seeking) and the CPU does
nothing.
Now I have NOT measured behaviour but I'd observe that we see disks that
can stream 100MB/s but do only 5% of that if they are doing random IO.
Some random seeks during sync can't be helped, but if they are done when
we aren't waiting for sync completion then they are in effect free. The
flip side is that we can't really know whether they will get merged with
adjacent writes later, so it's hard to schedule them early. But we can
observe that if we have a bunch of writes to adjacent data then a seek
to do the write is effectively amortised across them.
So it occurs to me that perhaps we can watch for patterns where we have
groups of adjacent writes that might stream, and when they form we might
schedule them to be pushed out early (if not immediately), ideally out
as far as the drive (but not flushed from its cache) and without forcing
all other data to be flushed too. And perhaps we should always look to
be getting drives dedicated to dbms to do something, even if it turns
out to have been redundant in the end.
That's not necessarily easy on Linux without using direct unbuffered IO, but to me
that is Linux's problem. For a start it's not the only target system, and having
'we need' feedback from db and mail system groups to the NT kernel devs hasn't hurt,
and it never hurt Solaris to hear what Oracle and Sybase devs felt they needed either.
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote:
I use a linear combination method to consider the total checkpoint schedule across the
write phase and the fsync phase. The v3 patch considered only the fsync phase, the v4
patch considered both the write phase and the fsync phase, and the v5 patch considered
only the fsync phase.
Your v5 now looks like my "Self-tuning checkpoint sync spread" series:
https://commitfest.postgresql.org/action/patch_view?id=514 which I did
after deciding write phase delays didn't help. It looks to me like
some, maybe all, of your gain is coming from how any added delays spread
out the checkpoints. The "self-tuning" part I aimed at was trying to
stay on exactly the same checkpoint end time even with the delays in
place. I got that part to work, but the performance gain went away once
the schedule was a fair comparison. You are trying to solve a very hard
problem.
How long are you running your dbt-2 tests for? I didn't see that listed
anywhere.
** Average checkpoint duration (sec) (not including loading time)
                     | write_duration | sync_duration | total  | punctual to checkpoint schedule
---------------------+----------------+---------------+--------+--------------------------------
fsync v3-0.7         | 296.6          | 251.8898      | 548.48 | OK
fsync v3-0.9         | 292.086        | 276.4525      | 568.53 | OK
fsync v3-0.7_disabled| 303.5706       | 155.6116      | 459.18 | OK
fsync v4-0.7         | 273.8338       | 355.6224      | 629.45 | OK
fsync v4-0.9         | 329.0522       | 231.77        | 560.82 | OK
I graphed the total times against the resulting NOTPM values and
attached that. I expect transaction rate to increase along with the
time between checkpoints, and that's what I see here. The fsync v4-0.7
result is worse than the rest for some reason, but all the rest line up
nicely.
Notice how fsync v3-0.7_disabled has the lowest total time between
checkpoints, at 459.18. That is why it has the most I/O and therefore
runs more slowly than the rest. If you take your fsync v3-0.7_disabled
and increase checkpoint_segments and/or checkpoint_timeout until that
test is averaging about 550 seconds between checkpoints, NOTPM should
also increase. That's interesting to know, but you don't need any
change to Postgres for that. That's what always happens when you have
less checkpoints per run.
If you get a checkpoint time table like this where the total duration is
very close--within +/-20 seconds is the sort of noise I would expect
there--at that point I would say you have all your patches on the same
checkpoint schedule. And then you can compare the NOTPM numbers
usefully. When the checkpoint times are in a large range like 459.18 to
629.45 in this table, as my graph shows the associated NOTPM numbers are
going to be based on that time.
I would recommend taking a snapshot of pg_stat_bgwriter before and after
the test runs, and then showing the difference between all of those
numbers too. If the test runs for a while--say 30 minutes--the total
number of checkpoints should be very close too.
* Test Server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Additionally, energy-saving functions were disabled in the BIOS and OS.)
Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and
that same P410 controller + 4 disks. I've been meaning to get DBT-2
running on there usefully, your research gives me a reason to do that.
You seem to be in a rush due to the commitfest schedule. I have some
bad news for you there. You're not going to see a change here committed
in this CF based on where it's at, so you might as well think about the
best longer term plan. I would be shocked if anything came out of this
in less than 3 months really. That's the shortest amount of time I've
ever done something useful in this area. Each useful benchmarking run
takes me about 3 days of computer time, it's not a very fast development
cycle.
Even if all of your results were great, we'd need to get someone to
duplicate them on another server, and we'd need to make sure they didn't
make other workloads worse. DBT-2 is very useful, but no one is going
to get a major change to the write logic in the database committed based
on one benchmark. Past changes like this have used both DBT-2 and a
large number of pgbench tests to get enough evidence of improvement to
commit. I can help with that part when you get to something I haven't
tried already. I am very interested in improving this area; it just
takes a lot of work to do it.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachments:
NOTPM-Checkpoints.png (image/png): graph of NOTPM versus total time between checkpoints