checkpointer continuous flushing
Hello pg-devs,
This patch is a simplified and generalized version of Andres Freund's
August 2014 patch for flushing while writing during checkpoints, with some
documentation and configuration warnings added.
For the initial patch, see:
/messages/by-id/20140827091922.GD21544@awork2.anarazel.de
For the whole thread:
/messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto
The objective is to help avoid PG stalling when fsyncing on checkpoints,
and in general to get better latency-bound performance.
Flushes are issued along with pg's throttled writes instead of waiting for
the checkpointer's final "fsync", which induces occasional stalls. From
"pgbench -P 1 ...", such stalls look like this:
progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043 # ok
progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828 # stalled
progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346 # ...
progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964 # restart
progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326 # ok
I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.
There are two implementations: the first, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below
ran on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...
The Linux specific "sync_file_range" approach was suggested among other ideas
by Theodore Ts'o on Robert Haas blog in March 2014:
http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html
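To give the idea in isolation before the patch itself, here is a minimal
sketch with my own hypothetical names (the HAVE_* macros are assumed to
come from configure, as in the attached patch): after each successful
write, the kernel is hinted to start writeback of just that range, instead
of letting dirty pages accumulate until the final fsync.

#define _GNU_SOURCE             /* for sync_file_range() on Linux */
#include <fcntl.h>
#include <unistd.h>

static ssize_t
write_with_flush_hint(int fd, const char *buf, size_t len, off_t offset)
{
    ssize_t written = pwrite(fd, buf, len, offset);

    if (written == (ssize_t) len)
    {
#if defined(HAVE_SYNC_FILE_RANGE)
        /* Linux: initiate asynchronous write-out of this range only */
        (void) sync_file_range(fd, offset, written, SYNC_FILE_RANGE_WRITE);
#elif defined(HAVE_POSIX_FADVISE)
        /* others: hint that the range is not needed in cache anymore,
         * which may also push it to the io layer on some systems */
        (void) posix_fadvise(fd, offset, written, POSIX_FADV_DONTNEED);
#endif
    }
    return written;
}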
Two GUC variables control whether the feature is activated for writes of
dirty pages issued by the checkpointer and the bgwriter. Given that the
settings may improve or degrade performance, having GUCs seems justified.
In particular, the stalling issue disappears with SSDs.
The effect is significant on a series of tests shown below with scale 10
pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw
RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and
checkpoint_timeout=30s, unless stated otherwise.
Note: I know that this checkpoint_timeout is too small for a normal
config, but the point is to test how checkpoints behave, so the test
triggers as many checkpoints as possible, hence the minimum timeout
setting. I have also done some tests with larger timeouts.
(1) THROTTLED PGBENCH
The objective of the patch is to be able to reduce the latency of transactions
under a moderate load. This first series of tests focuses on this point with
the help of pgbench -R (rate) and -L (skip/count late transactions).
The measure counts transactions which were skipped or beyond the expected
latency limit while targeting a transaction rate.
* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during
100 seconds, and latency limit is 100 ms), over 256 runs, 7 hours per case:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 6.5 %
off | on | 6.1 %
on | off | 0.4 %
on | on | 0.4 %
* Same as above (100 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10mn:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 1.3 %
off | on | 1.5 %
on | off | 0.6 %
on | on | 0.6 %
* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 8.0 %
off | on | 8.0 %
on | off | 0.4 %
on | on | 0.4 %
* Same as above (150 tps target) over one run of 4000 seconds with
shared_buffers=256MB and checkpoint_timeout=10mn:
flush | percent of skipped
cp | bgw | & out of latency limit transactions
off | off | 1.7 %
off | on | 1.9 %
on | off | 0.7 %
on | on | 0.6 %
Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.
bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because pg shared_buffers size is much larger than the
database, so the bgwriter is seldom active.
(2) FULL SPEED PGBENCH
This is not the target use case, but it seems necessary to assess the
impact of these options on tps figures and their variability.
* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.
flush | performance on ...
cp | bgw | 512 100-seconds runs | 1s intervals (over 51200 seconds)
off | off | 691 +- 36 tps | 691 +- 236 tps
off | on | 677 +- 29 tps | 677 +- 230 tps
on | off | 655 +- 23 tps | 655 +- 130 tps
on | on | 657 +- 22 tps | 657 +- 130 tps
On this first test, setting checkpoint_flush_to_disk reduces performance
by 5%, but the per-second standard deviation is nearly halved; that is,
the performance is more stable over the runs, although lower.
The effect of bgwriter_flush_to_disk is inconclusive.
* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with
checkpoint_timeout=10mn and shared_buffers=256MB (at least 6 checkpoints
during the run, probably more because segments are filled more often than
every 10mn):
flush | performance ... (stddev over per-second tps)
cp | bgw |
off | off | 877 +- 179 tps
off | on | 880 +- 183 tps
on | off | 896 +- 131 tps
on | on | 888 +- 132 tps
On this second test, a single long run, setting checkpoint_flush_to_disk
seems to slightly improve performance (maybe 2%?) and significantly
reduces variability, so it looks like a good move.
* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)
flush | performance on ...
cp | bgw | 32 100-seconds runs | 1s intervals (over 3200 seconds)
off | off | 1970 +- 60 tps | 1970 +- 783 tps
off | on | 1928 +- 61 tps | 1928 +- 813 tps
on | off | 1578 +- 45 tps | 1578 +- 631 tps
on | on | 1594 +- 47 tps | 1594 +- 618 tps
On this test, the average and the standard deviation are both reduced by
about 20%. This does not look like a win.
CONCLUSION
This approach is simple and significantly improves pg fsync behavior under
moderate load, where the database stays mostly responsive. Under full
load, the situation may be improved or degraded, depending on the
workload.
OTHER OPTIONS
Another idea suggested by Theodore Ts'o seems impractical: playing with
the Linux io-scheduler priority (ioprio_set) looks relevant only with the
"cfq" scheduler on actual hard disks; it does not work with other
schedulers, especially "deadline", which seems more advisable for pg, nor
with hardware RAID, which is a common setting.
Also, Theodore Ts'o suggested using "sync_file_range" to check whether
the writes have reached the disk, and possibly delaying the actual
fsync/checkpoint conclusion if not... I have not tried that: the
implementation is not as trivial, and I'm not sure what to do when the
completion target is approaching, but possibly that could be an
interesting option to investigate. Preliminary tests with a sleep added
between the writes and the final fsync did not yield very good results.
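For the record, one possible (untested) sketch of this suggestion, with
hypothetical names: instead of sleeping, wait for the writeback initiated
earlier with SYNC_FILE_RANGE_WRITE to drain before issuing the final
fsync, so the fsync itself has little left to do.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void
finish_checkpoint_segment(int fd)
{
    /* wait for in-flight write-out of the whole file; offset 0 with
     * nbytes 0 means "from offset to end of file" */
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);

    /* the remaining fsync mostly covers metadata and stragglers */
    (void) fsync(fd);
}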
I've also played with numerous other options (changing checkpointer
throttling parameters, reducing checkpoint timeout to 1 second, playing
around with various kernel settings), but that did not seem to be very
effective for the problem at hand.
I have also attached a test script I used, which can be adapted if someone
wants to collect some performance data. I also have some basic scripts to
extract and compute stats; ask if needed.
--
Fabien.
Attachment: checkpoint-continuous-flush-1.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549b7d..1c0a3a1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1818,6 +1818,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <variablelist>
+ <varlistentry id="guc-bgwriter-flush-to-disk" xreflabel="bgwriter_flush_to_disk">
+ <term><varname>bgwriter_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>bgwriter_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When the bgwriter writes data, hint the underlying OS that the data
+ must be sent to disk as soon as possible. This may help smooth
+ disk IO writes and avoid a stall when an fsync is issued by a
+ checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-bgwriter-lru-maxpages" xreflabel="bgwriter_lru_maxpages">
<term><varname>bgwriter_lru_maxpages</varname> (<type>integer</type>)
<indexterm>
@@ -2495,6 +2513,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk IO writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..cdbdca9 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,15 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it also has an adverse effect on the average transaction rate.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..2d5c873 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..3375032 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ae8c1ca 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..242af8f 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..3e19ebc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,9 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
+bool bgwriter_flush_to_disk = false;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +399,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +413,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1022,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1662,7 +1666,7 @@ BufferSync(int flags)
*/
if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1939,7 +1943,7 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state = SyncOneBuffer(next_to_clean, true, bgwriter_flush_to_disk);
if (++next_to_clean >= NBuffers)
{
@@ -2016,7 +2020,7 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2061,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2323,12 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk);
if (track_io_timing)
{
@@ -2830,6 +2838,7 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
+ false,
false);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +2873,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2916,7 +2925,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..156539d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,6 +208,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
+ false,
false);
/* Mark not-dirty now in case we error out below */
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..132cc43 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..717e772 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1345,7 +1345,7 @@ retry:
}
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk)
{
int returnCode;
@@ -1395,6 +1395,55 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else it is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer.
+ */
+ rc = posix_fadvise(VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT " in file \"%s\": %m",
+ (int64) VfdCache[file].seekPos / BLCKSZ,
+ VfdCache[file].fileName)));
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..5c50e19 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk)
{
off_t seekpos;
int nbytes;
@@ -767,7 +767,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..199695d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,10 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3c9f14..0b5ca17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -569,6 +570,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Checkpoints"),
/* WAL_ARCHIVING */
gettext_noop("Write-Ahead Log / Archiving"),
+ /* BGWRITER */
+ gettext_noop("Background Writer"),
/* REPLICATION */
gettext_noop("Replication"),
/* REPLICATION_SENDING */
@@ -1009,6 +1012,27 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
+ {"bgwriter_flush_to_disk", PGC_SIGHUP, BGWRITER,
+ gettext_noop("Hint that bgwriter's writes are high priority."),
+ NULL
+ },
+ &bgwriter_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -9761,6 +9785,22 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" or "
+ "\"bgwriter_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..4fea196 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,8 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern bool bgwriter_flush_to_disk;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..32ac80f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -70,7 +70,7 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..0bf0886 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,7 +95,7 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync, bool flush_to_disk);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -121,7 +121,7 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync, bool flush_to_disk);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..b69af2d 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -68,6 +68,7 @@ enum config_group
WAL_SETTINGS,
WAL_CHECKPOINTS,
WAL_ARCHIVING,
+ BGWRITER,
REPLICATION,
REPLICATION_SENDING,
REPLICATION_MASTER,
Hi Fabien,
On 2015-06-01 PM 08:40, Fabien COELHO wrote:
Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because pg shared_buffers size is much larger than the database,
so the bgwriter is seldom active.
Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?
Thanks,
Amit
Hello Amit,
> Not that the GUC naming is the most pressing issue here, but do you think
> "*_flush_on_write" describes what the patch does?
It is currently "*_flush_to_disk". In Andres Freund's version the name is
"sync_on_checkpoint_flush", but I did not find it very clear. Using
"*_flush_on_write" instead, as you suggest, would be fine as well: it
emphasizes the "when/how" it occurs instead of the final "destination",
why not...
About words: a checkpoint "write"s pages, but this really means passing
the pages to the memory manager, which will think about it... "flush"
seems to suggest a more effective write, but really it may mean the same:
the page is just passed to the OS. So "write/flush" is really "to OS" and
not "to disk". I like the data to be on "disk" in the end, and as soon as
possible, hence the choice to emphasize that point.
Now I would really be okay with anything that people find simple to
understand, so any opinion is welcome!
--
Fabien.
Hi,
It's nice to see the topic being picked up.
If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental. Did you
consider combining both approaches?
Greetings,
Andres Freund
On Mon, Jun 1, 2015 at 5:10 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hello pg-devs,
>
> This patch is a simplified and generalized version of Andres Freund's
> August 2014 patch for flushing while writing during checkpoints, with some
> documentation and configuration warnings added.
>
> For the initial patch, see:
> /messages/by-id/20140827091922.GD21544@awork2.anarazel.de
> For the whole thread:
> /messages/by-id/alpine.DEB.2.10.1408251900211.11151@sto
>
> The objective is to help avoid PG stalling when fsyncing on checkpoints,
> and in general to get better latency-bound performance.
> -FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
> +FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
> {
> XLogRecPtr recptr;
> ErrorContextCallback errcallback;
>
> @@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
> buf->tag.forkNum,
> buf->tag.blockNum,
> bufToWrite,
> - false);
> + false,
> + flush_to_disk);
Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?
I remember some time back there was some discussion regarding
sorting writes during checkpoint; one idea could be to try to
check this idea along with that patch. I just saw that Andres has
also given the same suggestion, which indicates that it is important
to see both things together.
Also, here another related point is that I think currently even fsync
requests are not in the order of the files as they are stored on disk, so
that also might cause random I/O?
Yet another idea could be to allow the BGWriter to also fsync the dirty
buffers; that may have the side effect of not being able to clear the
dirty pages at the speed required by the system, but I think if that
happens one can think of having multiple BGWriter tasks.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Andres,
> If I see correctly you picked up the version without sorting during
> checkpoints. I think that's not going to work - there'll be too many
> situations where the new behaviour will be detrimental. Did you
> consider combining both approaches?
Yes: I thought that it was a more complex patch with uncertain/less clear
benefits, and as this simpler version was already effective enough as it
was, I decided to start with that and try to have reasonable proof of
benefits so that it could get through.
--
Fabien.
Hello Amit,
> [...]
>
>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>> and in general to get better latency-bound performance.
>
> Won't this lead to more-unsorted writes (random I/O) as the
> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> per files or order of blocks on disk?
Yep, probably. Under "moderate load" this is not an issue: the
io-scheduler and other hd firmware will probably reorder writes anyway.
Also, if several data are updated together, they are likely to already be
neighbours in memory as well as on disk.
> I remember some time back there was some discussion regarding
> sorting writes during checkpoint; one idea could be to try to
> check this idea along with that patch. I just saw that Andres has
> also given the same suggestion, which indicates that it is important
> to see both things together.
I would rather separate them, unless this is a blocker. This version seems
already quite effective and very light. ISTM that adding a sort phase
would mean reworking significantly how the checkpointer processes pages.
> Also, here another related point is that I think currently even fsync
> requests are not in the order of the files as they are stored on disk,
> so that also might cause random I/O?
I think that currently the fsync is on the file handle, so what happens
depends on how fsync is implemented by the system.
> Yet another idea could be to allow the BGWriter to also fsync the dirty
> buffers,
ISTM that this is done with this patch with "bgwriter_flush_to_disk=on".
> that may have the side effect of not being able to clear the dirty pages
> at the speed required by the system, but I think if that happens one can
> think of having multiple BGWriter tasks.
--
Fabien.
On 2015-06-02 15:15:39 +0200, Fabien COELHO wrote:
>> Won't this lead to more-unsorted writes (random I/O) as the
>> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
>> per files or order of blocks on disk?
>
> Yep, probably. Under "moderate load" this is not an issue: the
> io-scheduler and other hd firmware will probably reorder writes anyway.
They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.
> Also, if several data are updated together, they are likely to already
> be neighbours in memory as well as on disk.
No, that's not how it'll happen outside of simplistic cases where you
start with an empty shared_buffers. Shared buffers are maintained by a
simplified LRU, so how often individual blocks are touched will define
the buffer replacement.
>> I remember some time back there was some discussion regarding
>> sorting writes during checkpoint; one idea could be to try to
>> check this idea along with that patch. I just saw that Andres has
>> also given the same suggestion, which indicates that it is important
>> to see both things together.
>
> I would rather separate them, unless this is a blocker.
I think it is a blocker.
> This version seems already quite effective and very light. ISTM that
> adding a sort phase would mean reworking significantly how the
> checkpointer processes pages.
Meh. The patch for that wasn't that big.
The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.
Greetings,
Andres Freund
Hello Andres,
>> I would rather separate them, unless this is a blocker.
>
> I think it is a blocker.
Hmmm. This is an argument...
>> This version seems already quite effective and very light. ISTM that
>> adding a sort phase would mean reworking significantly how the
>> checkpointer processes pages.
>
> Meh. The patch for that wasn't that big.
Hmmm. I think it should be implemented as Tom suggested, that is per
chunks of shared buffers, in order to avoid allocating a "large" amount
of memory.
> The problem with doing this separately is that without the sorting this
> will be slower for throughput in a good number of cases. So we'll have
> yet another GUC that's very hard to tune.
ISTM that the two aspects are orthogonal, which would suggest two GUCs
anyway.
--
Fabien.
On 2015-06-02 15:42:14 +0200, Fabien COELHO wrote:
>>> This version seems already quite effective and very light. ISTM that
>>> adding a sort phase would mean reworking significantly how the
>>> checkpointer processes pages.
>>
>> Meh. The patch for that wasn't that big.
>
> Hmmm. I think it should be implemented as Tom suggested, that is per
> chunks of shared buffers, in order to avoid allocating a "large" amount
> of memory.
I don't necessarily agree. But that's really just a minor implementation
detail. The actual problem is sorting & fsyncing in a way that deals
efficiently with tablespaces, i.e. doesn't write to tablespaces
one-by-one. Not impossible, but it requires some thought.
>> The problem with doing this separately is that without the sorting this
>> will be slower for throughput in a good number of cases. So we'll have
>> yet another GUC that's very hard to tune.
>
> ISTM that the two aspects are orthogonal, which would suggest two GUCs
> anyway.
They're pretty closely linked from their performance impact. IMO this
feature, if done correctly, should result in better performance in 95+%
of the workloads and be enabled by default. And that'll not be possible
without actually writing mostly sequentially.
It's also not just the sequential writes making this important; it's
also that it allows doing the final fsync() of the individual segments
as soon as their last buffer has been written out. That's important
because the file will then have accumulated fewer independently issued
writes (i.e. backends writing out dirty buffers), which would otherwise
make the final fsync more expensive.
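A rough sketch of what this could look like (hypothetical names, not
actual patch code; the write helper is assumed): with writes sorted by
file, each segment can be fsynced as soon as its last dirty buffer has
been written.

#include <unistd.h>

typedef struct SortedWrite
{
    int         fd;         /* segment file descriptor */
    unsigned    blocknum;   /* block within that segment */
} SortedWrite;

extern void write_one_buffer(int fd, unsigned blocknum);    /* hypothetical */

static void
write_and_fsync_per_segment(SortedWrite *w, size_t n)
{
    int         cur_fd = -1;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (cur_fd >= 0 && w[i].fd != cur_fd)
            (void) fsync(cur_fd);   /* previous segment is complete */
        write_one_buffer(w[i].fd, w[i].blocknum);
        cur_fd = w[i].fd;
    }
    if (cur_fd >= 0)
        (void) fsync(cur_fd);       /* final segment */
}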
It might be that we want two different GUCs, but I don't think we can
release without both features.
Greetings,
Andres Freund
>> Hmmm. I think it should be implemented as Tom suggested, that is per
>> chunks of shared buffers, in order to avoid allocating a "large" amount
>> of memory.
>
> I don't necessarily agree. But that's really just a minor implementation
> detail.
Probably.
> The actual problem is sorting & fsyncing in a way that deals efficiently
> with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> Not impossible, but it requires some thought.
Hmmm... I would have neglected this point in a first approximation, but I
agree that not interleaving tablespaces could indeed lose some
performance.
>> ISTM that the two aspects are orthogonal, which would suggest two GUCs
>> anyway.
>
> They're pretty closely linked from their performance impact.
Sure.
> IMO this feature, if done correctly, should result in better performance
> in 95+% of the workloads
To demonstrate that would require time...
> and be enabled by default.
I did not have such an ambition with the submitted patch :-)
> And that'll not be possible without actually writing mostly
> sequentially.
>
> It's also not just the sequential writes making this important, it's
> also that it allows to do the final fsync() of the individual segments
> as soon as their last buffer has been written out.
Hmmm... I'm not sure this would have a large impact. The writes are
throttled as much as possible, so fsync will catch plenty of other writes
anyway, if there are some.
--
Fabien.
On 2015-06-02 17:01:50 +0200, Fabien COELHO wrote:
>> The actual problem is sorting & fsyncing in a way that deals efficiently
>> with tablespaces, i.e. doesn't write to tablespaces one-by-one.
>> Not impossible, but it requires some thought.
>
> Hmmm... I would have neglected this point in a first approximation,
> but I agree that not interleaving tablespaces could indeed lose some
> performance.
I think it'll be a hard-to-diagnose performance regression. So we'll
have to fix it. That argument actually was the blocker in previous
attempts...
>> IMO this feature, if done correctly, should result in better performance
>> in 95+% of the workloads
>
> To demonstrate that would require time...
Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluating performance for those.
>> and be enabled by default.
>
> I did not have such an ambition with the submitted patch :-)
I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when completely unnecessarily.
>> And that'll not be possible without actually writing mostly sequentially.
>>
>> It's also not just the sequential writes making this important, it's also
>> that it allows to do the final fsync() of the individual segments as soon
>> as their last buffer has been written out.
>
> Hmmm... I'm not sure this would have a large impact. The writes are
> throttled as much as possible, so fsync will catch plenty of other writes
> anyway, if there are some.
That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.
Greetings,
Andres Freund
>>> IMO this feature, if done correctly, should result in better performance
>>> in 95+% of the workloads
>>
>> To demonstrate that would require time...
>
> Well, that's part of the contribution process. Obviously you can't test
> 100% of the problems, but you can work hard at coming up with very
> adversarial scenarios and evaluating performance for those.
I did spend time (well, a machine spent time, really) to collect some
convincing data for the simple version without sorting, to demonstrate
that it brings a clear value, which seems not to be enough...
> I don't think we want yet another tuning knob that's hard to tune
> because it's critical for one factor (latency) but bad for another
> (throughput); especially when completely unnecessarily.
Hmmm.
My opinion is that throughput is given too much attention in general, but
if both can be kept/improved, this would be easier to sell, obviously.
>>> It's also not just the sequential writes making this important, it's also
>>> that it allows to do the final fsync() of the individual segments as soon
>>> as their last buffer has been written out.
>>
>> Hmmm... I'm not sure this would have a large impact. The writes are
>> throttled as much as possible, so fsync will catch plenty of other writes
>> anyway, if there are some.
>
> That might be the case in a database with a single small table;
> i.e. where all the writes go to a single file. But as soon as you have
> large tables (i.e. many segments) or multiple tables, a significant part
> of the writes issued independently from checkpointing will be outside
> the processing of the individual segment.
Statistically, I think that it would reduce the number of unrelated writes
taken in a fsync by about half: the last table to be written on a
tablespace, at the end of the checkpoint, will have accumulated
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
time, while the first table will have avoided most of them.
--
Fabien.
On 2015-06-02 18:59:05 +0200, Fabien COELHO wrote:
>>>> IMO this feature, if done correctly, should result in better performance
>>>> in 95+% of the workloads
>>>
>>> To demonstrate that would require time...
>>
>> Well, that's part of the contribution process. Obviously you can't test
>> 100% of the problems, but you can work hard at coming up with very
>> adversarial scenarios and evaluating performance for those.
>
> I did spend time (well, a machine spent time, really) to collect some
> convincing data for the simple version without sorting, to demonstrate
> that it brings a clear value, which seems not to be enough...
"which seems not to be enough" - man. It's trivial to make things
faster/better/whatever if you don't care about regressions in other
parts. And if we'd add a guc for each of these cases we'd end up with
thousands of them.
> My opinion is that throughput is given too much attention in general, but
> if both can be kept/improved, this would be easier to sell, obviously.
Your priorities are not everyone's. That's life.
>> That might be the case in a database with a single small table;
>> i.e. where all the writes go to a single file. But as soon as you have
>> large tables (i.e. many segments) or multiple tables, a significant part
>> of the writes issued independently from checkpointing will be outside
>> the processing of the individual segment.
>
> Statistically, I think that it would reduce the number of unrelated writes
> taken in a fsync by about half: the last table to be written on a
> tablespace, at the end of the checkpoint, will have accumulated
> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
> time, while the first table will have avoided most of them.
That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.
Hi,
On 2015-06-02 PM 07:19, Fabien COELHO wrote:
>> Not that the GUC naming is the most pressing issue here, but do you think
>> "*_flush_on_write" describes what the patch does?
>
> It is currently "*_flush_to_disk". In Andres Freund's version the name is
> "sync_on_checkpoint_flush", but I did not find it very clear. Using
> "*_flush_on_write" instead, as you suggest, would be fine as well: it
> emphasizes the "when/how" it occurs instead of the final "destination",
> why not...
>
> About words: a checkpoint "write"s pages, but this really means passing
> the pages to the memory manager, which will think about it... "flush"
> seems to suggest a more effective write, but really it may mean the same:
> the page is just passed to the OS. So "write/flush" is really "to OS" and
> not "to disk". I like the data to be on "disk" in the end, and as soon as
> possible, hence the choice to emphasize that point.
>
> Now I would really be okay with anything that people find simple to
> understand, so any opinion is welcome!
It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries
to alleviate the problems caused by one-big-sync-at-the-end-of-writes by
doing the sync in step with the writes (which do abide by
checkpoint_completion_target). Given that impression, it seems
*_sync_on_write may even do the job.
Again, this is a minor issue.
By the way, I tend to agree with others here that a good balance needs to
be found such that this sync-blocks-one-at-a-time-in-random-order approach
does not hurt generalized workloads too much, although it does seem to
help with the latency problem that you set out to solve.
Thanks,
Amit
On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hello Amit,
>
> [...]
>
>>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>>> and in general to get better latency-bound performance.
>>
>> Won't this lead to more-unsorted writes (random I/O) as the
>> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
>> per files or order of blocks on disk?
>
> Yep, probably. Under "moderate load" this is not an issue: the
> io-scheduler and other hd firmware will probably reorder writes anyway.
> Also, if several data are updated together, they are likely to already
> be neighbours in memory as well as on disk.
>
>> I remember some time back there was some discussion regarding
>> sorting writes during checkpoint; one idea could be to try to
>> check this idea along with that patch. I just saw that Andres has
>> also given the same suggestion, which indicates that it is important
>> to see both things together.
>
> I would rather separate them, unless this is a blocker. This version
> seems already quite effective and very light. ISTM that adding a sort
> phase would mean reworking significantly how the checkpointer processes
> pages.
I agree with you that if we have to add a sort phase, there is additional
work, and that work could be significant depending on the design we
choose. However, without it this patch can have an impact on many kinds
of workloads; even in your mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
has shown a 20% degradation, which is quite significant, and the test
also seems representative of a workload which many users will run in the
real world.
Now one can say that for such workloads one should turn the new knob off,
but in reality it could be difficult to predict whether the load is
always moderate. I think users might be able to predict that at the table
level, but in spite of that I don't think having any such knob gives us a
ticket to flush the buffers in random order.
>> Also, here another related point is that I think currently even fsync
>> requests are not in the order of the files as they are stored on disk,
>> so that also might cause random I/O?
>
> I think that currently the fsync is on the file handle, so what happens
> depends on how fsync is implemented by the system.
That can also lead to random I/O if the fsyncs for different files are
not in the order in which the files are actually stored on disk.
>> Yet another idea could be to allow the BGWriter to also fsync the dirty
>> buffers,
>
> ISTM that this is done with this patch with "bgwriter_flush_to_disk=on".
I think the patch just issues an async operation, not the actual flush.
The reason I suggested this is that in your tests, when checkpoint_timeout
is small, there seems to be a good gain in performance; that means that if
we keep flushing dirty buffers at regular intervals, the system's
performance is good, and the BGWriter is the process where that can be
done conveniently apart from checkpoints. One might think that if the
same can be achieved by using a shorter checkpoint_timeout interval, then
why do these incremental flushes in the bgwriter; but in reality I think
the checkpoint is responsible for other things besides dirty buffers, so
we can't leave everything until a checkpoint happens.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>> That might be the case in a database with a single small table; i.e.
>>> where all the writes go to a single file. But as soon as you have
>>> large tables (i.e. many segments) or multiple tables, a significant
>>> part of the writes issued independently from checkpointing will be
>>> outside the processing of the individual segment.
>>
>> Statistically, I think that it would reduce the number of unrelated
>> writes taken in a fsync by about half: the last table to be written on a
>> tablespace, at the end of the checkpoint, will have accumulated
>> checkpoint-unrelated writes (bgwriter, whatever) from the whole
>> checkpoint time, while the first table will have avoided most of them.
>
> That's disregarding that a buffer written out by a backend starts to get
> written out by the kernel after ~5-30s, even without a fsync triggering
> it.
I meant my argument with "continuous flushing" activated, so there is no
up-to-30-seconds delay induced by the memory manager. Hmmm, maybe I did
not understand your argument.
--
Fabien.
Hello Amit,
>> It is currently "*_flush_to_disk". In Andres Freund's version the name is
>> "sync_on_checkpoint_flush", but I did not find it very clear. Using
>> "*_flush_on_write" instead, as you suggest, would be fine as well: it
>> emphasizes the "when/how" it occurs instead of the final "destination",
>> why not... [...]
>
> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If
> I understand this and the previous discussion(s) correctly, the patch
> tries to alleviate the problems caused by
> one-big-sync-at-the-end-of-writes by doing the sync in step with the
> writes (which do abide by checkpoint_completion_target). Given that
> impression, it seems *_sync_on_write may even do the job.
I disagree with this one, because the sync is only *initiated*, not done.
For this reason I think that "flush" is a better word. I understand
"sync" as "committed to disk". For the data to be synced, the call should
use the "wait after" option, which is a partial "fsync", but that would
be terrible for performance as all checkpointed pages would be written
one by one, without any opportunity for reordering them.
For what it's worth and for the record, the Linux sync_file_range
documentation says "This is an asynchronous flush-to-disk operation" to
describe the corresponding option. This is probably where I took it from.
So two contenders:
*_flush_to_disk
*_flush_on_write
--
Fabien.
> I agree with you that if we have to add a sort phase, there is additional
> work, and that work could be significant depending on the design we
> choose. However, without it this patch can have an impact on many kinds
> of workloads; even in your mail, one of the tests
> ("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
> has shown a 20% degradation, which is quite significant, and the test
> also seems representative of a workload which many users will run in the
> real world.
Yes, I do agree about the 4 clients case, but I doubt that many users run
their application at the maximum available throughput all the time (like
always driving with the foot to the floor). So for me throttled runs are
more representative of real life.
> Now one can say that for such workloads one should turn the new knob off,
> but in reality it could be difficult to predict whether the load is
> always moderate.
Hmmm. The switch says "I prefer stable (say latency-bounded) performance";
if you run a web site, you probably want that.
Anyway, I'll look at sorting when I have some time.
--
Fabien.
Fabien,
On 2015-06-03 PM 02:53, Fabien COELHO wrote:
It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by one-big-sync-at-the end-of-writes by doing
the sync in step with writes (which do abide by the
checkpoint_completion_target). Given that impression, it seems
*_sync_on_write may even do the job.I desagree with this one, because the sync is only *initiated*, not done. For
this reason I think that "flush" seems a better word. I understand "sync" as
"committed to disk". For the data to be synced, it should call with the "wait
after" option, which is a partial "fsync", but that would be terrible for
performance as all checkpointed pages would be written one by one, without any
opportunity for reordering them.
For what it's worth and for the record, Linux sync_file_range documentation
says "This is an asynchronous flush-to-disk operation" to describe the
corresponding option. This is probably where I took it.
Ah, okay! I didn't quite think about the async aspect here. But I sure do
hope that the added mechanism turns out to be *less* async than the kernel's
own dirty cache handling, to achieve the hoped-for gain.
So two contenders:
*_flush_to_disk
*_flush_on_write
Yep!
Regards,
Amit
Hello Andres,
They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.
* VERSION 2 "WORK IN PROGRESS".
The implementation is more a proof-of-concept meant to gather feedback than
clean code. What it does:
- as version 1 : simplified asynchronous flush based on Andres Freund
patch, with sync_file_range/posix_fadvise used to hint the OS that
the buffer must be sent to disk "now".
- added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki
but with a smaller and static buffer allocated once. Also,
sorting is done by chunks in the current version.
- also added: sync/advise calls are now merged when possible,
so fewer calls are used, especially when buffers are sorted,
but also if there are few files (see the sketch just below).
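As an illustration of this merging logic, here is a minimal standalone
sketch (simplified from the attached patch, with hypothetical names):
successive flush requests against the same file are coalesced into one
covering range, and a pending range is only issued when the file changes:

#include <sys/types.h>

typedef struct
{
	int		fd;			/* file the pending range belongs to */
	off_t	offset;		/* start of the pending range */
	off_t	nbytes;		/* length of the pending range */
	int		ncalls;		/* number of requests merged so far */
} FlushRange;

static void
merge_flush(FlushRange *r, int fd, off_t offset, off_t nbytes)
{
	if (r->ncalls > 0 && r->fd == fd)
	{
		/* same file: extend the pending range to cover both requests */
		off_t	end = r->offset + r->nbytes;
		off_t	new_end = offset + nbytes;

		if (offset < r->offset)
			r->offset = offset;
		if (new_end > end)
			end = new_end;
		r->nbytes = end - r->offset;
		r->ncalls++;
	}
	else
	{
		/* different file: the caller issues the pending range first,
		 * e.g. with sync_file_range, then starts accumulating anew */
		r->fd = fd;
		r->offset = offset;
		r->nbytes = nbytes;
		r->ncalls = 1;
	}
}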
* PERFORMANCE TESTS
Impacts on "pgbench -M prepared -N -P 1" scale 10 (simple update pgbench
with a mostly-write activity), with checkpoint_completion_target=0.8
and shared_buffers=1GB.
Contrary to v1, I have not tested bgwriter flushing, as its impact in
the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.
- 100 tps throttled max 100 ms latency over 6400 seconds
with checkpoint_timeout=30s
flush | sort | late transactions
off | off | 6.0 %
off | on | 6.1 %
on | off | 0.4 %
on | on | 0.4 % (93% improvement)
- 100 tps throttled max 100 ms latency over 4000 seconds
with checkpoint_timeout=10mn
flush | sort | late transactions
off | off | 1.5 %
off | on | 0.6 % (?!)
on | off | 0.8 %
on | on | 0.6 % (60% improvement)
- 150 tps throttled max 100 ms latency over 19600 seconds (5.5 hours)
with checkpoint_timeout=30s
flush | sort | late transactions
off | off | 8.5 %
off | on | 8.1 %
on | off | 0.5 %
on | on | 0.4 % (95% improvement)
- full speed pgbench over 6400 seconds with checkpoint_timeout=30s
flush | sort | tps performance over per second data
off | off | 676 +- 230
off | on | 683 +- 213
on | off | 712 +- 130
on | on | 725 +- 116 (7.2% avg/50% stddev improvements)
- full speed pgbench over 4000 seconds with checkpoint_timeout=10mn
flush | sort | tps performance over per second data
off | off | 885 +- 188
off | on | 940 +- 120 (6%/36%!)
on | off | 778 +- 245 (hmmm... not very consistent?)
on | on | 927 +- 108 (4.5% avg/43% stddev improvements)
- full speed pgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s
flush | sort | tps performance over per second data
off | off | 2012 +- 747
off | on | 2086 +- 708
on | off | 2099 +- 459
on | on | 2114 +- 422 (5% avg/44% stddev improvements)
* CONCLUSION :
For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).
Overall the effects of the two options are, not surprisingly, mostly
orthogonal (with exceptions):
- latency is essentially improved (60 to 95% reduction) by flushing
- throughput is improved (4 to 7% better) thanks to sorting
In detail, some loads may benefit more from only one option activated.
Also, on SSDs both options would probably have limited benefit.
Usual caveat: these are only benchmarks on one host at a particular time and
location, which may not be reproducible, nor representative as such of any
other load. The good news is that all these tests tell the same thing.
* LOOKING FOR THOUGHTS
- The bgwriter flushing option seems ineffective; should it be removed
from the patch?
- Move fsync as early as possible, as suggested by Andres Freund?
In these tests, when the flush option is activated, the fsync duration
at the end of the checkpoint is small: out of more than 5525 checkpoint
fsyncs, 0.5% are above 1 second when flush is on, but the figure rises
to 24% when it is off... This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.
My opinion is that this should be left out for the nonce.
- Take tablespaces into account, as pointed out by Andres Freund?
The issue is that if writes are sorted, they are no longer distributed
randomly over tablespaces, inducing lower performance on such systems.
How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as tablespaces, each one doing
its own independent throttling for a tablespace? For some obscure reason
there are 2 tablespaces by default (pg_global and pg_default), so that
would mean at least 2 threads.
Alternatively, maybe it can be done from one thread, but it would probably
involve some strange hocus-pocus to switch frequently between tablespaces.
--
Fabien.
Attachment: checkpoint-continuous-flush-2-WIP.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..2e6bb10 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1818,6 +1818,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <variablelist>
+ <varlistentry id="guc-bgwriter-flush-to-disk" xreflabel="bgwriter_flush_to_disk">
+ <term><varname>bgwriter_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>bgwriter_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When the bgwriter writes data, hint the underlying OS that the data
+ must be sent to disk as soon as possible. This may help smooth
+ disk IO writes and avoid a stall when an fsync is issued by a
+ checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-bgwriter-lru-maxpages" xreflabel="bgwriter_lru_maxpages">
<term><varname>bgwriter_lru_maxpages</varname> (<type>integer</type>)
<indexterm>
@@ -2495,6 +2513,23 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk IO writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..cdbdca9 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,15 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it also has an adverse effect on the average transaction rate.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..49ec258 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..d962c3a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -663,7 +663,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress, FileFlushContext * context)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -693,6 +693,14 @@ CheckpointWriteDelay(int flags, double progress)
CheckArchiveTimeout();
+ /* Before sleeping, sync written blocks
+ */
+ if (checkpoint_flush_to_disk && context->ncalls != 0)
+ {
+ PerformFileFlush(context);
+ ResetFileFlushContext(context);
+ }
+
/*
* Report interim activity statistics to the stats collector.
*/
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..b341bf7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,10 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
+bool bgwriter_flush_to_disk = false;
+int checkpoint_sort_size = 1024 * 1024;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +400,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +414,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1024,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1561,6 +1567,53 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Array of buffer ids of all buffers to checkpoint.
+ */
+static int * CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(*pa),
+ *b = GetBufferDescriptor(*pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode, dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, not that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* first, compare table space (hmmm...) */
+ if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
+ return -1;
+ else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
+ return 1;
+ /* same table space, compare relation */
+ else if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same file, try block number */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block... */
+ return 1;
+}
+
+static void AllocateCheckpointBufferIds(void)
+{
+ /* safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely.
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1575,10 +1628,17 @@ static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
+ int i;
int mask = BM_DIRTY;
+ FileFlushContext context;
+
+ ResetFileFlushContext(&context);
+
+ // lazy, to be really called by CheckpointerMain
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1622,6 +1682,7 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
}
@@ -1633,19 +1694,47 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Sort buffer ids by chunks to help find sequential writes.
+ * Note: buffers are not locked in anyway, but that does not matter,
+ * this sorting is really advisory, if some buffer changes status during
+ * this pass it will be filtered out later. The only necessary property
+ * is that marked buffers do not move elsewhere. Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
+ */
+ if (checkpoint_sort_size > 1)
+ {
+ int i;
+
+ // debug...
+ ereport(WARNING,
+ (errcode(ERRCODE_WARNING),
+ errmsg("Checkpoint: sorting %d buffers (%d chunks of size %d)",
+ num_to_write,
+ (checkpoint_sort_size+num_to_write-1) /
+ checkpoint_sort_size,
+ checkpoint_sort_size)));
+
+ // hmmm... should it equalize on the number of chunks?
+ for (i = 0; i < num_to_write; i += checkpoint_sort_size)
+ qsort(CheckpointBufferIds + i,
+ (i + checkpoint_sort_size <= num_to_write ?
+ checkpoint_sort_size : num_to_write - i),
+ sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers again, and write the ones (still) marked with
+ * BM_CHECKPOINT_NEEDED.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * TODO: do something clever about table spaces...
+ * scan them in parallel with multiple threads?
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
num_written = 0;
- while (num_to_scan-- > 0)
+ for (i = 0; i < num_to_write; i++)
{
+ int buf_id = CheckpointBufferIds[i];
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
/*
@@ -1662,38 +1751,31 @@ BufferSync(int flags)
*/
if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk, &context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write, &context);
}
}
-
- if (++buf_id >= NBuffers)
- buf_id = 0;
}
/*
+ * Loop over all buffers again, and write the ones (still) marked with
+ * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
+ * since we might as well dump soon-to-be-recycled buffers first.
+ *
+ * Note that we don't read the buffer alloc count here --- that should be
+ * left untouched till the next BgBufferSync() call.
+ */
+ /* OLD CODE REMOVED */
+
+ /*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
*/
@@ -1757,6 +1839,8 @@ BgBufferSync(void)
long new_strategy_delta;
uint32 new_recent_alloc;
+ FileFlushContext context;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -1935,11 +2019,13 @@ BgBufferSync(void)
num_to_scan = bufs_to_lap;
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ ResetFileFlushContext(&context);
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, bgwriter_flush_to_disk, &context);
if (++next_to_clean >= NBuffers)
{
@@ -1963,6 +2049,9 @@ BgBufferSync(void)
BgWriterStats.m_buf_written_clean += num_written;
+ PerformFileFlush(&context);
+ ResetFileFlushContext(&context);
+
#ifdef BGW_DEBUG
elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
recent_alloc, smoothed_alloc, strategy_delta, bufs_ahead,
@@ -2016,7 +2105,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2147,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2409,13 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2504,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -2830,7 +2926,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +2962,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2916,7 +3014,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..bb28aec 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,95 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else it is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer.
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1482,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3c9f14..c8706ba 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -569,6 +570,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Checkpoints"),
/* WAL_ARCHIVING */
gettext_noop("Write-Ahead Log / Archiving"),
+ /* BGWRITER */
+ gettext_noop("Background Writer"),
/* REPLICATION */
gettext_noop("Replication"),
/* REPLICATION_SENDING */
@@ -1009,6 +1012,27 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
+ {"bgwriter_flush_to_disk", PGC_SIGHUP, BGWRITER,
+ gettext_noop("Hint that bgwriter's writes are high priority."),
+ NULL
+ },
+ &bgwriter_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -2205,6 +2229,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpoint_sort_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Sort chunks of pages before writing them, ...."),
+ NULL
+ },
+ &checkpoint_sort_size,
+ 1024*1024, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -9761,6 +9795,22 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" or "
+ "\"bgwriter_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..c483832 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -29,7 +29,10 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct FileFlushContext;
+typedef struct FileFlushContext FileFlushContext;
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..2bf0cf8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,9 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern bool bgwriter_flush_to_disk;
+extern int checkpoint_sort_size;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..150c283 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..da0e929 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -94,8 +94,11 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
+struct FileFlushContext;
+typedef struct FileFlushContext FileFlushContext;
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +123,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..b69af2d 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -68,6 +68,7 @@ enum config_group
WAL_SETTINGS,
WAL_CHECKPOINTS,
WAL_ARCHIVING,
+ BGWRITER,
REPLICATION,
REPLICATION_SENDING,
REPLICATION_MASTER,
On 07/06/2015 16:53, Fabien COELHO wrote:
+    /* Others: say that data should not be kept in memory...
+     * This is not exactly what we want to say, because we want to write
+     * the data for durability but we may need it later nevertheless.
+     * It seems that Linux would free the memory *if* the data has
+     * already been written to disk, else it is ignored.
+     * For FreeBSD this may have the desired effect of moving the
+     * data to the io layer.
+     */
+    rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+                       POSIX_FADV_DONTNEED);
+
It looks a bit hazardous; do you have a benchmark for FreeBSD?
The sources say:
case POSIX_FADV_DONTNEED:
/*
* Flush any open FS buffers and then remove pages
* from the backing VM object. Using vinvalbuf() here
* is a bit heavy-handed as it flushes all buffers for
* the given vnode, not just the buffers covering the
* requested range.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: 24x7 Support - Development, Expertise and Training
Hello Cédric,
It looks a bit hazardous; do you have a benchmark for FreeBSD?
No, I just consulted the FreeBSD man page for posix_fadvise. If someone can
run tests on a non-Linux platform with HDDs, that would be nice.
Sources says:
case POSIX_FADV_DONTNEED:
/*
* Flush any open FS buffers and then remove pages
* from the backing VM object. Using vinvalbuf() here
* is a bit heavy-handed as it flushes all buffers for
* the given vnode, not just the buffers covering the
* requested range.
It is indeed heavy-handed, but it would probably trigger the expected
behavior, which is to start writing to disk, so I would expect to see
benefits similar to those of "sync_file_range" on Linux.
Buffer writes from bgwriter & checkpointer are throttled, which reduces
the potential impact of a "heavy-handed" approach in the kernel.
Now if on some platforms the behavior is absurd, obviously it would be
better to turn the feature off on those.
Note that this is already used by pg in "initdb", but the impact would
probably be very small anyway.
--
Fabien.
Hello,
Here is version 3, including many performance tests with various settings,
representing about 100 hours of pgbench runs. This patch aims at improving
checkpoint I/O behavior so that tps throughput is improved, late
transactions are less frequent, and overall performance is more stable.
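For reference, here is a minimal postgresql.conf excerpt to try both
features, using the GUC names and defaults defined in the attached patch
(a sketch to get started, not a tuning recommendation):

checkpoint_flush_to_disk = on       # hint the OS to write checkpoint pages early (default off)
checkpoint_sort_size = 131072       # pages per sorted chunk; 131072 * 8 kB = 1 GB; 0 disables sorting
checkpoint_completion_target = 0.8  # pre-existing setting, as used in the tests below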
* SOLILOQUIZING
- The bgwriter flushing option seems ineffective, it could be removed
from the patch?
I did that.
- Move fsync as early as possible, suggested by Andres Freund?
My opinion is that this should be left out for the nonce.
I did that.
- Take into account tablespaces, as pointed out by Andres Freund?
Alternatively, maybe it can be done from one thread, but it would probably
involve some strange hocus-pocus to switch frequently between tablespaces.
I did the hocus-pocus approach, including a quasi-proof (not sure what
this mathematical object is :-) in comments to show how/why it works.
* PATCH CONTENTS
- as version 1: simplified asynchronous flush based on Andres Freund
patch, with sync_file_range/posix_fadvise used to hint the OS that
the buffer must be sent to disk "now".
- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki but with a smaller and static buffer allocated once.
Also, sorting is done by chunks of 131072 pages in the current version,
with a guc to change this value.
- as version 2: sync/advise calls are now merged if possible,
so less calls will be used, especially when buffers are sorted,
but also if there are few files written.
- new: the checkpointer balances its page writes per tablespace.
This is done by choosing to write pages for a tablespace whose
progress ratio (written/to_write) is behind the overall progress
ratio over all tablespaces, and by doing that in a round-robin manner
so that all tablespaces regularly get some attention. No threads.
(A minimal sketch of this balancing idea follows this list.)
- new: some more documentation is added.
- removed: "bgwriter_flush_to_disk" is removed, as there was no clear
benefit on the (simple) tests. It could be considered for another patch.
- question: I'm not sure I understand the checkpointer memory management.
There is some exception handling in the checkpointer main loop. I wonder
whether the allocated memory would be lost in such an event and should
be reallocated. The patch currently assumes that the memory is kept.
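Along the lines of the hocus-pocus mentioned above, here is a minimal
standalone sketch (hypothetical names, not the exact patch code) of the
single-threaded balancing: each round, serve the next tablespace, in
round-robin order, whose own progress ratio has fallen behind the
overall one:

/* per-tablespace write progress; only tablespaces with dirty pages
 * to checkpoint (num_to_write > 0) are expected in the array */
typedef struct
{
	int		num_to_write;	/* dirty pages counted for this tablespace */
	int		num_written;	/* pages written out so far */
} SpcProgress;

/* return the index of the next tablespace to write a page for,
 * or -1 when every tablespace is done */
static int
next_tablespace(SpcProgress *spc, int nspc, int last,
				int total_written, int total_to_write)
{
	double	overall = (double) total_written / total_to_write;
	int		i;

	/* round-robin scan, starting after the tablespace served last */
	for (i = 1; i <= nspc; i++)
	{
		int		s = (last + i) % nspc;

		if (spc[s].num_written < spc[s].num_to_write &&
			(double) spc[s].num_written / spc[s].num_to_write <= overall)
			return s;	/* unfinished and at-or-behind overall progress */
	}
	return -1;
}

Unless every tablespace is finished, at least one unfinished tablespace
has a ratio at or below the overall average (which is a weighted average
of the individual ratios), so the scan always finds one; serving it
nudges its ratio back toward the overall progress.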
* PERFORMANCE TESTS
Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:
. tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
. small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
. medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
. large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s
Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".
Caveat: these are only benchmarks on one host at a particular time and
location, which may not be reproducible, nor representative as such of
any other load. The good news is that all these tests tell the same
thing.
- full-speed 1-client
options | tps performance over per second data
flush | sort | tiny | small | medium | large
off | off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
off | on | 699 +- 223 | 457 +- 315 | 479 +- 319 | 48.4 +- 28.8
on | off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
on | on | 722 +- 119 | 550 +- 140 | 549 +- 180 | 47.2 +- 16.8
- full speed 4-clients
options | tps performance over per second data
flush | sort | tiny | small | medium
off | off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
off | on | 2086 +- 673 | 819 +- 905 * | 807 +- 1029 *
on | off | 2212 +- 451 | 169 +- 1269 * | 160 +- 502 *
on | on | 2073 +- 437 | 743 +- 413 | 822 +- 467
- 100-tps 1-client max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 6.31 | 29.44 | 30.74
off | on | 6.23 | 8.93 | 7.12
on | off | 0.44 | 7.01 | 8.14
on | on | 0.59 | 0.83 | 1.84
- 200-tps 1-client max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 10.00 | 50.61 | 45.51
off | on | 8.82 | 12.75 | 12.89
on | off | 0.59 | 40.48 | 42.64
on | on | 0.53 | 1.76 | 2.59
- 400-tps 1-client (or 4 for medium) max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 12.0 | 64.28 | 68.6
off | on | 11.3 | 22.05 | 22.6
on | off | 1.1 | 67.93 | 67.9
on | on | 0.6 | 3.24 | 3.1
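For instance, the 400-tps latency-limited runs above correspond to an
invocation along these lines (a sketch; rate, duration and scale per the
settings listed above, database name left to the reader):

pgbench -M prepared -N -P 1 -R 400 -L 100 -T 4000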
* CONCLUSION :
For most of these HDD tests, when both options are activated the tps
throughput is improved (+3 to +300%), late transactions are reduced (by
91% to 97%) and overall the performance is more stable (tps standard
deviation is typically halved).
The effects of the two options are somewhat orthogonal:
- latency is essentially limited by flushing, although sorting also
contributes.
- throughput is mostly improved thanks to sorting, with some occasional
small positive or negative effect from flushing.
In detail, some loads may benefit more from only one option activated. In
particular, flushing may have a small adverse effect on throughput in some
conditions, although not always. With SSDs both options would probably
have limited benefit.
--
Fabien.
Attachment: checkpoint-continuous-flush-3.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..7a3d274 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2474,6 +2474,29 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort-size" xreflabel="checkpoint_sort_size">
+ <term><varname>checkpoint_sort_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort_size</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The number of pages in the chunks sorted together before being written
+ out to disk by a checkpoint.
+ For HDD storage, this setting allows grouping together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting can be skipped for SSD backends, as such storage has
+ good random write performance.
+ The default is <literal>131072</>.
+ This feature is turned off by setting the value to <literal>0</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
@@ -2495,6 +2518,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..2b6aab7 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,27 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for the data storage,
+ <xref linkend="guc-checkpoint-sort-size"> allows sorting chunks of pages
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages brings no benefit
+ as their random write I/O performance is good: this feature may then
+ be disabled by setting <varname>checkpoint_sort_size</> to <literal>0</>.
+ </para>
+
+ <para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput. This feature probably brings no benefit on SSD,
+ as the I/O write latency is small on such hardware, thus it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..49ec258 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..52dd7db 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -663,7 +663,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -698,6 +699,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..3ea1028 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,10 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
+/* by default, sort by chunks of 1 GB worth of 8 kB buffers */
+int checkpoint_sort_size = 128 * 1024;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +400,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +414,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1024,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1561,6 +1567,75 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Array of buffer ids of all buffers to checkpoint.
+ */
+static int * CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(*pa),
+ *b = GetBufferDescriptor(*pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, note that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ * same relation/fork, so same segmented "file"; compare block numbers,
+ * which are mapped to different segments depending on the number.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+static void AllocateCheckpointBufferIds(void)
+{
+ /* Safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely.
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ * - done: whether it is done
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+ bool done;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1575,10 +1650,21 @@ static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
+ int i;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, active_spaces, space;
+ FileFlushContext * spcContext = NULL;
+
+ /*
+ * Lazy allocation: this function is called through the checkpointer,
+ * but also by initdb. Maybe the allocation could be moved to the callers.
+ */
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1695,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1719,185 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status & flush context arrays */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+ spcStatus[index].done = false;
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Sort buffer ids by chunks to help find sequential writes.
+ * Note: buffers are not locked in any way, but that does not matter,
+ * this sorting is really advisory, if some buffer changes status during
+ * this pass it will be filtered out later. The only necessary property
+ * is that marked buffers do not move elsewhere. Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
+ */
+ if (checkpoint_sort_size > 1)
+ {
+ /* debug...
+ ereport(WARNING,
+ (errcode(ERRCODE_WARNING),
+ errmsg("Checkpoint: sorting %d buffers (%d chunks, size=%d)",
+ num_to_write,
+ (checkpoint_sort_size+num_to_write-1) /
+ checkpoint_sort_size,
+ checkpoint_sort_size)));
+ */
+
+ for (i = 0; i < num_to_write; i += checkpoint_sort_size)
+ qsort(CheckpointBufferIds + i,
+ (i + checkpoint_sort_size <= num_to_write ?
+ checkpoint_sort_size : num_to_write - i),
+ sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
+ /* debug
+ ereport(WARNING,
+ (errcode(ERRCODE_WARNING),
+ errmsg("Checkpoint: running on %d tablespaces",
+ nb_spaces)));
+ */
+
+ /*
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of active spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ int index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (spcStatus[space].done ||
+ /* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && bufHdr == NULL)
+ {
+ buf_id = CheckpointBufferIds[index];
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ bufHdr = NULL;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1911,49 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ spcStatus[space].done = true;
+ active_spaces--;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
@@ -1939,7 +2200,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2016,7 +2278,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2320,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2582,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2680,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -2830,7 +3102,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +3138,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2916,7 +3190,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 230c5cc..d71cef7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1009,6 +1010,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -2205,6 +2217,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpoint_sort_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Set the number of disk-page buffers sorted together on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort_size,
+ 128*1024, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -9760,6 +9782,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..630100d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort_size = 131072 # sort checkpoint buffers by chunks; 0 disables
+#checkpoint_flush_to_disk = off # send checkpoint buffers to disk
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..0534155 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,8 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern int checkpoint_sort_size;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
Hi,
On 2015-06-17 08:24:38 +0200, Fabien COELHO wrote:
Here is version 3, including many performance tests with various settings,
representing about 100 hours of pgbench run. This patch aims at improving
checkpoint I/O behavior so that tps throughput is improved, late
transactions are less frequent, and overall performances are more stable.
First off: This is pretty impressive stuff. Being at pgcon, I don't have
time to look into this in detail, but I do plan to comment more
extensively.
- Move fsync as early as possible, suggested by Andres Freund?
My opinion is that this should be left out for the nonce.
"for the nonce" - what does that mean?
I did that.
I'm doubtful that it's a good idea to separate this out, if you did.
- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki but with a smaller and static buffer allocated once.
Also, sorting is done by chunks of 131072 pages in the current version,
with a guc to change this value.
I think it's a really bad idea to do this in chunks. That'll mean we'll
frequently uselessly cause repetitive random IO, often interleaved. That
pattern is horrible for SSDs too. We should always try to do this at
once, and only fail back to using less memory if we couldn't allocate
everything.
* PERFORMANCE TESTS
Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:

. tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
. small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
. medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
. large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s
It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.
Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load. The good news is that all these tests tell
the same thing.

- full-speed 1-client
options | tps performance over per second data
flush | sort | tiny | small | medium | large
off | off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
off | on | 699 +- 223 | 457 +- 315 | 479 +- 319 | 48.4 +- 28.8
on | off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
on | on | 722 +- 119 | 550 +- 140 | 549 +- 180 | 47.2 +- 16.8

- full speed 4-clients
options | tps performance over per second data
flush | sort | tiny | small | medium
off | off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
off | on | 2086 +- 673 | 819 +- 905 * | 807 +- 1029 *
on | off | 2212 +- 451 | 169 +- 1269 * | 160 +- 502 *
on | on | 2073 +- 437 | 743 +- 413 | 822 +- 467

- 100-tps 1-client max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 6.31 | 29.44 | 30.74
off | on | 6.23 | 8.93 | 7.12
on | off | 0.44 | 7.01 | 8.14
on | on | 0.59 | 0.83 | 1.84

- 200-tps 1-client max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 10.00 | 50.61 | 45.51
off | on | 8.82 | 12.75 | 12.89
on | off | 0.59 | 40.48 | 42.64
on | on | 0.53 | 1.76 | 2.59

- 400-tps 1-client (or 4 for medium) max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 12.0 | 64.28 | 68.6
off | on | 11.3 | 22.05 | 22.6
on | off | 1.1 | 67.93 | 67.9
on | on | 0.6 | 3.24 | 3.1
So you've not run things at more serious concurrency, that'd be
interesting to see.
I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.
It's also curious that sorting is detrimental for full speed 'tiny'.
* CONCLUSION :
For most of these HDD tests, when both options are activated the tps
throughput is improved (+3 to +300%), late transactions are reduced (by 91%
to 97%) and overall the performance is more stable (tps standard deviation
is typically halved).

The option effects are somewhat orthogonal:

- latency is essentially limited by flushing, although sorting also
contributes.

- throughput is mostly improved thanks to sorting, with some occasional
small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In
particular, flushing may have a small adverse effect on throughput in some
conditions, although not always.
With SSD both options would probably have limited benefit.
I doubt that. Small random writes have bad consequences for wear
leveling. You might not notice that with a short test - again, I doubt
it - but it'll definitely become visible over time.
Greetings,
Andres Freund
Hello Andres,
- Move fsync as early as possible, suggested by Andres Freund?
My opinion is that this should be left out for the nonce.
"for the nonce" - what does that mean?
Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ...
{for the nonce}, i. e. for the present time.
I'm doubtful that it's a good idea to separate this out, if you did.
Actually I did, because as explained in another mail the fsync time when
the other options are activated as reported in the logs is essentially
null, so it would not bring significant improvements on these runs,
and also the patch changes enough things as it is.
So this is an evidence-based decision.
I also agree that it seems interesting on principle and should be
beneficial in some case, but I would rather keep that on a TODO list
together with trying to do better things in the bgwriter and try to focus
on the current proposal which already changes significantly the
checkpointer throttling logic.
- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki but with a smaller and static buffer allocated once.
Also, sorting is done by chunks of 131072 pages in the current version,
with a guc to change this value.

I think it's a really bad idea to do this in chunks.
The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control on that seems a good idea.
Another argument is that Tom said he wanted that:-)
In practice the value can be set at a high value so that it is nearly
always sorted in one go. Maybe value "0" could be made special and used to
trigger this behavior systematically, and be the default.
That'll mean we'll frequently uselessly cause repetitive random IO,
This is not an issue if the chunks are large enough, and anyway the guc
allows to change the behavior as desired. As I said, keeping some control
seems a good idea, and the "full sorting" can be made the default
behavior.
often interleaved. That pattern is horrible for SSDs too. We should
always try to do this at once, and only fail back to using less memory
if we couldn't allocate everything.
The memory is needed anyway in order to avoid a double or significantly
more heavy implementation for the throttling loop. It is allocated once on
the first checkpoint. The allocation could be moved to the checkpointer
initialization if this is a concern. The memory needed is one int per
buffer, which is smaller than the 2007 patch.
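For scale, a back-of-the-envelope estimate (my arithmetic, not a
measurement): shared_buffers=1GB means 131072 8k-buffers, so the buf_id
array costs 131072 * 4 bytes = 512 kB; even shared_buffers=2TB would need
about 268 million * 4 bytes ~ 1 GB, still small relative to the buffers
themselves.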
. tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.
The point of tiny was to trigger a lot of checkpoints. The size is pretty
ridiculous anyway, as "tiny" implies. I think I did some tests on other
versions of the patch and longer checkpoint_timeout on pretty small
database that showed smaller benefit from the options, as one would
expect. I'll try to re-run some.
So you've not run things at more serious concurrency, that'd be
interesting to see.
I do not have a box available for "serious concurrency".
I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.
It's also curious that sorting is detrimental for full speed 'tiny'.
Yep.
With SSD both options would probably have limited benefit.
I doubt that. Small random writes have bad consequences for wear
leveling. You might not notice that with a short test - again, I doubt
it - but it'll definitely become visible over time.
Possibly. Testing such effects does not seem easy, though. At least I have
not seen "write stalls" on SSD, which is my primary concern.
--
Fabien.
Hi,
On 2015-06-20 08:57:57 +0200, Fabien COELHO wrote:
Actually I did, because as explained in another mail the fsync time when the
other options are activated as reported in the logs is essentially null, so
it would not bring significant improvements on these runs,
and also the patch changes enough things as it is.

So this is an evidence-based decision.
Meh. You're testing on low concurrency.
- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki but with a smaller and static buffer allocated once.
Also, sorting is done by chunks of 131072 pages in the current version,
with a guc to change this value.

I think it's a really bad idea to do this in chunks.
The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control on that seems a good idea.
If the sorting of the dirty blocks alone takes minutes, it'll never
finish writing that many buffers out. That's an utterly bogus argument.
Another argument is that Tom said he wanted that:-)
I don't think he said that when we discussed this last.
In practice the value can be set at a high value so that it is nearly always
sorted in one go. Maybe value "0" could be made special and used to trigger
this behavior systematically, and be the default.
You're just making things too complicated.
That'll mean we'll frequently uselessly cause repetitive random IO,
This is not an issue if the chunks are large enough, and anyway the guc
allows to change the behavior as desired.
I don't think this is true. If two consecutive blocks are dirty, but you
sync them in two different chunks, you *always* will cause additional
random IO. Either the drive will have to skip the write for that block,
or the os will prefetch the data. More importantly with SSDs it voids
the wear leveling advantages.
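To make the chunk-boundary effect concrete with a made-up example: with a
chunk size of 4 and dirty blocks [9,2,7,4 | 3,8,1,6] of the same file,
per-chunk sorting issues writes 2,4,7,9 and then 1,3,6,8, i.e. two
interleaved passes over the file, whereas a single full sort issues the
one sequential pass 1,2,3,4,6,7,8,9.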
often interleaved. That pattern is horrible for SSDs too. We should always
try to do this at once, and only fail back to using less memory if we
couldn't allocate everything.

The memory is needed anyway in order to avoid a double or significantly
heavy implementation for the throttling loop. It is allocated once on the
first checkpoint. The allocation could be moved to the checkpointer
initialization if this is a concern. The memory needed is one int per
buffer, which is smaller than the 2007 patch.
There's a reason the 2007 patch (and my revision of it last year) did
what it did. You can't just access buffer descriptors without
locking. Besides, causing additional cacheline bouncing during the
sorting process is a bad idea.
Greetings,
Andres Freund
On 6/20/15 2:57 AM, Fabien COELHO wrote:
- as version 2: checkpoint buffer sorting based on a 2007 patch by
Takahiro Itagaki but with a smaller and static buffer allocated once.
Also, sorting is done by chunks of 131072 pages in the current
version,
with a guc to change this value.

I think it's a really bad idea to do this in chunks.
The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control on that seems a good idea.
ISTM a more elegant way to handle that would be to start off with a very
small number of buffers and sort larger and larger lists while the OS is
busy writing/syncing.
Another argument is that Tom said he wanted that:-)
Did he elaborate why? I don't see him on this thread (though I don't
have all of it).
In practice the value can be set at a high value so that it is nearly
always sorted in one go. Maybe value "0" could be made special and used
to trigger this behavior systematically, and be the default.
It'd be nice if it was just self-tuning, with no GUC.
It looks like it'd be much better to get this committed without more
than we have now than to do without it though...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
Hello Andres,
So this is an evidence-based decision.
Meh. You're testing on low concurrency.
Well, I'm just testing on the available box.
I do not see the link between high concurrency and whether moving fsync as
early as possible would have a large performance impact. I think it might
be interesting if bgwriter is doing a lot of writes, but I'm not sure
under which configuration & load that would be.
I think it's a really bad idea to do this in chunks.
The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control on that seems a good idea.

If the sorting of the dirty blocks alone takes minutes, it'll never
finish writing that many buffers out. That's an utterly bogus argument.
Well, if in the future you have 8 TB of memory (I saw a 512GB-memory
server a few weeks ago), set shared_buffers=2TB, then if I'm not mistaken
in the worst case you may have 256 million 8k buffers to checkpoint. Then
it really depends on the I/O backend used by the box, but if you
bought 8 TB of RAM you would probably have nice I/O hardware attached.
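A back-of-the-envelope check of that worst case (my assumptions: qsort
doing about n log2 n comparisons at a rough 100 million comparisons per
second): 256 million buffers gives about 256M * 28 ~ 7 billion
comparisons, i.e. on the order of a minute for the sort alone, before any
write is issued.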
Another argument is that Tom said he wanted that:-)
I don't think he said that when we discussed this last.
That is what I was recalling when I wrote this sentence:
/messages/by-id/6599.1409421040@sss.pgh.pa.us
But it had more to do with memory-allocation management.
In practice the value can be set at a high value so that it is nearly always
sorted in one go. Maybe value "0" could be made special and used to trigger
this behavior systematically, and be the default.

You're just making things too complicated.
ISTM that it is not really complicated, but anyway it is easy to change
the checkpoint_sort stuff to a boolean.
In the reported performance tests, there is usually just one chunk anyway,
sometimes two, so this gives an idea of the overall performance effect.
This is not an issue if the chunks are large enough, and anyway the guc
allows to change the behavior as desired.

I don't think this is true. If two consecutive blocks are dirty, but you
sync them in two different chunks, you *always* will cause additional
random IO.
I think that it could be a small number if the chunks are large, i.e. the
performance benefit of sorting larger and larger chunks is decreasing.
Either the drive will have to skip the write for that block,
or the os will prefetch the data. More importantly with SSDs it voids
the wear leveling advantages.
Possibly. I do not understand wear leveling done by SSD firmware.
often interleaved. That pattern is horrible for SSDs too. We should always
try to do this at once, and only fail back to using less memory if we
couldn't allocate everything.

The memory is needed anyway in order to avoid a double or significantly more
heavy implementation for the throttling loop. It is allocated once on the
first checkpoint. The allocation could be moved to the checkpointer
initialization if this is a concern. The memory needed is one int per
buffer, which is smaller than the 2007 patch.

There's a reason the 2007 patch (and my revision of it last year) did
what it did. You can't just access buffer descriptors without
locking.
I really think that you can because the sorting is really "advisory", i.e.
the checkpointer will work fine if the sorting is wrong or not done at
all, as it is now, when the checkpointer writes buffers. The only
condition is that buffers must not be moved while they carry their "to
write in this checkpoint" flag, but this is also necessary for the current
checkpointer stuff to work.
Moreover, this trick already pre-exists in the patch I submitted:
some tests are done without locking, but the actual "buffer write" does
the locking and would skip it if the previous test was wrong, as described
in comments in the code.
Besides, causing additional cacheline bouncing during the
sorting process is a bad idea.
Hmmm. The impact would be to multiply the memory required by 3 or 4
(buf_id, relation, forknum, offset), instead of just buf_id, and I
understood that memory was a concern.
Moreover, once the sort process gets the lines which contain the sorting
data from the buffer descriptor in its cache, I think that it should be
pretty much okay. Incidentally, they would probably have been brought to
cache by the scan to collect them. Also, I do not think that the sorting
time for 128000 buffers, and possible cache misses, was a big issue, but I
do not have a measure to defend that. I could try to collect some data
about that.
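For scale (my estimate): sorting on buf_id alone costs 4 bytes per entry
plus the descriptor cache lines it references, while carrying a copy of
the full sort key (spcNode, relNode, forkNum, blockNum) would be roughly
16-20 bytes per entry; for 131072 buffers that is 512 kB versus about
2-2.5 MB, which is the "multiply by 3 or 4" above.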
--
Fabien.
Hello Jim,
The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control on that seems a good idea.

ISTM a more elegant way to handle that would be to start off with a very
small number of buffers and sort larger and larger lists while the OS is busy
writing/syncing.
You really have to have done a significant part/most/all of sorting before
starting to write.
Another argument is that Tom said he wanted that:-)
Did he elaborate why? I don't see him on this thread (though I don't have all
of it).
/messages/by-id/6599.1409421040@sss.pgh.pa.us
But it has more to do with memory management.
In practice the value can be set at a high value so that it is nearly
always sorted in one go. Maybe value "0" could be made special and used
to trigger this behavior systematically, and be the default.

It'd be nice if it was just self-tuning, with no GUC.
Hmmm. It can easily be turned into a boolean, but otherwise I have no
clue about how to decide whether to sort and/or flush.
It looks like it'd be much better to get this committed without more than we
have now than to do without it though...
Yep, I think the figures are definitely encouraging.
--
Fabien.
It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.
Here are some tests with longer timeout:
tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
max_wal_size=1GB warmup=600 time=4000
flsh | full speed tps | percent of late tx, 4 clients, for tps:
/srt | 1 client | 4 clients | 100 | 200 | 400 | 800 | 1200 | 1600
N/N | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
N/Y | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
Y/N | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
Y/Y | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13
There seems to be a small 1-2% performance benefit with 4 clients; this
is reversed for 1 client. There are significantly and consistently fewer
late transactions when options are activated, and the performance is more
stable
(standard deviation reduced by 10-18%).
The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times
over in 5 minutes, so the checkpoint basically writes everything over 220
seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less
in order in memory, so would be written out in order.
medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
max_wal_size=4GB warmup=1200 time=7500
flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |
The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by 92-95% when
both options are activated.
At 550 tps checkpoints are xlog-triggered and write about 1/3 of the
database (170000 buffers to write every 220-260 seconds, 4 MB/s).
--
Fabien.
On Mon, Jun 22, 2015 at 1:41 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
<sorry, resent stalled post, wrong from>
It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.

Here are some tests with longer timeout:

tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
max_wal_size=1GB warmup=600 time=4000

flsh | full speed tps | percent of late tx, 4 clients, for tps:
/srt | 1 client | 4 clients | 100 | 200 | 400 | 800 | 1200 | 1600
N/N | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
N/Y | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
Y/N | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
Y/Y | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13

There seems to be a small 1-2% performance benefit with 4 clients; this
is reversed for 1 client. There are significantly and consistently fewer
late transactions when options are activated, and the performance is more
stable (standard deviation reduced by 10-18%).

The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times
over in 5 minutes, so the checkpoint basically writes everything in 220
seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less
in order in memory, so may be written out in order anyway.

medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
max_wal_size=4GB warmup=1200 time=7500

flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by 92-95%
when both options are activated.

Why is there a dip in performance with multiple clients? Can it be
due to the fact that we now do more work while holding the bufhdr
lock in the code below?
BufferSync()
{
..
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1719,185 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
..
}
-
BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}
The changed code doesn't seem to give any consideration to the
clock-sweep point, which might not be helpful for cases when the checkpoint
could have flushed soon-to-be-recycled buffers. I think flushing the
sorted buffers w.r.t. tablespaces is a good idea, but not giving any
preference to the clock-sweep point seems to me that we would lose in
some cases by this new change.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
max_wal_size=4GB warmup=1200 time=7500

flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by 92-95%
when both options are activated.

Why is there a dip in performance with multiple clients,

I'm not sure I see the "dip". The performance is better with 4 clients
compared to 1 client?
Can it be due to the fact that we now do more work while holding the
bufhdr lock in the code below?
I think it is very unlikely that the buffer being locked would be
simultaneously requested by one of the 4 clients for an UPDATE, so I do
not think it should have a significant impact.
BufferSync() [...]
BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seem to give any consideration to the
clock-sweep point
Indeed.
which might not be helpful for cases when the checkpoint could have flushed
soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t.
tablespaces is a good idea, but not giving any preference to the clock-sweep
point seems to me that we would lose in some cases by this new change.
I do not see how to do both, as these two orders seem more or less
unrelated? The traditional assumption is that the I/O is very slow
and is to be optimized first, so going for buffer ordering to be
the disk looks like the priority.
--
Fabien.
I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.
It helps. I've done a few runs, where the very-very-bad situation is
improved to... I would say very-bad:
medium3: scale=200 shared_buffers=4GB checkpoint_timeout=15min
max_wal_size=4GB warmup=1200 time=6000 clients=4
synchronous_commit=off
flush sort | tps | percent of seconds offline
off off | 296 | 83% offline
off on | 1496 | 33% offline
on  off | 1641 | 59% offline
on on | 1515 | 31% offline
The offline figure is the percentage of seconds in the 6000 seconds run
where 0.0 tps are reported, or where nothing is reported because pgbench
is stuck.
It is somewhat better... on an abysmal scale: sorting and flushing
reduced the offline time by a factor of 2.6. Too bad it is so high to
begin with. The tps is improved by a factor of 5 with either option.
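As an aside, this offline figure can be computed mechanically from the
"pgbench -P 1" trace. A rough standalone sketch (assuming the usual
"progress: <t> s, <tps> tps, ..." line format; it only sees lines that
pgbench actually printed, so it gives a lower bound on the offline time):

#include <stdio.h>

int main(void)
{
    char line[1024];
    long total = 0, offline = 0;
    double t, tps;

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        /* expected format: "progress: 35.0 s, 615.9 tps, ..." */
        if (sscanf(line, "progress: %lf s, %lf tps", &t, &tps) == 2)
        {
            total++;
            if (tps == 0.0)
                offline++;
        }
    }
    if (total > 0)
        printf("%ld/%ld seconds offline (%.1f%%)\n",
               offline, total, 100.0 * offline / total);
    return 0;
}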
--
Fabien.
On 6/22/15 11:59 PM, Fabien COELHO wrote:
which might not be helpful for cases when the checkpoint could have
flushed soon-to-be-recycled buffers. I think flushing the sorted buffers
w.r.t. tablespaces is a good idea, but giving no preference to the
clock-sweep point seems to me to mean that we would lose in some cases
with this new change.

I do not see how to do both, as these two orders seem more or less
unrelated? The traditional assumption is that I/O is very slow and is to
be optimized first, so going for buffer ordering to be nice to the disk
looks like the priority.
The point is that it's already expensive for backends to advance the
clock; if they then have to wait on IO as well it gets REALLY expensive.
So we want to avoid that.
Other than that though, it is pretty orthogonal, so perhaps another
indication that the clock should be handled separately from both
backends and bgwriter...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jun 23, 2015 at 10:29 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
max_wal_size=4GB warmup=1200 time=7500

flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by
92-95% when both options are activated.

Why is there a dip in performance with multiple clients?

I'm not sure I see the "dip". Performance is better with 4 clients
than with 1 client?

What do you mean by "negative (-8% with 4 clients) effect on throughput"
in the above sentence? I thought you meant that there is a dip in TPS
with the patch as compared to HEAD at 4 clients.
Also, I am not completely sure what "+-" means in your data above?
Could it be because we now do more work while holding the bufhdr lock in
the code below?

I think it is very unlikely that the buffer being locked would be
simultaneously requested by one of the 4 clients for an UPDATE, so I do
not think it should have a significant impact.

BufferSync() [...]

BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seem to give any consideration to the
clock-sweep point

Indeed.
which might not be helpful for cases when the checkpoint could have
flushed soon-to-be-recycled buffers. I think flushing the sorted buffers
w.r.t. tablespaces is a good idea, but giving no preference to the
clock-sweep point seems to me to mean that we would lose in some cases
with this new change.

I do not see how to do both, as these two orders seem more or less
unrelated?
I understand your point and I don't have any specific answer for it at
this moment; the worry is that it should not lead to degradation of
certain cases as compared to the current algorithm. The workload where
it could have an effect is when your data doesn't fit in shared buffers,
but can fit in RAM.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by
92-95% when both options are activated.

Why is there a dip in performance with multiple clients?

I'm not sure I see the "dip". Performance is better with 4 clients
than with 1 client?

What do you mean by "negative (-8% with 4 clients) effect on throughput"
in the above sentence? I thought you meant that there is a dip in TPS
with the patch as compared to HEAD at 4 clients.
Ok, I misunderstood your question. I thought you meant a dip between 1
client and 4 clients. I meant that when flush is turned on tps goes down
by 8% (743 to 681 tps) on this particular run. Basically tps
improvements mostly come from "sort", and "flush" has uncertain effects
on tps (throughput), but a much clearer effect on latency and
performance stability (lower late rate, lower standard deviation).
Note that I'm not comparing to HEAD in the above tests, but with the new
options deactivated, which should be more or less comparable to current
HEAD, i.e. there is no sorting nor flushing done, but this is not
strictly speaking HEAD behavior. Probably I should get some figures with
HEAD as well to check the "more or less" assumption.
Also, I am not completely sure what "+-" means in your data above?
The first figure before "+-" is the tps, the second after is its
standard deviation computed from per-second traces. Some runs are very
bad, with pgbench stuck at times, and result in a stddev larger than the
average; they are noted with "*".
I understand your point and I don't have any specific answer for it at
this moment; the worry is that it should not lead to degradation of
certain cases as compared to the current algorithm. The workload where
it could have an effect is when your data doesn't fit in shared buffers,
but can fit in RAM.
Hmmm. My point of view is still that the logical priority is to optimize
for disk IO first, then look for compatible RAM optimisations later.
I can run tests with a small shared_buffers, but it would probably just
trigger a lot of checkpoints, or worse, rely on the bgwriter to find
space, which would generate random IOs.
--
Fabien.
I do not see how to do both, as these two orders seem more or less
unrelated? The traditional assumption is that I/O is very slow and is to
be optimized first, so going for buffer ordering to be nice to the disk
looks like the priority.

The point is that it's already expensive for backends to advance the
clock; if they then have to wait on IO as well it gets REALLY expensive.
So we want to avoid that.
I do not know what this clock stuff does. Note that the checkpoint buffer
scan is done once at the beginning of the checkpoint and its time is
relatively small compared to everything else in the checkpoint.
If this scan is an issue, it can be done in reverse order, or in some
other order, but I think it is better to do it in order for better cache
behavior, although the effect should be marginal.
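To make that concrete, here is a purely illustrative sketch, not part of
the patch: the collection scan could itself start at the clock-sweep
position rather than at buffer 0 and still visit every buffer exactly
once, which is essentially what the pre-patch write loop did. Here
nbuffers, sweep_start and visit are stand-ins for NBuffers,
StrategySyncStart(NULL, NULL) and the dirty-buffer collection done in
BufferSync:

/* visit all buffers once, starting at the clock hand (illustration) */
static void
scan_from_hand(int nbuffers, int sweep_start, void (*visit)(int buf_id))
{
    int i;

    for (i = 0; i < nbuffers; i++)
        visit((sweep_start + i) % nbuffers);
}

Since the collected ids are sorted afterwards anyway, this would not
change the write order; it only shows that the scan order itself is
cheap to change.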
--
Fabien.
Besides, causing additional cacheline bouncing during the sorting
process is a bad idea.

Hmmm. The impact would be to multiply the memory required by 3 or 4
(buf_id, relation, forknum, offset), instead of just buf_id, and I
understood that memory was a concern.

Moreover, once the sort process gets the cache lines which contain the
sorting data from the buffer descriptors, I think that it should be
pretty much okay. Incidentally, they would probably have been brought
into cache by the scan that collects them. Also, I do not think that the
sorting time for 128000 buffers, and possible cache misses, was a big
issue, but I do not have a measure to defend that. I could try to
collect some data about that.
I've collected some data by adding a "sort time" measure, with
checkpoint_sort_size=10000000 so that sorting is in one chunk, and done
some large checkpoints:
LOG: checkpoint complete: wrote 41091 buffers (6.3%);
0 transaction log file(s) added, 0 removed, 0 recycled;
sort=0.024 s, write=0.488 s, sync=8.790 s, total=9.837 s;
sync files=41, longest=8.717 s, average=0.214 s;
distance=404972 kB, estimate=404972 kB
LOG: checkpoint complete: wrote 212124 buffers (32.4%);
0 transaction log file(s) added, 0 removed, 0 recycled;
sort=0.078 s, write=128.885 s, sync=1.269 s, total=131.646 s;
sync files=43, longest=1.155 s, average=0.029 s;
distance=2102950 kB, estimate=2102950 kB
LOG: checkpoint complete: wrote 384427 buffers (36.7%);
0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.120 s, write=83.995 s, sync=13.944 s, total=98.035 s;
sync files=9, longest=13.724 s, average=1.549 s;
distance=3783305 kB, estimate=3783305 kB
LOG: checkpoint complete: wrote 809211 buffers (77.2%);
0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.358 s, write=138.146 s, sync=14.943 s, total=153.124 s;
sync files=13, longest=14.871 s, average=1.149 s;
distance=8075338 kB, estimate=8075338 kB
Summary of these checkpoints:
#buffers    size    sort time (s)
41091 328MB 0.024
212124 1.7GB 0.078
384427 2.9GB 0.120
809211 6.2GB 0.358
Sort times are pretty negligible compared to the whole checkpoint time,
and under 0.1 s per GB of buffers sorted.
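As a cross-check of that bound: the largest checkpoint above sorted
809211 buffers, i.e. about 809211 * 8 kB ~= 6.2 GB, in 0.358 s, which is
roughly 0.06 s per GB.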
On a 512 GB server with shared_buffers=128GB (25%), this suggests a
worst-case checkpoint sort of a few seconds, and then you have a hundred
GB to write anyway. If we project a 1 TB checkpoint for the next decade,
that would put sorting at under a minute... but then you have 1 TB of
data to dump.
As a comparison point, I've done the large checkpoint with the default
sort size of 131072:
LOG: checkpoint complete: wrote 809211 buffers (77.2%);
0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.251 s, write=152.377 s, sync=15.062 s, total=167.453 s;
sync files=13, longest=14.974 s, average=1.158 s;
distance=8075338 kB, estimate=8075338 kB
The 0.251 sort time is to be compared to 0.358. Well, n.log(n) is not too
bad, as expected.
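For a rough check of that expectation: sorting k chunks of c buffers
each costs about k * c * log2(c) comparisons versus n * log2(n) for one
big sort (n = k * c), so the predicted ratio here is log2(131072) /
log2(809211) = 17 / 19.6 ~= 0.87, while the measured ratio is 0.251 /
0.358 ~= 0.70; better cache behavior on the smaller chunks presumably
explains the chunked sort beating the simple model.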
These figures suggest that sorting time and associated cache misses are
not a significant issue and thus are not worth bothering much about, and
also that probably a simple boolean option would be quite acceptable
instead of the chunk approach.
Attached is an updated version of the patch which turns the sort option
into a boolean, and also includes the sort time in the checkpoint log.
There is still an open question about whether the sorting buffer
allocation is lost on some signals and should be reallocated in such
event.
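For reference, with this version, enabling both features for a test run
reduces to two postgresql.conf settings (per the patch, sorting defaults
to on and flushing to off):

checkpoint_sort = on              # default with this patch
checkpoint_flush_to_disk = on     # default is off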
--
Fabien.
Attachment: checkpoint-continuous-flush-4.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..d7c1ff8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2474,6 +2474,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ On HDD storage, this setting allows neighboring pages to be written
+ out to disk together, thus improving performance by reducing random
+ write activity.
+ This sorting should have limited performance effects on SSD backends
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
@@ -2495,6 +2517,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smoothing
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f4083c3..172a779 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,29 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for final data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages brings no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ hints to the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput. This feature probably brings no benefit on SSD,
+ as the I/O write latency is small on such hardware, thus it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..49ec258 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..0ff48b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7956,11 +7956,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -7991,6 +7993,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8009,8 +8015,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8018,6 +8024,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..52dd7db 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -663,7 +663,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -698,6 +699,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..2bfb067 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,10 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
+/* by default, sort by chunks of 1 GB worth of 8 kB buffers */
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +400,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +414,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1024,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1561,6 +1567,75 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Array of buffer ids of all buffers to checkpoint.
+ */
+static int * CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(*pa),
+ *b = GetBufferDescriptor(*pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, not that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file", compare block number
+ * which are mapped on different segments depending on the number.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+static void AllocateCheckpointBufferIds(void)
+{
+ /* Safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely.
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ * - done: whether it is done
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+ bool done;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1575,10 +1650,21 @@ static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
+ int i;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, active_spaces, space;
+ FileFlushContext * spcContext = NULL;
+
+ /*
+ * Lazy allocation: this function is called through the checkpointer,
+ * but also by initdb. Maybe the allocation could be moved to the callers.
+ */
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1695,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1719,169 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status & flush context arrays */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+ spcStatus[index].done = false;
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /*
+ * Sort buffer ids to help find sequential writes.
+ *
+ * Note: buffers are not locked in anyway, but that does not matter,
+ * this sorting is really advisory, if some buffer changes status during
+ * this pass it will be filtered out later. The only necessary property
+ * is that marked buffers do not move elsewhere. Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
+ */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of active spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ int index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one with is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (spcStatus[space].done ||
+ /* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && bufHdr == NULL)
+ {
+ buf_id = CheckpointBufferIds[index];
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ bufHdr = NULL;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1895,49 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ spcStatus[space].done = true;
+ active_spaces--;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
@@ -1939,7 +2184,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2016,7 +2262,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2304,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2566,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2664,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -2830,7 +3086,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +3122,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2916,7 +3174,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such write have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 230c5cc..2549873 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1009,6 +1010,27 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -9760,6 +9782,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..ae7f7cb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = off # send buffers to disk on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..db0e2c3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,8 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
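To make the flush-merging idea in FileAsynchronousFlush/PerformFileFlush
easier to follow in isolation, here is a minimal standalone Linux-only
sketch of the same technique; all names in it are invented for the demo
and are not the patch's API:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct flush_ctx { int fd; off_t off, len; int ncalls; };

static void flush_now(struct flush_ctx *c)
{
    /* issue one merged flush request for the accumulated range */
    if (c->ncalls > 0)
        sync_file_range(c->fd, c->off, c->len, SYNC_FILE_RANGE_WRITE);
    c->ncalls = 0;
}

static void flush_later(struct flush_ctx *c, int fd, off_t off, off_t len)
{
    if (c->ncalls > 0 && c->fd == fd)
    {
        /* same file: widen the pending interval to cover both ranges */
        off_t lo = off < c->off ? off : c->off;
        off_t hi1 = c->off + c->len, hi2 = off + len;

        c->off = lo;
        c->len = (hi1 > hi2 ? hi1 : hi2) - lo;
        c->ncalls++;
    }
    else
    {
        flush_now(c);           /* file changed: flush what we had */
        c->fd = fd;
        c->off = off;
        c->len = len;
        c->ncalls = 1;
    }
}

int main(void)
{
    struct flush_ctx ctx = { 0, 0, 0, 0 };
    int fd = open("demo.dat", O_CREAT | O_WRONLY, 0600);
    char page[8192] = { 0 };
    int i;

    for (i = 0; i < 16; i++)    /* sixteen sequential 8 kB writes */
    {
        if (pwrite(fd, page, sizeof(page), (off_t) i * sizeof(page)) < 0)
            return 1;
        flush_later(&ctx, fd, (off_t) i * sizeof(page), sizeof(page));
    }
    flush_now(&ctx);            /* one merged 128 kB flush request */
    close(fd);
    return 0;
}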
On Wed, Jun 24, 2015 at 9:50 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
flsh | full speed tps | percent of late tx, 4 clients
/srt | 1 client | 4 clients | 100 | 200 | 400 |
N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
N/Y | 458 +- 327* | 743 +- 920* | 7.05 | 14.24 | 24.07 |
Y/N | 169 +- 166* | 187 +- 302* | 4.01 | 39.84 | 65.70 |
Y/Y | 546 +- 143 | 681 +- 459 | 1.55 | 3.51 | 2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8% with 4
clients) effect on throughput, and late transactions are reduced by
92-95% when both options are activated.

Why is there a dip in performance with multiple clients?

I'm not sure I see the "dip". Performance is better with 4 clients
than with 1 client?

What do you mean by "negative (-8% with 4 clients) effect on throughput"
in the above sentence? I thought you meant that there is a dip in TPS
with the patch as compared to HEAD at 4 clients.

Ok, I misunderstood your question. I thought you meant a dip between 1
client and 4 clients. I meant that when flush is turned on tps goes down
by 8% (743 to 681 tps) on this particular run.
This 8% might matter if the dip is bigger with more clients and a more
aggressive workload. Do you know what could lead to this dip? If we knew
the reason, it would be easier to predict whether this is the maximum
dip that can happen or whether it could be bigger in other cases.
Basically tps improvements mostly come from "sort", and "flush" has
uncertain effects on tps (throughput), but a much clearer effect on
latency and performance stability (lower late rate, lower standard
deviation).

I agree that performance stability is important, but I am not sure it is
a good idea to sacrifice throughput for it. If sort + flush always gives
better results, then isn't it better to perform these actions together
under one option?
Note that I'm not comparing to HEAD in the above tests, but with the new
options deactivated, which should be more or less comparable to current
HEAD, i.e. there is no sorting nor flushing done, but this is not
strictly speaking HEAD behavior. Probably I should get some figures with
HEAD as well to check the "more or less" assumption.

Also, I am not completely sure what "+-" means in your data above?

The first figure before "+-" is the tps, the second after is its
standard deviation computed from per-second traces. Some runs are very
bad, with pgbench stuck at times, and result in a stddev larger than the
average; they are noted with "*".

I understand your point and I don't have any specific answer for it at
this moment; the worry is that it should not lead to degradation of
certain cases as compared to the current algorithm. The workload where
it could have an effect is when your data doesn't fit in shared buffers,
but can fit in RAM.

Hmmm. My point of view is still that the logical priority is to optimize
for disk IO first, then look for compatible RAM optimisations later.

It is not only about RAM optimisation which we can do later, but also
about avoiding regressions in existing use-cases.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
[...]
Ok, I misunderstood your question. I thought you meant a dip between 1
client and 4 clients. I meant that when flush is turned on tps goes down
by 8% (743 to 681 tps) on this particular run.

This 8% might matter if the dip is bigger with more clients and a more
aggressive workload. Do you know what could lead to this dip? If we knew
the reason, it would be easier to predict whether this is the maximum
dip that can happen or whether it could be bigger in other cases.
I do not know the cause of the dip, nor whether it would increase with
more clients. I do not have a box for such tests. If someone can provide
the box, I can provide test scripts:-)

The first, although higher, measure is really very unstable, with pg
totally unresponsive (offline, really) at times.
I think that the flush option may always have a risk of (small)
detrimental effects on tps, because there are two steady states: one with
pg only doing wal-logged transactions with great tps, and one with pg
doing the checkpoint at nought tps. If this is on the same disk, even at
best the combination means that probably each operation will hamper the
other one a little bit, so the combined tps performance would/could be
lower than doing one after the other and having pg offline 50% of the
time...
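To make that concrete with invented numbers (not a measurement): suppose pg
sustains 700 tps while no checkpoint is writing, and the checkpoint write
phase alone occupies half of the cycle. Strictly alternating the two would
average

    0.5 * 700 + 0.5 * 0 = 350 tps

so the concurrent mix only has to drag the transaction side below half of
its full speed, averaged over the whole cycle, for the combined figure to
end up under the alternating one. That is the risk described above.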
Please also note that this 8% "dip" compares 681 tps (with the dip) to 198
tps (no options at all), i.e. a x3.4 improvement over current pg behavior.
Basically tps improvements mostly come from "sort", and "flush" has
uncertain effects on tps (throughput), but much more on latency and
performance stability (lower late rate, lower standard deviation).

I agree that performance stability is important, but I am not sure it
is a good idea to sacrifice throughput for it.
See discussion above. I think better stability may imply slightly lower
throughput on some loads. That is why there are options, and DBAs to choose
them:-)
If sort + flush always gives better results, then isn't it better to
perform these actions together under one option?
Sure, but that is not currently the case. Also, what each does is quite
orthogonal, so I would tend to keep them separate. If one is always
beneficial and should always be activated, then its option could be
removed.
Hmmm. My point of view is still that the logical priority is to optimize
for disk IO first, then look for compatible RAM optimisations later.

It is not only about RAM optimisation, which we can do later, but also
about avoiding regressions in existing use-cases.
Hmmm. So far I have not seen really significant regressions, only a less
good impact of some options on some loads.
--
Fabien.
Note that I'm not comparing to HEAD in the above tests, but with the new
options deactivated, which should be more or less comparable to current
HEAD, i.e. there is no sorting nor flushing done, but this is not strictly
speaking HEAD behavior. Probably I should get some figures with HEAD as well
to check the "more or less" assumption.
Just to answer myself on this point: I tried current HEAD vs patch v4
with sort OFF + flush OFF: the figures are indeed quite comparable (see
below), so although the internal implementation is different, the
performance when both options are off is still a reasonable approximation
of the performance without the patch, as I was expecting. What patch v4
still does with OFF/OFF which is not done by HEAD is balancing writes
among tablespaces, but there is only one disk on these tests so it does
not matter.
tps & stddev full speed:
HEAD OFF/OFF
tiny 1 client 727 +- 227 221 +- 246
small 1 client 158 +- 316 158 +- 325
medium 1 client 148 +- 285 157 +- 326
tiny 4 clients 2088 +- 786 2074 +- 699
small 4 clients 192 +- 648 188 +- 560
medium 4 clients 220 +- 654 220 +- 648
percent of late transactions:
HEAD OFF/OFF
tiny 4 clients 100 tps 6.31 6.67
small 4c 100 tps 35.68 35.23
medium 4c 100 tps 37.38 38.00
tiny 4c 200 tps 9.06 9.10
small 4c 200 tps 51.65 51.16
medium 4c 200 tps 51.35 50.20
tiny 4 clients 400 tps 11.4 10.5
small 4 clients 400 tps 66.4 67.6
--
Fabien.
On 2015-06-26 21:47:30 +0200, Fabien COELHO wrote:
tps & stddev full speed:
HEAD OFF/OFF
tiny 1 client 727 +- 227 221 +- 246
Huh?
Hello Andres,
HEAD OFF/OFF
tiny 1 client 727 +- 227 221 +- 246
Huh?
Indeed, just to check that someone was reading this magnificent mail:-)
Just a typo, because I reformatted the figures for simpler comparison: 221
is really 721, quite close to 727.
--
Fabien.
Attached is an updated version of the patch which turns the sort option into
a boolean, and also includes the sort time in the checkpoint log.

There is still an open question about whether the sorting buffer allocation
is lost on some signals and should be reallocated in such an event.
In that case, the allocation should probably be managed from
CheckpointerMain, and the lazy allocation could remain for other callers
(I guess just "initdb").
More open questions:
- best name for the flush option (checkpoint_flush_to_disk,
checkpoint_flush_on_write, checkpoint_flush, ...)
- best name for the sort option (checkpoint_sort,
checkpoint_sort_buffers, checkpoint_sort_ios, ...)
Other nice-to-have inputs:
- tests on a non-linux system with posix_fadvise
(FreeBSD? others? a standalone probe sketch follows below)
- tests on a large dedicated box
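For whoever can test the posix_fadvise path (FreeBSD or others), here is a
minimal standalone probe, my sketch and not part of the patch, which
exercises just the system call the patch issues after checkpoint writes:

/* compile with: cc -D_POSIX_C_SOURCE=200112L fadvise_probe.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int i, rc;
    int fd = open("fadvise_probe.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0, sizeof(buf));

    /* write 8 MB of 8 kB blocks, as a small checkpoint would */
    for (i = 0; i < 1024; i++)
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        { perror("write"); return 1; }

    /* hint the OS to push the written data out of the cache;
     * offset 0 and len 0 cover the whole file here, whereas the
     * patch passes the accumulated offset/nbytes range */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    unlink("fadvise_probe.tmp");
    return rc != 0;
}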
Attached are some scripts to help with testing, if someone feels like
it:
- cp_test.sh: run some tests, to adapt to one's setup...
- cp_test_count.pl: show percent of late transactions
- avg.py: show stats about stuff
sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py
*BEWARE* that if pgbench got stuck some "0" data are missing;
look for the actual tps in the output file and for the line
count to check whether it is the case... a currently submitted
patch on pgbench helps, see https://commitfest.postgresql.org/5/199/
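In case avg.py is not handy (it is only attached), a minimal C sketch of
what I assume it computes: read one number per line on stdin, print count,
average and population standard deviation (build with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x, sum = 0.0, sum2 = 0.0;
    long n = 0;

    /* accumulate sum and sum of squares over all input values */
    while (scanf("%lf", &x) == 1)
    {
        sum += x;
        sum2 += x * x;
        n++;
    }

    if (n > 0)
    {
        double avg = sum / n;
        /* population standard deviation; fine for eyeballing traces */
        double stddev = sqrt(sum2 / n - avg * avg);
        printf("count=%ld avg=%.3f stddev=%.3f\n", n, avg, stddev);
    }
    return 0;
}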
--
Fabien.
Hello,
Attached is a very minor v5 update which does a rebase & completes the
cleanup of doing a full sort instead of a chunked sort.
Attached is an updated version of the patch which turns the sort option
into a boolean, and also includes the sort time in the checkpoint log.

There is still an open question about whether the sorting buffer allocation
is lost on some signals and should be reallocated in such an event.

In that case, the allocation should probably be managed from
CheckpointerMain, and the lazy allocation could remain for other callers (I
guess just "initdb").

More open questions:
- best name for the flush option (checkpoint_flush_to_disk,
checkpoint_flush_on_write, checkpoint_flush, ...)
- best name for the sort option (checkpoint_sort,
checkpoint_sort_buffers, checkpoint_sort_ios, ...)

Other nice-to-have inputs:
- tests on a non-linux system with posix_fadvise
(FreeBSD? others?)
- tests on a large dedicated box

Attached are some scripts to help with testing, if someone feels like it:
- cp_test.sh: run some tests, to adapt to one's setup...
- cp_test_count.pl: show percent of late transactions
- avg.py: show stats about stuff
sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py
*BEWARE* that if pgbench got stuck some "0" data are missing;
look for the actual tps in the output file and for the line
count to check whether it is the case... a currently submitted
patch on pgbench helps, see https://commitfest.postgresql.org/5/199/
As this pgbench patch is now in master, pgbench is less likely to get
stuck, but check nevertheless that the number of progress lines matches
the expected number.
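As a usage note: both GUCs are declared PGC_SIGHUP in the patch, so for
experiments they can be toggled with a reload rather than a restart. A
hypothetical postgresql.conf excerpt, names as defined by this patch
version:

checkpoint_sort = on             # sort buffers by relation/fork/block
checkpoint_flush_to_disk = on    # hint OS to push checkpoint writes early

followed by "SELECT pg_reload_conf();" (or pg_ctl reload) so that the next
checkpoint picks them up.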
--
Fabien.
Attachments:
checkpoint-continuous-flush-5.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bbe1eb0..0257e34 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,6 +2483,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting makes it possible to group together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have a limited performance effect on SSD backends,
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
@@ -2504,6 +2526,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,29 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput. This feature probably brings no benefit on SSD,
+ as the I/O write latency is small on such hardware, thus it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9431ab5..49ec258 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1dd31b3..e0bfa66 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7942,11 +7942,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -7977,6 +7979,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -7995,8 +8001,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8004,6 +8010,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..8c7b099 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,10 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
+/* by default, sort by chunks of 1 GB worth of 8 kB buffers */
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -396,7 +400,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +414,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1024,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1561,6 +1567,75 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Array of buffer ids of all buffers to checkpoint.
+ */
+static int * CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(*pa),
+ *b = GetBufferDescriptor(*pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, note that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file", compare block numbers,
+ * which are mapped on different segments depending on the number.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+static void AllocateCheckpointBufferIds(void)
+{
+ /* Safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely.
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ * - done: whether it is done
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+ bool done;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1575,10 +1650,20 @@ static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, active_spaces, space;
+ FileFlushContext * spcContext = NULL;
+
+ /*
+ * Lazy allocation: this function is called through the checkpointer,
+ * but also by initdb. Maybe the allocation could be moved to the callers.
+ */
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1694,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1718,169 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status & flush context arrays */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+ spcStatus[index].done = false;
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /*
+ * Sort buffer ids to help find sequential writes.
+ *
+ * Note: buffers are not locked in any way, but that does not matter,
+ * this sorting is really advisory, if some buffer changes status during
+ * this pass it will be filtered out later. The only necessary property
+ * is that marked buffers do not move elsewhere. Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
+ */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of active spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ int index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (spcStatus[space].done ||
+ /* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && bufHdr == NULL)
+ {
+ buf_id = CheckpointBufferIds[index];
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ bufHdr = NULL;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1894,49 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ spcStatus[space].done = true;
+ active_spaces--;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
@@ -1939,7 +2183,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2016,7 +2261,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2303,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2565,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2663,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -2830,7 +3085,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2864,7 +3121,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2916,7 +3173,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1bed525..80d9a3e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1013,6 +1014,27 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -9806,6 +9828,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..ae7f7cb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = off # send buffers to disk on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..db0e2c3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,8 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
On 07/26/2015 06:01 PM, Fabien COELHO wrote:
Attached is a very minor v5 update which does a rebase & completes the
cleanup of doing a full sort instead of a chunked sort.
Some thoughts on this:
* I think we should drop the "flush" part of this for now. It's not as
clearly beneficial as the sorting part, and adds a great deal more code
complexity. And it's orthogonal to the sorting patch, so we can deal
with it separately.
* Is it really necessary to parallelize the I/O among tablespaces? I can
see the point, but I wonder if it makes any difference in practice.
* Is there ever any harm in sorting the buffers? The GUC is useful for
benchmarking, but could we leave it out of the final patch?
* Do we need to worry about exceeding the 1 GB allocation limit in
AllocateCheckpointBufferIds? It is enough to have 2 TB of shared_buffers:
1 GB of 4-byte buffer ids covers 2^28 buffers, i.e. 2 TB of 8 kB pages.
That's a lot, but it's not totally crazy these days that someone might
do that. At the very least, we need to lower the maximum of
shared_buffers so that you can't hit that limit.
I ripped out the "flushing" part, keeping only the sorting. I refactored
the logic in BufferSync() a bit. There's now a separate function,
nextCheckpointBuffer(), that returns the next buffer ID from the sorted
list. The tablespace-parallelization behaviour is encapsulated there,
keeping the code in BufferSync() much simpler. See attached. Needs some
minor cleanup and commenting still before committing, and I haven't done
any testing besides a simple "make check".
- Heikki
Attachments:
checkpoint-sort-heikki-1.patch (application/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting makes it possible to group together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have a limited performance effect on SSD backends,
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..084bbfb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -1562,6 +1563,101 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
/*
+ * Array of buffer ids of all buffers to checkpoint.
+ */
+static int *CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(*pa),
+ *b = GetBufferDescriptor(*pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, note that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
+ return -1;
+ else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
+ return 1;
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file", compare block numbers,
+ * which are mapped on different segments depending on the number.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+static void
+AllocateCheckpointBufferIds(void)
+{
+ /* Safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely.
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - index: current scanning position in CheckpointBufferIds
+ *   for this tablespace
+ * - index_end: position just past the last buffer id
+ *   belonging to this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ int index;
+ int index_end;
+} TableSpaceCheckpointStatus;
+
+static int allocatedSpc = 0;
+static TableSpaceCheckpointStatus *spcStatus = NULL;
+static int numSpc;
+static int currSpc;
+
+static int
+nextCheckpointBuffer(void)
+{
+ int result;
+
+ if (numSpc == 0)
+ return -1;
+
+ currSpc = (currSpc + 1) % numSpc;
+
+ result = CheckpointBufferIds[spcStatus[currSpc].index];
+ spcStatus[currSpc].index++;
+
+ if (spcStatus[currSpc].index == spcStatus[currSpc].index_end)
+ {
+ if (currSpc < numSpc - 1)
+ {
+ TableSpaceCheckpointStatus tmp = spcStatus[currSpc];
+ spcStatus[currSpc] = spcStatus[numSpc - 1];
+ spcStatus[numSpc - 1] = tmp;
+ }
+ numSpc--;
+ }
+
+ return result;
+}
+
+/*
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
@@ -1575,11 +1671,17 @@ static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ /*
+ * Lazy allocation: this function is called through the checkpointer,
+ * but also by initdb. Maybe the allocation could be moved to the callers.
+ */
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
+
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1622,6 +1724,7 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
}
@@ -1633,18 +1736,97 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status array */
+
+ /*
+ * Sort buffer ids to help find sequential writes.
+ *
+ * Note: buffers are not locked in any way, but that does not matter,
+ * this sorting is really advisory, if some buffer changes status during
+ * this pass it will be filtered out later. The only necessary property
+ * is that marked buffers do not move elsewhere. Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
+ */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort && num_to_write > 1 && false)
+ {
+ Oid lastspc;
+ Oid spc;
+ int i,
+ j;
+ volatile BufferDesc *bufHdr;
+
+ qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+
+ if (allocatedSpc == 0)
+ {
+ allocatedSpc = 5;
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
+ }
+
+ bufHdr = GetBufferDescriptor(CheckpointBufferIds[0]);
+ spcStatus[0].index = 0;
+ lastspc = bufHdr->tag.rnode.spcNode;
+ j = 0;
+ for (i = 1; i < num_to_write; i++)
+ {
+ bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);
+
+ spc = bufHdr->tag.rnode.spcNode;
+ if (spc != lastspc && (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0)
+ {
+ if (allocatedSpc <= j)
+ {
+ allocatedSpc = j + 5;
+ spcStatus = (TableSpaceCheckpointStatus *)
+ repalloc(spcStatus, sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
+ }
+
+ spcStatus[j].index_end = spcStatus[j + 1].index = i;
+ j++;
+ lastspc = spc;
+ }
+ }
+ spcStatus[j].index_end = num_to_write;
+ j++;
+ numSpc = j;
+ currSpc = 0;
+ }
+ else
+ {
+ if (allocatedSpc == 0)
+ {
+ allocatedSpc = 1;
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
+ }
+ spcStatus[0].index = 0;
+ spcStatus[0].index_end = num_to_write;
+ numSpc = 1;
+ currSpc = 0;
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, marking this tablespace scan as done and
+ * decrementing the number of active spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while ((buf_id = nextCheckpointBuffer()) != -1)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1669,28 +1851,11 @@ BufferSync(int flags)
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
-
- if (++buf_id >= NBuffers)
- buf_id = 0;
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1b7b914..e07daca 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
Hello Heikki,
Thanks for having a look at the patch.
* I think we should drop the "flush" part of this for now. It's not as
clearly beneficial as the sorting part, and adds a great deal more code
complexity. And it's orthogonal to the sorting patch, so we can deal with it
separately.
I agree that it is orthogonal and that the two features could be in
distinct patches. The flush part is the first patch I really submitted
because it has a significant effect on latency, and I was told to mix it
with sorting...
The flushing part really helps to keep "write stalls" under control in
many cases, for instance:
- 400-tps 1-client (or 4 for medium) max 100-ms latency
options | percent of late transactions
flush | sort | tiny | small | medium
off | off | 12.0 | 64.28 | 68.6
off | on | 11.3 | 22.05 | 22.6
on | off | 1.1 | 67.93 | 67.9
on | on | 0.6 | 3.24 | 3.1
The "percent of late transactions" is really the fraction of time the
database is unreachable because of write stalls... So sort without flush
is clearly not enough.
Another thing suggested by Andres is to fsync as early as possible, but
this is not a simple patch because it intermixes things which are currently
in distinct parts of checkpoint processing, so I already decided that this
would be for another submission.
* Is it really necessary to parallelize the I/O among tablespaces? I can see
the point, but I wonder if it makes any difference in practice.
I think that if someone bothers to set up tablespaces, there is no reason
to defeat that choice behind her back. Without sorting you may hope that
tablespaces will be touched randomly enough, but once buffers are sorted you
can probably find cases where all writes would go to one tablespace and then
to the other.
So I think that it really should be kept.
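To make the balancing idea concrete, here is a minimal standalone sketch
(illustrative only, invented names, not code from the patch): select the
next tablespace whose own write progress does not exceed the overall
progress, so that all tablespaces advance roughly proportionally.

#include <stdint.h>

typedef struct { int written; int to_write; } SpcProgress;

/* Return a tablespace whose progress ratio lags the overall ratio.
 * Comparing cross products in 64 bits avoids divisions and overflow. */
static int
pick_lagging_space(SpcProgress *spc, int nspc, int written, int to_write)
{
	int s;

	for (s = 0; s < nspc; s++)
	{
		/* spc[s].written / spc[s].to_write <= written / to_write ? */
		if ((int64_t) spc[s].written * to_write <=
			(int64_t) written * spc[s].to_write)
			return s;		/* this one is not ahead of schedule */
	}
	return 0;				/* unreachable if the counters are consistent */
}

Such a lagging tablespace always exists: if every tablespace were strictly
ahead of the overall ratio, summing them would put the overall ratio ahead
of itself.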
* Is there ever any harm in sorting the buffers? The GUC is useful for
benchmarking, but could we leave it out of the final patch?
I think that the performance results show that it is basically always beneficial,
so the guc may be left out. However on SSD it is unclear to me whether it
is just a loss of time or whether it helps, say with wear-leveling. Maybe
best to keep it? Anyway it is definitely needed for testing.
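For what it is worth, since the guc is SIGHUP-level it can be toggled
between benchmark runs without restarting the server (usage sketch):

    # in postgresql.conf, then reload with "pg_ctl reload"
    checkpoint_sort = off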
* Do we need to worry about exceeding the 1 GB allocation limit in
AllocateCheckpointBufferIds? It's hit with 2 TB of shared_buffers. That's a
lot, but it's not totally crazy these days that someone might do that. At the
very least, we need to lower the maximum of shared_buffers so that you can't
hit that limit.
Yep.
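For the record, the arithmetic behind that threshold, assuming 8 kB pages
and palloc's 1 GB MaxAllocSize cap:

    sizeof(int) * NBuffers > 1 GB   once   NBuffers > 256M
    256M buffers * 8 kB/page = 2 TB of shared_buffers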
I ripped out the "flushing" part, keeping only the sorting. I refactored
the logic in BufferSync() a bit. There's now a separate function,
nextCheckpointBuffer(), that returns the next buffer ID from the sorted
list. The tablespace-parallelization behaviour is encapsulated there,
I do not understand the new tablespace-parallelization logic: there is no
test about the tablespace of the buffer in the selection process... Note
that I did write a proof for the one I put in, and also did some detailed
testing on the side because I'm always wary of proofs, especially mine :-)
I notice that you assume that table space numbers are always small and
contiguous. Is that a fact? I was feeling more at ease with relying on a
hash table to avoid such an assumption.
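For reference, a minimal sketch of the hash-based counting I mean, close to
what my version does (entry struct abridged):

	typedef struct { Oid space; int count; } TableSpaceCountEntry;

	HASHCTL ctl;
	HTAB   *spcBuffers;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(TableSpaceCountEntry);
	spcBuffers = hash_create("buffers per tablespace", 16, &ctl,
							 HASH_ELEM | HASH_BLOBS);

Only oid equality is assumed, nothing about their range or density.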
keeping the code in BufferSync() much simpler. See attached. Needs some
minor cleanup and commenting still before committing, and I haven't done
any testing besides a simple "make check".
Hmmm..., just another detail, the patch does not sort:
+ if (checkpoint_sort && num_to_write > 1 && false)
I'll resubmit a patch with only the sorting part, and do the kind of
restructuring you suggest, which is a good thing.
--
Fabien.
On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
* I think we should drop the "flush" part of this for now. It's not as
clearly beneficial as the sorting part, and adds a great deal more code
complexity. And it's orthogonal to the sorting patch, so we can deal with it
separately.
I don't agree. For one I've seen it cause rather big latency
improvements, and we're horrible at that. But more importantly I think
the requirements of the flush logic influence how exactly the sorting
is done. Splitting them will just make it harder to do the flushing in a
not too big patch.
* Is it really necessary to parallelize the I/O among tablespaces? I can see
the point, but I wonder if it makes any difference in practice.
Today it's somewhat common to have databases that are bottlenecked on
write IO and all those writes being done by the checkpointer. If we
suddenly do the writes to individual tablespaces separately and
sequentially we'll be bottlenecked on the peak IO of a single
tablespace.
* Is there ever any harm in sorting the buffers? The GUC is useful for
benchmarking, but could we leave it out of the final patch?
Agreed.
* Do we need to worry about exceeding the 1 GB allocation limit in
AllocateCheckpointBufferIds? It's hit with 2 TB of shared_buffers. That's
a lot, but it's not totally crazy these days that someone might do that. At
the very least, we need to lower the maximum of shared_buffers so that you
can't hit that limit.
We can just use the _huge variant?
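A sketch of what that could look like, assuming the existing
MemoryContextAllocHuge()/mul_size() helpers (untested):

	CheckpointBufferIds = (int *)
		MemoryContextAllocHuge(TopMemoryContext,
							   mul_size(sizeof(int), NBuffers));

The _huge variant is exempt from the 1 GB MaxAllocSize cap, and mul_size()
errors out cleanly on overflow.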
Greetings,
Andres Freund
Hi,
On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
I ripped out the "flushing" part, keeping only the sorting. I refactored the
logic in BufferSync() a bit. There's now a separate function,
nextCheckpointBuffer(), that returns the next buffer ID from the sorted
list. The tablespace-parallelization behaviour is encapsulated there,
keeping the code in BufferSync() much simpler. See attached. Needs some
minor cleanup and commenting still before committing, and I haven't done any
testing besides a simple "make check".
Thought it'd be useful to review the current version as well. Some of
what I'm commenting on you'll probably already have thought of under the
label of "minor cleanup".
+/*
+ * Array of buffer ids of all buffers to checkpoint.
+ */
+static int *CheckpointBufferIds = NULL;
+
+/* Compare checkpoint buffers
+ */

Should be at the beginning of the file. There's a bunch more cases of that.
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+	BufferDesc
+		*a = GetBufferDescriptor(*pa),
+		*b = GetBufferDescriptor(*pb);
+
+	/* tag: rnode, forkNum (different files), blockNum
+	 * rnode: { spcNode (ignore: not really needed),
+	 *          dbNode (ignore: this is a directory), relNode }
+	 * spcNode: table space oid, note that there are at least two
+	 * (pg_global and pg_default).
+	 */
+	/* compare relation */
+	if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
+		return -1;
+	else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
+		return 1;
+	if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+		return -1;
+	else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+		return 1;
+	/* same relation, compare fork */
+	else if (a->tag.forkNum < b->tag.forkNum)
+		return -1;
+	else if (a->tag.forkNum > b->tag.forkNum)
+		return 1;
+	/* same relation/fork, so same segmented "file", compare block number
+	 * which are mapped on different segments depending on the number.
+	 */
+	else if (a->tag.blockNum < b->tag.blockNum)
+		return -1;
+	else	/* should not be the same block anyway... */
+		return 1;
+}
This definitely needs comments about ignoring the normal buffer header
locking.
Why are we ignoring the database directory? I doubt it'll make a huge
difference, but grouping metadata-affecting operations by directory
helps.
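Grouping by directory would just mean one more comparison ahead of relNode,
something like (hypothetical, not in the patch as posted):

	if (a->tag.rnode.dbNode < b->tag.rnode.dbNode)
		return -1;
	else if (a->tag.rnode.dbNode > b->tag.rnode.dbNode)
		return 1;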
+
+static void
+AllocateCheckpointBufferIds(void)
+{
+	/* Safe worst case allocation, all buffers belong to the checkpoint...
+	 * that is pretty unlikely.
+	 */
+	CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}

(wrong comment style...)
Heikki, you were concerned about the size of the allocation of this,
right? I don't think it's relevant - we used to allocate an array of
that size for the backend's private buffer pin array until 9.5, so in
theory we should be safe against that. NBuffers is limited to INT_MAX/2
in guc.c, which ought to be sufficient?
+	/*
+	 * Lazy allocation: this function is called through the checkpointer,
+	 * but also by initdb. Maybe the allocation could be moved to the callers.
+	 */
+	if (CheckpointBufferIds == NULL)
+		AllocateCheckpointBufferIds();
+
I don't think it's a good idea to allocate this on every round. That
just means a lot of page table entries have to be built and torn down
regularly. It's not like checkpoints only run for 1% of the time or
such.
FWIW, I still think it's a much better idea to allocate the memory once
in shared buffers. It's not like that makes us need more memory overall,
and it'll be huge page allocations if configured. I also think that
sooner rather than later we're going to need more than one process
flushing buffers, and then it'll need to be moved there.
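As a sketch, assuming the usual shmem bookkeeping (the shared-memory size
estimate would have to account for it as well):

	bool		found;

	CheckpointBufferIds = (int *)
		ShmemInitStruct("Checkpoint Buffer Ids",
						mul_size(sizeof(int), NBuffers), &found);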
+	/*
+	 * Sort buffer ids to help find sequential writes.
+	 *
+	 * Note: buffers are not locked in anyway, but that does not matter,
+	 * this sorting is really advisory, if some buffer changes status during
+	 * this pass it will be filtered out later. The only necessary property
+	 * is that marked buffers do not move elsewhere.
+	 */
That reasoning makes it impossible to move the fsyncing of files into
the loop (whenever we move to a new file). That's not nice. The
formulation with "necessary property" doesn't seem very clear to me?
How about:
/*
* Note: Buffers are not locked in any way during sorting, but that's ok:
* A change in the buffer header is only relevant when it changes the
* buffer's identity. If the identity has changed it'll have been
* written out by BufferAlloc(), so there's no need for checkpointer to
* write it out anymore. The buffer might also get written out by a
* backend or bgwriter, but that's equally harmless.
*/
Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.
Hm. Is that actually the case for our qsort implementation? If the pivot
element changes its identity won't the result be pretty much random?
+
+	if (checkpoint_sort && num_to_write > 1 && false)
+	{
&& false - Huh?
+		qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+			  (int(*)(const void *, const void *)) bufcmp);
+
Ick, I'd rather move the typecasts to the comparator.
+		for (i = 1; i < num_to_write; i++)
+		{
+			bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);
+
+			spc = bufHdr->tag.rnode.spcNode;
+			if (spc != lastspc && (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0)
+			{
+				if (allocatedSpc <= j)
+				{
+					allocatedSpc = j + 5;
+					spcStatus = (TableSpaceCheckpointStatus *)
+						repalloc(spcStatus, sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
+				}
+
+				spcStatus[j].index_end = spcStatus[j + 1].index = i;
+				j++;
+				lastspc = spc;
+			}
+		}
+		spcStatus[j].index_end = num_to_write;
This really deserves some explanation.
Regards,
Andres Freund
Hello Andres,
Thanks for your comments. Some answers and new patches included.
+/*
+ * Array of buffer ids of all buffers to checkpoint.
+ */
+static int *CheckpointBufferIds = NULL;

Should be at the beginning of the file. There's a bunch more cases of that.
done.
+/* Compare checkpoint buffers
+ */
+static int bufcmp(const int * pa, const int * pb)
+{
+	BufferDesc
+		*a = GetBufferDescriptor(*pa),
+		*b = GetBufferDescriptor(*pb);

This definitely needs comments about ignoring the normal buffer header
locking.
Added.
Why are we ignoring the database directory? I doubt it'll make a huge
difference, but grouping metadata affecting operations by directory
helps.
I wanted to do the minimal comparisons to order buffers per file, so I
skipped everything else. My idea of a checkpoint is a lot of data in a few
files (few, at least compared to the amount of data...), so I do not think that it is
worth it. I may be proven wrong!
+static void
+AllocateCheckpointBufferIds(void)
+{
+	/* Safe worst case allocation, all buffers belong to the checkpoint...
+	 * that is pretty unlikely.
+	 */
+	CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
+}

(wrong comment style...)
Fixed.
Heikki, you were concerned about the size of the allocation of this,
right? I don't think it's relevant - we used to allocate an array of
that size for the backend's private buffer pin array until 9.5, so in
theory we should be safe against that. NBuffers is limited to INT_MAX/2
in guc.c, which ought to be sufficient?
I think that there is no issue with the current shared_buffers limit. I
could allocate and use 4 GB on my laptop without problem. I added a cast
to ensure that unsigned arithmetic is used for the size computation.
+	/*
+	 * Lazy allocation: this function is called through the checkpointer,
+	 * but also by initdb. Maybe the allocation could be moved to the callers.
+	 */
+	if (CheckpointBufferIds == NULL)
+		AllocateCheckpointBufferIds();
+

I don't think it's a good idea to allocate this on every round.
That just means a lot of page table entries have to be built and torn
down regularly. It's not like checkpoints only run for 1% of the time or
such.
Sure. It is not allocated on every round: it is allocated once, on the
first checkpoint; the variable tested is static. There is no free. Maybe
the allocation could be moved to the callers, though.
FWIW, I still think it's a much better idea to allocate the memory once
in shared buffers.
Hmmm. The memory does not need to be shared with other processes?
It's not like that makes us need more memory overall, and it'll be huge
page allocations if configured. I also think that sooner rather than
later we're going to need more than one process flushing buffers, and
then it'll need to be moved there.
That is an argument. I think that it could wait for the need to actually
arise.
+	/*
+	 * Sort buffer ids to help find sequential writes.
+	 *
+	 * Note: buffers are not locked in anyway, but that does not matter,
+	 * this sorting is really advisory, if some buffer changes status during
+	 * this pass it will be filtered out later. The only necessary property
+	 * is that marked buffers do not move elsewhere.
+	 */

That reasoning makes it impossible to move the fsyncing of files into
the loop (whenever we move to a new file). That's not nice.
I do not see why. Moving fsync ahead is definitely an idea that you
already pointed out; I have given it some thought, and it would require
a careful implementation and some restructuring. For instance, you do not
want to issue fsync right after having done the writes, you want to wait a
little bit so that the system has had time to write the buffers to disk.
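On Linux this could eventually look like a two-phase use of
sync_file_range (a sketch of the idea only; the final fsync would still
be needed for file metadata):

	/* right after writing a batch: start the write-out, do not wait */
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

	/* later, once the kernel has had time: actually wait on those ranges */
	sync_file_range(fd, offset, nbytes,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);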
The formulation with "necessary property" doesn't seem very clear to me?
Removed.
How about:

/*
 * Note: Buffers are not locked in any way during sorting, but that's ok:
 * A change in the buffer header is only relevant when it changes the
 * buffer's identity. If the identity has changed it'll have been
 * written out by BufferAlloc(), so there's no need for checkpointer to
 * write it out anymore. The buffer might also get written out by a
 * backend or bgwriter, but that's equally harmless.
 */
This new version included.
Also, qsort implementation
+ * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
+ * because of these possible concurrent changes.

Hm. Is that actually the case for our qsort implementation?
I think that it is hard to write a qsort which would fail that. That would
mean that it would compare the same items twice, which would be
inefficient.
If the pivot element changes its identity won't the result be pretty
much random?
That would be a very unlikely event, given the short time spent in qsort.
Anyway, this is not a problem, and that is the beauty of the "advisory" sort:
if the sort is wrong because of any such rare event, it just means that the
buffers would not be strictly in file order, which is currently the
case.... Well, too bad, but the correctness of the checkpoint does not
depend on it; it just means that the checkpointer would come back twice
to one file, no big deal.
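If this ever became a worry, the comparator could be made fully consistent
by snapshotting the tags once before sorting, and sorting the copies rather
than the live headers (a sketch, not what the attached patches do):

	typedef struct
	{
		Oid			spcNode;	/* copied once from the buffer tag */
		Oid			relNode;
		ForkNumber	forkNum;
		BlockNumber	blockNum;
		int			buf_id;
	} CkptSortItem;

The comparator would then read only the frozen copies, so
cmp(a,b) == -cmp(b,a) holds whatever backends do meanwhile, at the price
of a larger array.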
+	if (checkpoint_sort && num_to_write > 1 && false)
+	{

&& false - Huh?
Probably a leftover from Heikki's tests.
+		qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+			  (int(*)(const void *, const void *)) bufcmp);
+

Ick, I'd rather move the typecasts to the comparator.
Done.
+		for (i = 1; i < num_to_write; i++)
+		{ [...]

This really deserves some explanation.
I think that this version does not work. I've reinstated my version and a
lot of comments in the attached patches.
Please find attached two combined patches which provide both features one
after the other.
(a) shared buffer sorting
- I took Heikki's hint about restructuring the buffer selection in a
separate function, which makes the code much more readable.
- I also followed Heikki's intention (I think) that only active
tablespaces are considered in the switching loop.
(b) add asynchronous flushes on top of the previous sort patch
I think that the many performance results I reported show that the
improvements need both features, and one feature without the other is much
less effective at improving responsiveness, which is my primary concern.
The TPS improvements are just a side effect.
I did not remove the gucs: I think they could be kept so that people can
test around with them, and they may be removed in the future? I would
also be fine if they are removed.
There are a lot of comments in some places. I think that they should be
kept because the code is subtle.
--
Fabien.
Attachments:
checkpoint-continuous-flush-6-a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+        Whether to sort buffers before writing them out to disk on checkpoint.
+        For HDD storage, this setting allows grouping together
+        neighboring pages written to disk, thus improving performance by
+        reducing random write activity.
+        This sorting should have limited performance effects on SSD backends
+        as such storage has good random write performance, but it may
+        help with wear-leveling, so may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+   When hard-disk drives (HDD) are used for the data storage,
+   <xref linkend="guc-checkpoint-sort"> allows sorting pages
+   so that neighboring pages on disk will be flushed together by
+   checkpoints, reducing the random write load and improving performance.
+   If solid-state drives (SSD) are used, sorting pages induces no benefit
+   as their random write I/O performance is good: this feature could then
+   be disabled by setting <varname>checkpoint_sort</> to <value>off</>.
+   It is possible that sorting may help with SSD wear leveling, so it may
+   be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..c2bba56 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* Array of buffer ids of all buffers to checkpoint */
+static int * CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,146 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Compare checkpoint buffers.
+ * No lock is acquired, see comments below.
+ */
+static int bufcmp(const void * pa, const void * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(* (int *) pa),
+ *b = GetBufferDescriptor(* (int *) pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, note that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file", compare block number
+ * which are mapped on different segments depending on the number.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+static void AllocateCheckpointBufferIds(void)
+{
+ /*
+ * Safe worst case allocation, all buffers belong to the checkpoint...
+ * that is pretty unlikely. This allocation should be ok up to 4 GB
+ * for the current maximum possible NBuffers (8 TB of shared_buffers).
+ */
+ CheckpointBufferIds = (int *) palloc(sizeof(int) * (size_t) NBuffers);
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/* return the next buffer to write, or NULL if none.
+ * this function balances buffers over tablespaces.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one with is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (/* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index];
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1718,20 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
+
+ /*
+ * Lazy allocation: BufferSync is called through the checkpointer, but
+ * also by initdb. Maybe the allocation should be moved to these callers.
+ */
+ if (CheckpointBufferIds == NULL)
+ AllocateCheckpointBufferIds();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1762,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1786,107 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /*
+ * Sort buffer ids to help find sequential writes.
+ *
+ * Note: Buffers are not locked in any way during sorting, but that's ok:
+ * A change in the buffer header is only relevant when it changes the
+ * buffer's identity. If the identity has changed it'll have been
+ * written out by BufferAlloc(), so there's no need for checkpointer to
+ * write it out anymore. The buffer might also get written out by a
+ * backend or bgwriter, but that's equally harmless.
+ *
+ * Marked buffers must not be moved during the checkpoint.
+ * Also, qsort implementation should be resilient to occasional
+ * contradictions (cmp(a,b) != -cmp(b,a)) because of possible
+ * concurrent changes.
+ */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, marking this tablespace scan as done and
+ * decrementing the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1900,45 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so, another
+ * active tablespace status is moved in place of the current
+ * one and the next round will start on this one, or maybe round about.
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..ff95e61 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -1798,6 +1809,9 @@ static struct config_int ConfigureNamesInt[] =
/*
* We sometimes multiply the number of shared buffers by two without
* checking for overflow, so we mustn't allow more than INT_MAX / 2.
+ * Also, the checkpointer uses a malloced int array to store the indexes of
+ * shared buffers for sorting, which results in a SIZE_MAX / sizeof(int) limit,
+ * that is UINT_MAX / 4 == INT_MAX / 2 as well on a 32-bit system.
*/
{
{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
checkpoint-continuous-flush-6-b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cec243..2551d95 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+        data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,17 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+   allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+   at maximum throughput. This feature probably brings no benefit on SSDs,
+   as the I/O write latency is small on such hardware, so it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c2bba56..63bb628 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1725,6 +1729,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/*
* Lazy allocation: BufferSync is called through the checkpointer, but
@@ -1814,10 +1819,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1834,6 +1841,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1902,7 +1915,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1912,7 +1926,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1928,6 +1943,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1938,6 +1960,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2185,7 +2209,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2262,7 +2287,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2303,7 +2329,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2565,9 +2591,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2656,7 +2689,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3076,7 +3111,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3110,7 +3147,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3162,7 +3199,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to the IO layer
+ * so that they are considered for actually being written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that the data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the IO layer, although the system does not seem to
+ * take the provided offset & size into account, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ (int64) (context->offset / BLCKSZ),
+ (int64) (context->nbytes / BLCKSZ),
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such write have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ff95e61..c5c996c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9809,6 +9820,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* HAVE_SYNC_FILE_RANGE */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e84f380..66010b1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,7 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = off # send buffers to disk on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..db0e2c3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
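
To try both features together, the relevant postgresql.conf excerpt is
simply (GUC names as added by the patches above; both can be changed
with a reload, being PGC_SIGHUP):

    checkpoint_sort = on                 # sort buffers on checkpoint
    checkpoint_flush_to_disk = on        # sync_file_range on Linux,
                                         # posix_fadvise elsewhere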
On 2015-08-10 19:07:12 +0200, Fabien COELHO wrote:
I think that there is no issue with the current shared_buffers limit. I
could allocate and use 4 GB on my laptop without problem. I added a cast to
ensure that unsigned ints are used for the size computation.
You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.
+	/*
+	 * Lazy allocation: this function is called through the checkpointer,
+	 * but also by initdb. Maybe the allocation could be moved to the
+	 * callers.
+	 */
+	if (CheckpointBufferIds == NULL)
+		AllocateCheckpointBufferIds();

I don't think it's a good idea to allocate this on every round. That just
means a lot of page table entries have to be built and torn down
regularly. It's not like checkpoints only run for 1% of the time or such.

Sure. It is not allocated on every round, it is allocated once on the first
checkpoint, the variable tested is static. There is no free. Maybe
the allocation could be moved to the callers, though.
Well, then every time the checkpointer is restarted.
FWIW, I still think it's a much better idea to allocate the memory once
in shared buffers.

Hmmm. The memory does not need to be shared with other processes?

The point is that it's done at postmaster startup, and we're pretty much
guaranteed that the memory will be available.
It's not like that makes us need more memory overall, and it'll be huge
page allocations if configured. I also think that sooner rather than
later we're going to need more than one process flushing buffers, and
then it'll need to be moved there.

That is an argument. I think that it could wait for the need to actually
arise.

Huge pages are used today.
+	/*
+	 * Sort buffer ids to help find sequential writes.
+	 *
+	 * Note: buffers are not locked in any way, but that does not matter,
+	 * this sorting is really advisory, if some buffer changes status during
+	 * this pass it will be filtered out later. The only necessary property
+	 * is that marked buffers do not move elsewhere.
+	 */

That reasoning makes it impossible to move the fsyncing of files into the
loop (whenever we move to a new file). That's not nice.

I do not see why.

Because it means that the sorting isn't necessarily correct. I.e. we
can't rely on it to determine whether a file has already been fsynced.
+	 * Also, qsort implementation should be resilient to occasional
+	 * contradictions (cmp(a,b) != -cmp(b,a)) because of these possible
+	 * concurrent changes.

Hm. Is that actually the case for our qsort implementation?
I think that it is hard to write a qsort which would fail that. That would
mean that it would compare the same items twice, which would be inefficient.
What? The same two elements aren't frequently compared pairwise with
each other, but of course an individual element is frequently compared
with other elements. Consider what happens when the chosen pivot element
changes its identity after it has already divided half of the elements.
The two partitions will not be divided in any meaningful way anymore.
I don't see how this will result in a meaningful sort.
If the pivot element changes its identity won't the result be pretty much
random?

That would be a very unlikely event, given the short time spent in
qsort.
Meh, we don't want to rely on "likeliness" on such things.
Greetings,
Andres Freund
Hello Andres,
You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.
Argh, too bad, I assumed very naively that palloc was malloc in disguise.
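
For the archives, a minimal sketch of that limit (the constant lives in
src/include/utils/memutils.h; illustration only, not patch code):

    int    *ids;

    /* palloc() rejects requests above MaxAllocSize ((Size) 0x3fffffff),
     * i.e. about 1 GB, with "invalid memory alloc request size" */
    ids = (int *) palloc(mul_size(NBuffers, sizeof(int)));

    /* a process-local array beyond that would need the "huge" variant */
    ids = (int *) MemoryContextAllocHuge(CurrentMemoryContext,
                                         mul_size(NBuffers, sizeof(int)));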
[...]
Well, then every time the checkpointer is restarted.
Hm...
The point is that it's done at postmaster startup, and we're pretty much
guaranteed that the memory will be available.
Ok ok, I stop resisting... I'll have a look.
Would it also fix the 1 GB palloc limit on the same go? I guess so...
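
For reference, a sketch of the shared-memory route (this is what the v7
patch below does in buf_init.c): the array is carved out of the main
segment at postmaster startup, so the palloc cap does not apply.

    bool    found;

    /* the size must also be counted where the segment is sized,
     * e.g. in BufferShmemSize() */
    CheckpointBufferIds = (int *)
        ShmemInitStruct("Checkpoint BufferIds",
                        mul_size(NBuffers, sizeof(int)), &found);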
That reasoning makes it impossible to move the fsyncing of files into the
loop (whenever we move to a new file). That's not nice.

I do not see why.

Because it means that the sorting isn't necessarily correct. I.e. we
can't rely on it to determine whether a file has already been fsynced.
Ok, I understand your point.
Then the file would be fsynced twice: if the first fsync was done properly
(the data have already been flushed to disk) then the second would not
cost much, and fsyncing some file twice on occasion would not be a big
issue. The code could also detect such an event and log a warning, which
would give a hint about how often it occurs in practice.
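
A minimal sketch of such detection (hypothetical, not in the patch): keep
a hash of the files already fsynced during the write loop, along the
lines of the spcBuffers table in the patch below, and warn on a repeat.

    HASHCTL     ctl;
    HTAB       *synced;
    bool        found;

    MemSet(&ctl, 0, sizeof(HASHCTL));
    ctl.keysize = sizeof(RelFileNode);
    ctl.entrysize = sizeof(RelFileNode);
    synced = hash_create("Fsynced files", 64, &ctl,
                         HASH_ELEM | HASH_BLOBS);

    /* ... whenever a file with RelFileNode "rnode" gets fsynced ... */
    (void) hash_search(synced, (void *) &rnode, HASH_ENTER, &found);
    if (found)
        elog(WARNING, "file fsynced twice during checkpoint");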
Hm. Is that actually the case for our qsort implementation?
I think that it is hard to write a qsort which would fail that. That would
mean that it would compare the same items twice, which would be
inefficient.

What? The same two elements aren't frequently compared pairwise with
each other, but of course an individual element is frequently compared
with other elements.
Sure.
Consider what happens when the chosen pivot element changes its identity
after it has already divided half of the elements. The two partitions
will not be divided in any meaningful way anymore. I don't see how this
will result in a meaningful sort.
It would be partly meaningful, which is enough for performance, and does
not matter for correctness: currently buffers are not sorted at all and it
works, even if it does not work well.
If the pivot element changes its identity won't the result be pretty much
random?

That would be a very unlikely event, given the short time spent in qsort.

Meh, we don't want to rely on "likeliness" on such things.
My main argument is that even if it occurs, and the qsort result is partly
wrong, it does not change correctness, it just means that the actual
writes will be less ordered than intended. If it occurs, one pivot
separation would be quite strange, but then the others would be right, so
the buffers would still be "partly sorted".
Another issue I see is that even if buffers were locked within cmp, the
status might still change between two cmp calls... I do not think that
locking all buffers for the duration of the sort is an option. So on the
whole, I think that locking buffers for sorting is probably not possible
with the simple (and efficient) lightweight approach used in the patch.
The good news, as I argued before, is that the order is only advisory, to
help with performance; correctness only requires that all checkpoint
buffers are written and that fsync is called at the end, and it does not
depend on the buffer order. That is how it currently works anyway.
If you block on this then I'll switch to a heavyweight approach, but that
would be a waste of memory in my opinion, hence my argument for the
lightweight approach.
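
To make the "partly sorted" claim concrete, here is a toy standalone C
program (not PostgreSQL code) whose comparator reads keys that may mutate
mid-sort, as a buffer changing identity would. Strictly speaking an
inconsistent comparator is undefined behavior for qsort(3), which rather
supports your caution, but the typical outcome is a mostly ordered array:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 100000

    static int keys[N];             /* stand-in for buffer tags */

    static int
    cmp(const void *pa, const void *pb)
    {
        int     a = *(const int *) pa;
        int     b = *(const int *) pb;

        if (rand() % 100000 == 0)   /* rare concurrent "update" */
            keys[rand() % N] = rand();

        return (keys[a] > keys[b]) - (keys[a] < keys[b]);
    }

    int
    main(void)
    {
        static int ids[N];
        int     i, inversions = 0;

        for (i = 0; i < N; i++)
        {
            ids[i] = i;
            keys[i] = rand();
        }

        qsort(ids, N, sizeof(int), cmp);

        for (i = 1; i < N; i++)
            if (keys[ids[i - 1]] > keys[ids[i]])
                inversions++;

        printf("%d adjacent inversions out of %d pairs\n",
               inversions, N - 1);
        return 0;
    }

The handful of inversions reported would translate into a few
out-of-order writes, not into a correctness problem.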
--
Fabien.
Ok ok, I stop resisting... I'll have a look.
Here is a v7 version (patches a & b) which uses shared memory instead of palloc.
--
Fabien.
Attachments:
checkpoint-continuous-flush-7-a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+        Whether to sort buffers before writing them out to disk on checkpoint.
+        For HDD storage, this setting makes it possible to group together
+        neighboring pages written to disk, thus improving performance by
+        reducing random write activity.
+        This sorting should have limited performance effects on SSD backends,
+        as such storage has good random write performance, but it may
+        help with wear leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for data storage,
+ <xref linkend="guc-checkpoint-sort"> makes it possible to sort pages
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages brings little benefit,
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..ec2436f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (int *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(int), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..ba5298d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* Array of buffer ids of all buffers to checkpoint */
+int * CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,136 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Compare checkpoint buffers.
+ * No lock is acquired, see comments below.
+ */
+static int bufcmp(const void * pa, const void * pb)
+{
+ BufferDesc
+ *a = GetBufferDescriptor(* (int *) pa),
+ *b = GetBufferDescriptor(* (int *) pb);
+
+ /* tag: rnode, forkNum (different files), blockNum
+ * rnode: { spcNode (ignore: not really needed),
+ * dbNode (ignore: this is a directory), relNode }
+ * spcNode: table space oid, note that there are at least two
+ * (pg_global and pg_default).
+ */
+ /* compare relation */
+ if (a->tag.rnode.relNode < b->tag.rnode.relNode)
+ return -1;
+ else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->tag.forkNum < b->tag.forkNum)
+ return -1;
+ else if (a->tag.forkNum > b->tag.forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file"; compare block numbers,
+ * which are mapped to different segments depending on their value.
+ */
+ else if (a->tag.blockNum < b->tag.blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/* return the next buffer to write, or NULL if none.
+ * this function balances buffers over tablespaces.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (/* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index];
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1708,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1745,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1769,107 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Sort buffer ids to help find sequential writes.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Note: Buffers are not locked in any way during sorting, but that's ok:
+ * A change in the buffer header is only relevant when it changes the
+ * buffer's identity. If the identity has changed it'll have been
+ * written out by BufferAlloc(), so there's no need for checkpointer to
+ * write it out anymore. The buffer might also get written out by a
+ * backend or bgwriter, but that's equally harmless.
+ *
+ * Marked buffers must not be moved during the checkpoint.
+ * Also, qsort implementation should be resilient to occasional
+ * contradictions (cmp(a,b) != -cmp(b,a)) because of possible
+ * concurrent changes.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(int),
+ (int(*)(const void *, const void *)) bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
+ /*
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
+ *
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write; this tablespace's scan is then marked as done
+ * and the number of (active) spaces is decremented, eventually reaching 0.
+ */
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1883,45 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so,
+ * another active tablespace's status is moved into the place of the
+ * current one and the next round will start on it, or wrap around.
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..ff95e61 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -1798,6 +1809,9 @@ static struct config_int ConfigureNamesInt[] =
/*
* We sometimes multiply the number of shared buffers by two without
* checking for overflow, so we mustn't allow more than INT_MAX / 2.
+ * Also, checkpoint uses a malloced int array to store index of shared
+ * buffers for sorting, which results in a SIZE_MAX / sizeof(int) limit,
+ * that is UINT_MAX / 4 == INT_MAX / 2 as well on a 32 bits system.
*/
{
{"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..4cb3a60 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,8 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+extern int *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
checkpoint-continuous-flush-7-b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cec243..2551d95 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+        data must be sent to disk as soon as possible. This may help smooth
+        disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,17 @@
</para>
<para>
+   On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+   makes it possible to hint the OS that pages written on checkpoints must be
+   flushed to disk quickly. Otherwise, these pages may be kept in cache for
+   some time, inducing a stall later when <literal>fsync</> is called to
+   actually complete the checkpoint. This setting helps to reduce transaction
+   latency, but it may also have a small adverse effect on the average
+   transaction rate at maximum throughput. This feature probably brings no
+   benefit on SSD, as the I/O write latency is small on such hardware, so it
+   may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ba5298d..aa7694c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1715,6 +1719,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1797,10 +1802,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1817,6 +1824,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1885,7 +1898,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1895,7 +1909,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1911,6 +1926,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1921,6 +1943,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2168,7 +2192,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2245,7 +2270,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2286,7 +2312,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2548,9 +2574,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter hints the OS that a high-priority write is intended,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context, which accumulates
+ * flush requests so that they can be performed in one call, instead of
+ * buffer by buffer.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2639,7 +2672,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3059,7 +3094,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3093,7 +3130,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3145,7 +3182,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to the IO layer
+ * so that they are considered for actually being written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that the data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the IO layer, although the system does not seem to
+ * take the provided offset & size into account, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ (int64) (context->offset / BLCKSZ),
+ (int64) (context->nbytes / BLCKSZ),
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such write have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ff95e61..c5c996c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9809,6 +9820,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e84f380..66010b1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,7 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = off # send buffers to disk on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..db0e2c3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
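
The range merge in FileAsynchronousFlush above simply widens the accumulated
flush interval to the smallest interval that also covers the new write. A
minimal standalone sketch of that arithmetic (plain C, illustration only,
not part of the patch):

    #include <sys/types.h>

    typedef struct FlushRange
    {
        off_t   offset;     /* start of the accumulated range */
        off_t   nbytes;     /* length of the accumulated range */
    } FlushRange;

    /* extend *r to cover both r and [offset, offset + nbytes) */
    static void
    merge_range(FlushRange *r, off_t offset, off_t nbytes)
    {
        off_t   new_start = (offset < r->offset) ? offset : r->offset;
        off_t   old_end = r->offset + r->nbytes;
        off_t   new_end = offset + nbytes;

        r->nbytes = ((old_end > new_end) ? old_end : new_end) - new_start;
        r->offset = new_start;
    }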
On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Andres,
You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.
Then the file would be fsynced twice: if the fsync is done properly (data
have already been flushed to disk) then it would not cost much, and doing
it sometimes twice on some file would not be a big issue. The code could
also detect such an event and log a warning, which would give a hint about
how often it occurs in practice.

Right. At the cost of keeping track of all files...
If the pivot element changes its identity won't the result be pretty much
random?

That would be a very unlikely event, given the short time spent in qsort.

Meh, we don't want to rely on "likeliness" for such things.
My main argument is that even if it occurs, and the qsort result is partly
wrong, it does not change correctness, it just means that the actual writes
will be less in order than wished. If it occurs, one pivot separation would
be quite strange, but then the others would be right, so the buffers would
be "partly sorted".

It doesn't matter for correctness today, correct. But it makes it impossible to rely on the order, too.
Another issue I see is that even if buffers are locked within cmp, the
status may change between two cmp calls...

Sure. That's not what I'm suggesting. Earlier versions of the patch kept an array of buffer headers exactly because of that.
I do not think that locking all buffers for sorting them is an option. So
on the whole, I think that locking buffers for sorting is probably not
possible with the simple (and efficient) lightweight approach used in the
patch.

Yes, the other version has a higher space overhead. I'm not convinced that's meaningful in comparison to the size of shared buffers.

And I'm rather doubtful it's a loss performance-wise on a loaded server. All the buffer headers are touched on other cores and doing the sort with indirection will greatly increase bus traffic.
The good news, as I argued before, is that the order is only advisory to
help with performance; the correctness requirement is really that all
checkpoint buffers are written and fsync is called in the end, and that
does not depend on the buffer order. That is how it currently works anyway.

It's not particularly desirable to have a performance feature that works less well if the server is heavily and concurrently loaded. The likelihood of bogus sort results will increase with the churn rate in shared buffers.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On Tue, Aug 11, 2015 at 4:28 AM, Andres Freund wrote:
On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO wrote:
You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.
palloc_huge does not exist yet ;)
There is either repalloc_huge or palloc_extended now, though
implementing one would be trivial.
--
Michael
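
A minimal sketch of a huge allocation using the backend API Michael
mentions (assuming a 64-bit build; illustration only, not part of the
patch):

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * palloc() rejects requests above MaxAllocSize (~1 GB);
     * palloc_extended() with MCXT_ALLOC_HUGE allows larger requests,
     * up to MaxAllocHugeSize.
     */
    static char *
    alloc_four_gigabytes(void)
    {
        Size    bytes = ((Size) 4) << 30;   /* 4 GB */

        return (char *) palloc_extended(bytes, MCXT_ALLOC_HUGE);
    }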
Hello Andres,
[...] Right. At the cost of keeping track of all files...
Sure. Pg already tracks all files, and probably some more tracking would
be necessary for an early fsync feature to know which files have already
been fsync'ed and which have not.
Yes, the other version has a higher space overhead.
Yep, this is my concern.
I'm not convinced that's meaningful in comparison to the size of shared
buffers. And I'm rather doubtful it's a loss performance-wise on a loaded
server. All the buffer headers are touched on other cores and doing the
sort with indirection will greatly increase bus traffic.
The measures I collected and reported showed that the sorting time is
basically insignificant, so bus traffic induced by sorting does not seem
to be an issue.
[...] It's not particularly desirable to have a performance feature that
works less well if the server is heavily and concurrently loaded. The
likelihood of bogus sort results will increase with the churn rate in
shared buffers.
Hm.
In conclusion I'm not convinced that it is worth the memory, but I'm also
tired of arguing, and hopefully nobody else cares about a few more bytes
per buffer in shared_buffers, so why should I care?
Here is a v8; I reduced the memory overhead of the "heavy weight" approach
from 24 to 16 bytes per buffer, so it is medium weight :-). It might be
compacted further down to 12 bytes by combining the 2 bits of forkNum
either with relNode or blockNum, and using a uint64_t comparison field with
all the data so that the comparison code would be simpler and faster (a
rough sketch below).
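
For illustration, such a compacted item might look like this (hypothetical,
not part of the attached patches; it assumes relNode fits in 30 bits, which
is not guaranteed for an Oid, so a real implementation would need a
fallback):

    #include "postgres.h"
    #include "common/relpath.h"
    #include "storage/block.h"

    /* hypothetical compact sort item with a single packed comparison key;
     * note: alignment pads this to 16 bytes unless the struct is packed */
    typedef struct CompactSortItem
    {
        int     buf_id;
        uint64  key;        /* relNode:30 | forkNum:2 | blockNum:32 */
    } CompactSortItem;

    static inline uint64
    make_sort_key(Oid relNode, ForkNumber forkNum, BlockNumber blockNum)
    {
        return ((uint64) relNode << 34) |
               ((uint64) forkNum << 32) |
               (uint64) blockNum;
    }

    /* comparator reduces to one unsigned 64-bit comparison */
    static int
    compact_cmp(const void *pa, const void *pb)
    {
        const CompactSortItem *a = pa;
        const CompactSortItem *b = pb;

        return (a->key > b->key) - (a->key < b->key);
    }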
I also fixed the computation of the shmem size which I had not updated
when switching to shmem.
The patches still include the two GUCs, but it is easy to remove one or the
other. They are useful if someone wants to test. The default is on for
sort, and off for flush. Maybe it should be on for both.
--
Fabien.
Attachments:
checkpoint-continuous-flush-8-a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting makes it possible to group together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSDs,
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for data storage,
+ <xref linkend="guc-checkpoint-sort"> makes it possible to sort pages
+ so that neighboring pages on disk are flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages brings no benefit
+ as their random write I/O performance is good; the feature can then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..c5643ce 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,129 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Compare checkpoint buffers.
+ */
+static int bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file"; compare block numbers,
+ * which map to different segments depending on their value.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/* return the next buffer to write, or -1.
+ * this function balances buffers over tablespaces.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: such a tablespace is bound to exist, otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (/* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1701,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1738,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1762,111 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Sort buffer ids to help find sequential writes.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Note: Buffers are not locked in any way during sorting, but that's ok:
+ * A change in the buffer header is only relevant when it changes the
+ * buffer's identity. If the identity has changed it'll have been
+ * written out by BufferAlloc(), so there's no need for checkpointer to
+ * write it out anymore. The buffer might also get written out by a
+ * backend or bgwriter, but that's equally harmless.
+ *
+ * Marked buffers must not be moved during the checkpoint.
+ * Also, the qsort implementation should be resilient to occasional
+ * contradictions (cmp(a,b) != -cmp(b,a)) because of possible
+ * concurrent changes.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
+ /*
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
+ *
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of (active) spaces, which will thus reach 0.
+ */
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1880,45 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so, another
+ * active tablespace status is moved in place of the current
+ * one and the next round will start on it, or wrap around.
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..cf1e505 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..7fde0dc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,22 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * Maybe the sort criterion could be compacted to reduce memory requirement
+ * and for faster comparison?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
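
The overflow note in NextBufferToWrite's comment can be checked with a
self-contained sketch (not part of the patch): with 32-bit arithmetic the
cross-multiplied progress comparison overflows once counts exceed
sqrt(2^31), about 46340 buffers:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int32_t written = 50000;    /* > 46340, i.e. > sqrt(2^31) */
        int32_t to_write = 60000;

        /* a 32-bit product "written * to_write" would overflow here
         * (3,000,000,000 > INT32_MAX), so the patch casts one operand
         * to int64 before comparing w1/t1 > w/t as w1*t > w*t1 */
        int64_t lhs = (int64_t) written * to_write;

        printf("%lld\n", (long long) lhs);  /* 3000000000 */
        return 0;
    }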
checkpoint-continuous-flush-8-b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cec243..2551d95 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,17 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ makes it possible to hint the OS that pages written on checkpoints must
+ be flushed to disk quickly. Otherwise, these pages may be kept in cache
+ for some time, inducing a stall later when <literal>fsync</> is called
+ to actually complete the checkpoint. This setting helps to reduce
+ transaction latency, but it may also have a small adverse effect on the
+ average transaction rate at maximum throughput. This feature probably
+ brings no benefit on SSDs, as the I/O write latency is small on such
+ hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c5643ce..251bee2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = false;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1708,6 +1712,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1793,10 +1798,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1813,6 +1820,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1882,7 +1895,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1892,7 +1906,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1908,6 +1923,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1918,6 +1940,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2165,7 +2189,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2242,7 +2267,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2283,7 +2309,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2545,9 +2571,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter hints to the OS that a high-priority write is meant,
+ * possibly because I/O throttling is already managed elsewhere.
+ * The last parameter holds the current flush context, which accumulates
+ * flush requests so they can be performed in one call instead of on a
+ * buffer-by-buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2636,7 +2669,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3056,7 +3091,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3090,7 +3127,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3142,7 +3179,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cf1e505..617d511 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ false,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9806,6 +9817,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e84f380..66010b1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,7 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = off # send buffers to disk on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..db0e2c3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
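The fd.h hunk above documents how FileFlushContext accumulates several
flush requests into a single covering range (minimum offset, extent
covering all flushed data). As a standalone illustration of that merging
arithmetic (a hypothetical helper mirroring what FileAsynchronousFlush
does for same-fd requests; not patch code):

#include <stdio.h>
#include <sys/types.h>

/* Merge a new [offset, offset + nbytes) flush request into the
 * accumulated covering range: keep the minimum offset and extend
 * nbytes so the range covers all requests seen so far. */
static void
merge_range(off_t *acc_offset, off_t *acc_nbytes, off_t offset, off_t nbytes)
{
    off_t acc_end = *acc_offset + *acc_nbytes;
    off_t new_end = offset + nbytes;

    if (new_end > acc_end)
        acc_end = new_end;
    if (offset < *acc_offset)
        *acc_offset = offset;
    *acc_nbytes = acc_end - *acc_offset;
}

int
main(void)
{
    off_t offset = 1 * 8192, nbytes = 8192;        /* block 1 pending */

    merge_range(&offset, &nbytes, 4 * 8192, 8192); /* add block 4 */
    printf("covering range: offset=%ld, nbytes=%ld\n",
           (long) offset, (long) nbytes);          /* 8192, 32768 */
    return 0;
}

Note that the covering range may span blocks that were never written in
between; the merge trades flush precision for fewer system calls.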
Here is a v8,
I collected a few performance figures with this patch on an old box with 8
cores, 16 GB, RAID 1 HDD, under Ubuntu precise.
postgresql.conf:
shared_buffers = 4GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
max_wal_size = 4GB
init> pgbench -i -s 250
warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4
# 400 tps throttled "simple update" test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4
sort/flush : percent of skipped/late transactions
on on : 2.7
on off : 16.2
off on : 68.4
off off : 68.7
# 200 tps
sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4
sort/flush : percent of skipped/late transactions
on on : 2.7
on off : 9.5
off on : 47.4
off off : 48.8
The large "percent of skipped/late transactions" is to be understood as
"fraction of time with postgresql offline because of a write stall".
# full speed 1 client
sh> pgbench -M prepared -N -P 1 -T 4000
sort/flush : tps avg & stddev (percent of time below 10.0 tps)
on on : 631 +- 131 (0.1%)
on off : 564 +- 303 (12.0%)
off on : 167 +- 315 (76.8%) # stuck...
off off : 177 +- 305 (71.2%) # ~ current pg
# full speed 2 threads 4 clients
sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
sort/flush : tps avg & stddev (percent of time below 10.0 tps)
on on : 1058 +- 455 (0.1%)
on off : 1056 +- 942 (32.8%)
off on : 170 +- 500 (88.3%) # stuck...
off off : 209 +- 506 (82.0%) # ~ current pg
The combined features provide a tps speedup of 3-5 on these runs, and
allow some control over write stalls. Flushing is not effective on
unsorted buffers, at least in these examples.
--
Fabien.
Hi Fabien,
On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
sort/flush : tps avg & stddev (percent of time below 10.0 tps)
on on : 631 +- 131 (0.1%)
on off : 564 +- 303 (12.0%)
off on : 167 +- 315 (76.8%) # stuck...
off off : 177 +- 305 (71.2%) # ~ current pg
What exactly do you mean with 'stuck'?
- Andres
On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
I'm a bit wary that this might cause significant regressions on
platforms not supporting sync_file_range, but support posix_fadvise()
for workloads that are bigger than shared_buffers. Consider what happens
if the workload does *not* fit into shared_buffers but *does* fit into
the OS's buffer cache. Suddenly reads will go to disk again, no?
Greetings,
Andres Freund
Hello Andres,
On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
sort/flush : tps avg & stddev (percent of time below 10.0 tps)
on on : 631 +- 131 (0.1%)
on off : 564 +- 303 (12.0%)
off on : 167 +- 315 (76.8%) # stuck...
off off : 177 +- 305 (71.2%) # ~ current pg
What exactly do you mean with 'stuck'?
I mean that during the I/O storms induced by the checkpoint, pgbench
sometimes gets stuck, i.e. it does not report its progress every second (I
run with "-P 1"). This occurs when sort is off, either with or without
flush, for instance an extract from the off/off medium run:
progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan
There is a 239.4 seconds gap in pgbench output. This occurs from time to
time and may represent a significant part of the run, and I count these
"stuck" times as 0 tps. Sometimes pgbench is stuck performance-wise but
nevertheless manages to report a "0.0 tps" every second, as above after it
becomes unstuck.
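As an aside on methodology: such gaps can be extracted mechanically from
the pgbench -P 1 log. A minimal sketch (not from the thread; the 2 second
threshold is an arbitrary assumption, since reports are expected every
~1 second):

#include <stdio.h>

int
main(void)
{
    char   line[1024];
    double t, prev = -1.0;

    while (fgets(line, sizeof(line), stdin))
    {
        /* pgbench -P prints "progress: <time> s, <tps> tps, ..." */
        if (sscanf(line, "progress: %lf s,", &t) != 1)
            continue;
        if (prev >= 0.0 && t - prev > 2.0)
            printf("gap of %.1f s between %.1f s and %.1f s\n",
                   t - prev, prev, t);
        prev = t;
    }
    return 0;
}

On the off/off extract above, this would report the 239.4 second gap
between 575.0 s and 814.4 s.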
The actual origin of the issue with a stuck client (pgbench, libpq, OS,
postgres...) is unclear to me, but the whole system does not behave well
under an I/O storm anyway, and I have not succeeded in understanding where
pgbench is stuck when it does not report its progress. I tried some runs
with gdb but it did not get stuck and reported a lot of "0.0 tps" during
the storms.
Here are a few more figures with the v8 version of the patch, on a host
with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported
the medium case; the small case was run afterwards.
small postgresql.conf:
shared_buffers = 2GB
checkpoint_timeout = 300s # this is the default
checkpoint_completion_target = 0.8
# initialization: pgbench -i -s 120
medium postgresql.conf: ## ALREADY REPORTED
shared_buffers = 4GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
max_wal_size = 4GB
# initialization: pgbench -i -s 250
warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4
# 400 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4
options / percent of skipped/late transactions
sort/flush / small medium
on on : 3.5 2.7
on off : 24.6 16.2
off on : 66.1 68.4
off off : 63.2 68.7
# 200 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4
options / percent of skipped/late transactions
sort/flush / small medium
on on : 1.9 2.7
on off : 14.3 9.5
off on : 45.6 47.4
off off : 47.4 48.8
# 100 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4
options / percent of skipped/late transactions
sort/flush / small medium
on on : 0.9 1.8
on off : 9.3 7.9
off on : 5.0 13.0
off off : 31.2 31.9
# full speed 1 client
sh> pgbench -M prepared -N -P 1 -T 4000
options / tps avg & stddev (percent of time below 10.0 tps)
sort/flush / small medium
on on : 564 +- 148 ( 0.1%) 631 +- 131 ( 0.1%)
on off : 470 +- 340 (21.7%) 564 +- 303 (12.0%)
off on : 157 +- 296 (66.2%) 167 +- 315 (76.8%)
off off : 154 +- 251 (61.5%) 177 +- 305 (71.2%)
# full speed 2 threads 4 clients
sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
options / tps avg & stddev (percent of time below 10.0 tps)
sort/flush / small medium
on on : 757 +- 417 ( 0.1%) 1058 +- 455 ( 0.1%)
on off : 752 +- 893 (48.4%) 1056 +- 942 (32.8%)
off on : 173 +- 521 (83.0%) 170 +- 500 (88.3%)
off off : 199 +- 512 (82.5%) 209 +- 506 (82.0%)
In all cases, "sort on & flush on" provides the best results, with a tps
speedup of 3-5 and overall high responsiveness (& lower latency).
--
Fabien.
<Oops, stalled post, sorry wrong "From", resent..>
Hello Andres,
+ rc = posix_fadvise(context->fd, context->offset, [...]
I'm a bit wary that this might cause significant regressions on
platforms not supporting sync_file_range, but support posix_fadvise()
for workloads that are bigger than shared_buffers. Consider what happens
if the workload does *not* fit into shared_buffers but *does* fit into
the OS's buffer cache. Suddenly reads will go to disk again, no?
That is an interesting question!
My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
implementation of posix_fadvise, so it may differ between OSes.
This is a reason why I think that flushing should be kept a guc, even if the
sort guc is removed and always on. The sync_file_range implementation is
clearly always very beneficial for Linux, and the posix_fadvise may or may
not induce a good behavior depending on the underlying system.
This is also a reason why the default value for the flush guc is currently
set to false in the patch. The documentation should advise turning it on
for Linux and testing otherwise. Or, if Linux is assumed to often be the
host, maybe set the default to on and suggest that on some systems it may
be better to have it off. (Another reason to keep it "off" is that I'm not
sure about what happens with such HD flushing features on virtual servers).
Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
and it was as bad as Linux (namely the database and even the box was offline
for long minutes...), and if you can avoid that, having to read back some
data may not be that bad a down payment.
The issue is largely mitigated if the data is not removed from
shared_buffers, because the OS buffer is just a copy of already held data.
What I would do on such systems is to increase shared_buffers and keep
flushing on, that is to count less on the system cache and more on postgres
own cache.
Overall, I'm not convinced that the practice of relying on the OS cache is a
good one, given what it does with it, at least on Linux.
Now, if someone could provide a dedicated box with posix_fadvise (say
FreeBSD, maybe others...) for testing, that would allow providing data
instead of speculating... and then maybe deciding to change its default
value.
--
Fabien.
On 2015-08-17 15:21:22 +0200, Fabien COELHO wrote:
My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
implementation of posix_fadvise, so it may differ between OSes.
As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.
This is a reason why I think that flushing should be kept a guc, even if the
sort guc is removed and always on. The sync_file_range implementation is
clearly always very beneficial for Linux, and the posix_fadvise may or may
not induce a good behavior depending on the underlying system.
That's certainly an argument.
This is also a reason why the default value for the flush guc is currently
set to false in the patch. The documentation should advise turning it on
for Linux and testing otherwise. Or, if Linux is assumed to often be the
host, maybe set the default to on and suggest that on some systems it may
be better to have it off.
I'd say it should then be an os-specific default. No point in making
people work for it needlessly on linux and/or elsewhere.
(Another reason to keep it "off" is that I'm not sure about what
happens with such HD flushing features on virtual servers).
I don't see how that matters? Either the host will entirely ignore
flushing, and thus the sync_file_range and the fsync won't cost much, or
fsync will be honored, in which case the pre-flushing is helpful.
Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
and it was as bad as Linux (namely the database and even the box was offline
for long minutes...), and if you can avoid that, having to read back some
data may not be that bad a down payment.
I don't see how that'd alleviate my fear. Sure, the latency for many
workloads will be better, but I don't see how that argument says anything
about the reads? And we'll not just use this in cases it'd be
beneficial...
The issue is largely mitigated if the data is not removed from
shared_buffers, because the OS buffer is just a copy of already held data.
What I would do on such systems is to increase shared_buffers and keep
flushing on, that is to count less on the system cache and more on postgres
own cache.
That doesn't work that well for a bunch of reasons. For one it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* it be available should a query or
another program need lots of memory.
Overall, I'm not convinced that the practice of relying on the OS cache is a
good one, given what it does with it, at least on Linux.
The alternatives aren't super realistic near-term though. Using direct
IO efficiently on the set of operating systems we support is
*hard*. It's more or less trivial to hack pg up to use direct IO for
relations/shared_buffers, but it'll perform utterly horribly in many
many cases.
To pick one thing out: Without the OS buffering writes any write will
have to wait for the disks, instead being asynchronous. That'll make
writes performed by backends a massive bottleneck.
Now, if someone could provide a dedicated box with posix_fadvise (say
FreeBSD, maybe others...) for testing that would allow to provide data
instead of speculating... and then maybe to decide to change its default
value.
Testing, as an approximation, how it turns out to work on linux would be
a good step.
Greetings,
Andres Freund
Hello Andres,
[...] posix_fadvise().
My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
implementation of posix_fadvise, so it may differ between OSes.
As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.
Yep, probably.
Note that we are talking about checkpoints, which "write" buffers out
*but* keep them nevertheless. As the buffer is kept, the OS page is a
duplicate, and freeing it should not harm, at least not immediately.
The situation is different if the memory is reused in between, which is
the work of the bgwriter I think, based on LRU/LFU heuristics, but such
writes are not flushed by the current patch.
Now, if a buffer was recently updated it should not be selected by the
bgwriter, if the LRU/LFU heuristic works as expected, which mitigates the
issue somewhat...
To sum up, I agree that it is indeed possible that flushing with
posix_fadvise could reduce read OS-memory hits on some systems for some
workloads, although not on Linux, see below.
So the option is best kept as "off" for now, without further data; I'm
fine with that.
[...] I'd say it should then be an os-specific default. No point in
making people work for it needlessly on linux and/or elsewhere.
Ok. Version 9 attached does that, "on" for Linux, "off" for others because
of the potential issues you mentioned.
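For reference, a plausible sketch of how such an OS-dependent default can
be expressed; DEFAULT_CHECKPOINT_FLUSH_TO_DISK is the symbol used by the
v9 patch below, but its exact definition site is an assumption here:

/* Default for the checkpoint_flush_to_disk GUC: "on" where the reliably
 * beneficial sync_file_range() is available (Linux), "off" elsewhere,
 * where only the more speculative posix_fadvise() fallback exists. */
#if defined(HAVE_SYNC_FILE_RANGE)
#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
#else
#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
#endif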
(Another reason to keep it "off" is that I'm not sure about what
happens with such HD flushing features on virtual servers).
I don't see how that matters? Either the host will entirely ignore
flushing, and thus the sync_file_range and the fsync won't cost much, or
fsync will be honored, in which case the pre-flushing is helpful.
Possibly. I know that I do not know:-) The distance between the database
and the real hardware is so great in a VM that I think it may have any
effect, including good, bad or none:-)
Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
and it was as bad as Linux (namely the database and even the box was offline
for long minutes...), and if you can avoid that, having to read back some
data may not be that bad a down payment.
I don't see how that'd alleviate my fear.
I'm trying to mitigate your fears, not to alleviate them:-)
Sure, the latency for many workloads will be better, but I don't see how
that argument says anything about the reads?
It just says that there may be a compromise, better in some cases, possibly
not so in others, because posix_fadvise does not really say what the
database would like to say to the OS, which is why I wrote such a large
comment about it in the source file in the first place.
And we'll not just use this in cases it'd be beneficial...
I'm fine if it is off by default on some systems. If people want to avoid
write stalls they can use the option, but it may have an adverse effect on
the tps in some cases; that's life? Not using the option also has adverse
effects in some cases, because you have write stalls... and currently you
do not have the choice, so it would be progress.
The issue is largely mitigated if the data is not removed from
shared_buffers, because the OS buffer is just a copy of already held data.
What I would do on such systems is to increase shared_buffers and keep
flushing on, that is to count less on the system cache and more on postgres
own cache.
That doesn't work that well for a bunch of reasons. For one it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* it be available should a query or
another program need lots of memory.
Yep. I was thinking about a dedicated database server, not a shared one.
Overall, I'm not convinced that the practice of relying on the OS cache is a
good one, given what it does with it, at least on Linux.
The alternatives aren't super realistic near-term though. Using direct
IO efficiently on the set of operating systems we support is
*hard*. [...]
Sure. This is not necessarily what I had in mind.
Currently pg "write"s stuff to the OS, and then suddenly calls "fsync" out
of the blue, hoping that in between the OS will actually have done a good
job with the underlying hardware. This is pretty naive, the fsync
generates write storms, and the database is offline: trying to improve
these things is the motivation for this patch.
Now if you think of the bgwriter, it does pretty much the same, and
may well generate plenty of random I/Os, because the underlying
LRU/LFU heuristic used to select buffers does not care about the file
structure.
So I think that to get good performance the database must take some
control over the OS. That does not mean that direct I/O needs to be
involved, although maybe it could, but this patch shows that it is not
needed to improve things.
Now, if someone could provide a dedicated box with posix_fadvise (say
FreeBSD, maybe others...) for testing, that would allow providing data
instead of speculating... and then maybe deciding to change its default
value.
Testing, as an approximation, how it turns out to work on Linux would be
a good step.
Do you mean testing with posix_fadvise on Linux?
I did think about it, but the documented behavior of this call on Linux is
disappointing: if the buffer has been written to disk, it is freed by the
OS. If not, nothing is done. Given that the flush is called pretty soon
after the writes, mostly the buffer will not have been written to disk yet,
and the call would just be a no-op... So I concluded that there is no
point in trying that on Linux because it will have no effect other than
losing some time, IMO.
Really, a useful test would be FreeBSD, where posix_fadvise does move
things to disk, although the actual offsets & length are ignored, but I do
not think that it would be a problem. I do not know about other systems
and what they do with posix_fadvise.
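If it helps to check the clean-vs-dirty behavior described above, here is
a minimal standalone probe (hypothetical test code, not part of the patch;
it assumes 4 kB pages and a glibc-style mincore() taking unsigned char *,
which differs on BSD). mincore() on a shared file mapping reports page
cache residency, so running it once plain (dirty page) and once with
"fsync" (clean page) shows whether DONTNEED evicted the page:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     do_fsync = (argc > 1 && strcmp(argv[1], "fsync") == 0);
    char    buf[4096];
    unsigned char vec = 0;
    void   *map;
    int     fd = open("probe.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);

    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != sizeof(buf))  /* dirty one page */
        return 1;
    if (do_fsync)
        fsync(fd);              /* ... or write it out, making it clean */

    posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED);

    map = mmap(NULL, sizeof(buf), PROT_READ, MAP_SHARED, fd, 0);
    mincore(map, sizeof(buf), &vec);
    printf("%s page: %s resident after DONTNEED\n",
           do_fsync ? "clean" : "dirty", (vec & 1) ? "still" : "not");

    munmap(map, sizeof(buf));
    close(fd);
    unlink("probe.dat");
    return 0;
}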
--
Fabien.
Attachments:
checkpoint-continuous-flush-9-a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting allows grouping together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSD backends
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for terminal data storage
+ <xref linkend="guc-checkpoint-sort"> allows sorting pages
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <value>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..ca295f1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,129 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* Compare checkpoint buffers.
+ */
+static int bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file", compare block number
+ * which are mapped on different segments depending on the number.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/* return the next buffer to write, or -1.
+ * this function balances buffers over tablespaces.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. tablespace ratio <= overall ratio).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while (/* compare tablespace vs overall progress ratio:
+ * tablespace written/to_write > overall written/to_write
+ */
+ (int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /* Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* Update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index+1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1701,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1738,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1762,111 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Sort buffer ids to help find sequential writes.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Note: Buffers are not locked in any way during sorting, but that's ok:
+ * A change in the buffer header is only relevant when it changes the
+ * buffer's identity. If the identity has changed it'll have been
+ * written out by BufferAlloc(), so there's no need for checkpointer to
+ * write it out anymore. The buffer might also get written out by a
+ * backend or bgwriter, but that's equally harmless.
+ *
+ * Marked buffers must not be moved during the checkpoint.
+ * Also, qsort implementation should be resilient to occasional
+ * contradictions (cmp(a,b) != -cmp(b,a)) because of possible
+ * concurrent changes.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
+ /*
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
+ *
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of (active) spaces, which will thus reach 0.
+ */
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1880,45 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so, another
+ * active tablespace status is moved in place of the current
+ * one and the next round will start on this one, or maybe round about.
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..cf1e505 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..7fde0dc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,22 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * Maybe the sort criterion could be compacted to reduce memory requirement
+ * and for faster comparison?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
checkpoint-continuous-flush-9-b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cec243..917b2fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,17 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput. This feature probably brings no benefit on SSD,
+ as the I/O write latency is small on such hardware, thus it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ca295f1..3bd2043 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1708,6 +1712,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1793,10 +1798,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1813,6 +1820,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1882,7 +1895,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1892,7 +1906,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1908,6 +1923,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1918,6 +1940,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2165,7 +2189,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2242,7 +2267,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2283,7 +2309,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2545,9 +2571,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2636,7 +2669,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3058,7 +3093,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3092,7 +3129,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3144,7 +3181,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..daf03e4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /* Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /* Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* Same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* file has changed; actually flush previous file before restarting
+ * to accumulate flushes
+ */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cf1e505..94b0d5b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,17 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9806,6 +9818,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e84f380..a5495da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..4fd3ff5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,14 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c740ee7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,22 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/* FileFlushContext:
+ * This structure is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offset)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext{
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +86,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
On Tue, Aug 18, 2015 at 1:02 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Andres,
[...] posix_fadvise().
My current thinking is "maybe yes, maybe no" :-), as it may depend on the
OS implementation of posix_fadvise, so it may differ between OSes.

As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.

Yep, probably.
Note that we are talking about checkpoints, which "write" buffers out
*but* keep them nevertheless. As the buffer is kept, the OS page is a
duplicate, and freeing it should not harm, at least immediately.
This theory could make sense if we can predict in some way that
the data we are flushing out of the OS cache won't be needed soon.
After the flush, we can only rely, to an extent, on the data being found
in shared_buffers if its usage_count is high; otherwise it could be
replaced at any moment by a backend needing the buffer when there is no
free buffer. One way to think about it is that if the usage_count is
low, it is okay to assume the page won't be needed in the near
future; however, I don't think relying only on usage_count for such a thing
is a good idea.
To sum up, I agree that it is indeed possible that flushing with
posix_fadvise could reduce read OS-memory hits on some systems for some
workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm
fine with that.
One point to think about here is on what basis a user can decide to turn
this option on: is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.
In general, providing an option is a good idea if the user can easily
decide when to use it, or if we can give some clear recommendation.
Otherwise we have to recommend: test your workload with this option,
and if it works then great, else don't use it. That might be okay in
some cases, but it is better to be clear.
One minor point: while glancing through the patch, I noticed that a couple
of multiline comments are not written in the style usually used
in the code (keep the first line empty).
+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
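For reference, the usual style keeps the first comment line empty, i.e.
something like (this is the form the v10 patch uses):

  /*
   * Status of buffers to checkpoint for a particular tablespace,
   * used internally in BufferSync.
   */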
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
So the option is best kept as "off" for now, without further data, I'm
fine with that.

One point to think about here is on what basis a user can decide to turn
this option on: is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.
Yep.
In general, providing an option is a good idea if the user can easily
decide when to use it, or if we can give some clear recommendation.
Otherwise we have to recommend: test your workload with this option,
and if it works then great, else don't use it. That might be okay in
some cases, but it is better to be clear.
My opinion, which is not backed by any data (anyone should feel free to
provide a FreeBSD box for testing...), is that turning the flush option
on would mostly be an improvement under a significant write load, when
running on non-Linux systems which provide posix_fadvise.
If you have a lot of reads and few writes, then postgresql currently works
reasonably well, which is why people do not complain too much about
write stalls, and I expect that the situation would not be significantly
degraded.
Now there are competing positive and negative effects induced by using
posix_fadvise, and moreover its implementation varies from OS to OS, so
without running some experiments it is hard to be definite.
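To make the difference between the two hints concrete, here is a minimal
standalone sketch, outside of pg (the file name and the 1 MB range are
made up for illustration):

  /* build: gcc -Wall hint.c -o hint */
  #define _GNU_SOURCE             /* for sync_file_range on Linux */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int
  main(void)
  {
      int         fd = open("data.bin", O_WRONLY);
      off_t       offset = 0;
      off_t       nbytes = 1024 * 1024;   /* flush a 1 MB range */

      if (fd < 0)
          return 1;

  #if defined(__linux__)
      /* Linux: push the dirty range to the IO layer, keep the page cache */
      if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) < 0)
          perror("sync_file_range");
  #else
      /*
       * Portable fallback: DONTNEED asks the OS to drop the cached pages,
       * which may force dirty ones to be written out first, but a later
       * read has to fetch them back from disk.
       */
      {
          int         rc = posix_fadvise(fd, offset, nbytes,
                                         POSIX_FADV_DONTNEED);

          if (rc != 0)
              fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
      }
  #endif

      close(fd);
      return 0;
  }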
One minor point, while glancing through the patch, I noticed that couple
of multiline comments are not written in the way which is usually used
in code (Keep the first line as empty).
Indeed.
Please find attached a v10, where I have reviewed comments for style &
contents, and also slightly extended the documentation about the flush
option to hint that it is essentially useful for high write loads. Without
further data, I think it is not obvious how to give more definite advice.
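Concretely, on a write-loaded HDD host, the settings discussed in this
thread would be set along these lines (values are only illustrative;
the defaults shown are those of the patch):

  # postgresql.conf excerpt
  checkpoint_sort = on                 # sort buffers on checkpoint (default on)
  checkpoint_flush_to_disk = on        # default: on for Linux, off otherwise
  checkpoint_completion_target = 0.8   # spread writes over the interval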
--
Fabien.
Attachments:
checkpoint-continuous-flush-10-a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ On HDD storage, this setting groups together neighboring
+ pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSD backends,
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for the underlying data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <value>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..8caf774 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,130 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* checkpoint buffers comparison */
+static int bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file": compare block numbers,
+ * which are mapped onto different segments depending on the number.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/*
+ * Entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/*
+ * Return the next buffer to write, or -1.
+ * this function balances buffers over tablespaces, see comment inside.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while ((int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace-specific buffer scan
+ * where it left off.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /*
+ * Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index + 1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1702,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1739,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1763,97 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /* sort buffer ids to help find sequential writes */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over the tablespaces so that writes are
+ * balanced and buffer writes move forward roughly proportionally
+ * for each tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1867,46 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so,
+ * another active tablespace status is moved in place of the current
+ * one, and the next round will start from there.
+ *
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..cf1e505 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..e84f380 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..32f2006 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,23 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible. Maybe the sort criterion could be compacted
+ * to reduce the memory requirement and allow faster comparisons?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* hm... enum with only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
checkpoint-continuous-flush-10-b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cec243..917b2fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..1b658f2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,18 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput on some OSes. It should be beneficial for high write
+ loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+ write latency is small on such hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8caf774..436ead2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1709,6 +1713,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1794,10 +1799,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1814,6 +1821,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1869,7 +1882,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1879,7 +1893,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1896,6 +1911,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1906,6 +1928,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2153,7 +2177,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2230,7 +2255,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2271,7 +2297,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2533,9 +2559,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter hints to the OS that a high-priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2624,7 +2657,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3046,7 +3081,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3080,7 +3117,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3132,7 +3169,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..e880a9e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /*
+ * Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* other file: do flush previous file & reset flush accumulator */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is really done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cf1e505..9219330 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,17 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9806,6 +9818,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e84f380..a5495da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on for Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..4fd3ff5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,14 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c7b2a6d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,24 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FileFlushContext structure:
+ *
+ * This is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offsets)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext {
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +88,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
On Tue, Aug 18, 2015 at 12:38 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
So the option is best kept as "off" for now, without further data, I'm
fine with that.
One point to think about here is on what basis the user can decide to turn
this option on; is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.
Yep.
In general, providing an option is a good idea if the user can decide with
ease when to use it, or if we can give some clear recommendation for the
same; otherwise one has to recommend "test your workload with this option,
and if it works then great, else don't use it", which might also be okay
in some cases, but it is better to be clear.
My opinion, which is not backed by any data (anyone can feel free to
provide a FreeBSD box for testing...) is that it would mostly be an
improvement, if you have a significant write load, to have the flush option
on when running on non-Linux systems which provide posix_fadvise.
If you have a lot of reads and few writes, then postgresql currently works
reasonably enough, which is why people do not complain too much about write
stalls, and I expect that the situation would not be significantly degraded.
Now there are competing positive and negative effects induced by using
posix_fadvise, and moreover its implementation varies from OS to OS, so
without running some experiments it is hard to be definite.
Sure, I think what can help here is a testcase or testcases (in the form of
a script file or some other form, to test this behaviour of the patch) which
you can write and post here, so that others can use that to get the data and
share it.
Of course, that is not mandatory to proceed with this patch, but it can
still help you to prove your point, as you might not have access to
different kinds of systems to run the tests.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Sure, I think what can help here is a testcase or testcases (in the form of
a script file or some other form, to test this behaviour of the patch) which
you can write and post here, so that others can use that to get the data and
share it.
Sure... note that I already did that on this thread, without any echo...
but I can do it again...
Tests should be run on a dedicated host. If it has n cores, I suggest
sharing them between the postgres checkpointer & workers and the pgbench
threads, so as to avoid competition for cores between threads. With 8 cores
I used up to 2 threads & 4 clients, so that there are 2 cores left for the
checkpointer and other stuff (i.e. I also run iotop & htop in parallel...).
Although it may seem conservative to do so, I think that the point of the
test is to exercise checkpoints and not to test the process scheduler of
the OS.
Here are the latest version of my test scripts:
(1) cp_test.sh <name> <test>
Run "test" with setup "name". Currently it runs a 4000-second pgbench with
the 4 possible on/off combinations for sorting & flushing, after some
warmup. The 4000-second duration is chosen so that there are a few
checkpoint cycles. For larger checkpoint timeouts, I suggest extending the
run time so as to see at least 3 checkpoints during the run.
More test settings can be added to the two "case" statements. Postgres
settings, especially shared_buffers, should be set to a pertinent value wrt
the memory of the test host.
The tests run with the postgres version found in the PATH, so ensure that
the right version is found!
(2) cp_test_count.py one-test-output.log
For rate-limited runs, look at the final figures and compute the number of
late & skipped transactions. This can also be done by hand.
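For reference, a minimal sketch of what such a counting script might look
like (my own hypothetical reconstruction, not the attached cp_test_count.py;
the summary line formats assumed are those pgbench prints for -R/-L runs):

#!/usr/bin/env python
# hypothetical sketch, not the attached cp_test_count.py: sum the
# "skipped" and "above the latency limit" counts from the final
# report of a "pgbench -R ... -L ..." run, read on standard input.
import re
import sys

processed = skipped = late = 0
for line in sys.stdin:
    m = re.search(r'transactions actually processed: (\d+)', line)
    if m:
        processed = int(m.group(1))
    m = re.search(r'transactions skipped: (\d+)', line)
    if m:
        skipped = int(m.group(1))
    m = re.search(r'ms latency limit: (\d+)', line)
    if m:
        late = int(m.group(1))

total = processed + skipped
if total:
    print("%.1f %% late or skipped (%d of %d)" %
          (100.0 * (late + skipped) / total, late + skipped, total))

Something like "sketch.py < one-test-output.log" then prints the percentage
directly.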
(3) avg.py
For full-speed runs, compute stats about the per-second tps:
sh> grep 'progress:' one-test-output.log | cut -d' ' -f4 | \
./avg.py --limit=10 --length=4000
warning: 633 missing data, extending with zeros
avg over 4000: 199.290575 ± 512.114070 [0.000000, 0.000000, 4.000000, 5.000000, 2280.900000]
percent of values below 10.0: 82.5%
The figures I reported are the 199 (average tps), 512 (standard deviation
of the per-second figures), and 82.5% (percent of time below 10 tps, aka
postgres is basically unresponsive). In brackets: the min, q1, median, q3
and max tps seen in the run.
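For those who prefer code to prose, here is a simplified stand-in for
avg.py (my own sketch, with the --limit and --length options hard-coded)
showing the statistics it computes:

#!/usr/bin/env python
# simplified stand-in for avg.py: one tps value per input line;
# missing seconds are padded with zeros up to LENGTH, then print
# the average, standard deviation, five-number summary, and the
# percent of seconds below LIMIT (postgres basically unresponsive).
import sys

LIMIT, LENGTH = 10.0, 4000  # avg.py takes these as --limit/--length

tps = [float(line) for line in sys.stdin if line.strip()]
if len(tps) < LENGTH:
    print("warning: %d missing data, extending with zeros" % (LENGTH - len(tps)))
    tps += [0.0] * (LENGTH - len(tps))

n = len(tps)
avg = sum(tps) / n
stddev = (sum((x - avg) ** 2 for x in tps) / n) ** 0.5
s = sorted(tps)
summary = [s[0], s[n // 4], s[n // 2], s[3 * n // 4], s[-1]]
below = 100.0 * sum(1 for x in tps if x < LIMIT) / n

print("avg over %d: %f +- %f %s" % (LENGTH, avg, stddev, summary))
print("percent of values below %.1f: %.1f%%" % (LIMIT, below))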
Of course, that is not mandatory to proceed with this patch, but it can
still help you to prove your point, as you might not have access to
different kinds of systems to run the tests.
I agree that more tests would be useful to decide which default value for
the flushing option is better. For Linux, all tests so far suggest that
"on" is the best choice, but for other systems that use posix_fadvise it
is really an open question.
Another option would be to give me temporary access to some available
host; I'm used to running these tests...
--
Fabien.
On Wed, Aug 19, 2015 at 12:13 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Sure, I think what can help here is a testcase or testcases (in the form of
a script file or some other form, to test this behaviour of the patch) which
you can write and post here, so that others can use that to get the data and
share it.
Sure... note that I already did that on this thread, without any echo...
but I can do it again...
Thanks.
I have tried your scripts and found a problem while using the avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
--limit=10 --length=300
: No such file or directory
I didn't get a chance to poke into the avg.py script (the command without
avg.py works fine). The Python version on the m/c I planned to test on is
Python 2.7.5.
Today while reading the first patch (checkpoint-continuous-flush-10-a),
I have given some thought to the below part of the patch, which I would
like to share with you.
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write
Here, I think the above calculation can go for a toss if a backend or the
bgwriter starts writing buffers while a checkpoint is in progress. The
tablespace "written" count won't be able to account for the ones written
by backends or the bgwriter. Now it may not be a big thing to worry about,
but I find Heikki's version worth considering: he has not changed the
overall idea of this patch, but the calculations are somewhat simpler and
hence there is less chance of going wrong.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
I have tried your scripts and found a problem while using the avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
--limit=10 --length=300
: No such file or directory
I didn't get a chance to poke into the avg.py script (the command without
avg.py works fine). The Python version on the m/c I planned to test on is
Python 2.7.5.
Strange... What does "/usr/bin/env python" say? Can the script be started
on its own at all? I think that the script should work both with python2
and python3, at least it does on my laptop...
Today while reading the first patch (checkpoint-continuous-flush-10-a),
I have given some thought to the below part of the patch, which I would
like to share with you.
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write
Here, I think the above calculation can go for a toss if a backend or the
bgwriter starts writing buffers while a checkpoint is in progress. The
tablespace "written" count won't be able to account for the ones written
by backends or the bgwriter.
Sure... This is *already* the case with the current checkpointer: the
schedule is performed with respect to the initial number of buffers it
thinks it will have to write, and if someone else writes these buffers then
the schedule is skewed a little bit, or more... I have not changed this
logic, but I extended it to handle several tablespaces.
If this (the checkpointer progress evaluation used for its schedule is
sometimes wrong because of other writes) is proven to be a major
performance issue, then the processes which write the checkpointed
buffers behind its back should tell the checkpointer about it, probably
with some shared data structure, so that the checkpointer can adapt its
schedule.
This is an independent issue that may be worth addressing some day. My
opinion is that when the bgwriter or backends kick in to write buffers,
they are basically generating random I/Os on HDD and killing tps and
latency, so it is a very bad time anyway; thus I'm not sure that this is
the next problem to address to improve pg performance and responsiveness.
Now it may not be a big thing to worry about, but I find Heikki's version
worth considering: he has not changed the overall idea of this patch, but
the calculations are somewhat simpler and hence there is less chance of
going wrong.
I do not think that Heikki's version worked wrt balancing writes over
tablespaces, and I'm not sure it worked at all. However I reused some of
his ideas to simplify and improve the code.
--
Fabien.
On Sun, Aug 23, 2015 at 12:33 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
I have tried your scripts and found a problem while using the avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
--limit=10 --length=300
: No such file or directory
I didn't get a chance to poke into the avg.py script (the command without
avg.py works fine). The Python version on the m/c I planned to test on is
Python 2.7.5.
Strange... What does "/usr/bin/env python" say?
Python 2.7.5 (default, Apr 9 2015, 11:07:29)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Can the script be started on its own at all?
I have tried like below, which results in the same error; I also tried a
few other variations but could not succeed.
./avg.py
: No such file or directory
Here, I think the above calculation can go for a toss if a backend or the
bgwriter starts writing buffers while a checkpoint is in progress. The
tablespace "written" count won't be able to account for the ones written
by backends or the bgwriter.
Sure... This is *already* the case with the current checkpointer: the
schedule is performed with respect to the initial number of buffers it
thinks it will have to write, and if someone else writes these buffers then
the schedule is skewed a little bit, or more... I have not changed this
logic, but I extended it to handle several tablespaces.
I don't know how good or bad it is to build further on somewhat skewed
logic, but the point is that unless it is required, why use it?
I do not think that Heikki's version worked wrt balancing writes over
tablespaces,
I also think that it doesn't balance over tablespaces, but the question
is why we need to balance over tablespaces: can we reliably predict in
some way that balancing over tablespaces can help the workload? I think
here we are doing more engineering than required for this patch.
and I'm not sure it worked at all.
Okay, his version might have some bugs, but then those could be
fixed as well.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
Can the script be started on its own at all?
I have tried like below, which results in the same error; I also tried a
few other variations but could not succeed.
./avg.py
Hmmm... Ensure that the script is readable and executable:
sh> chmod a+rx ./avg.py
Also check the file:
sh> file ./avg.py
./avg.py: Python script, UTF-8 Unicode text executable
Sure... This is *already* the case with the current checkpointer: the
schedule is performed with respect to the initial number of buffers it
thinks it will have to write, and if someone else writes these buffers then
the schedule is skewed a little bit, or more... I have not changed this
I don't know how good or bad it is to build further on somewhat skewed
logic,
The logic is no more skewed than it is with the current version: your
remark about the estimation which may be wrong in some cases is clearly
valid, but it is orthogonal (independent, unrelated, different) to what is
addressed by this patch.
I currently have no reason to believe that the issue you raise is a major
performance issue, but if so it may be addressed by another patch by
whoever wants to do so.
What I have done is to demonstrate that generating a lot of random I/Os is
a major performance issue (well, sure), and this patch addresses this
point and provides major speedups (x3-5) and latency reductions (from +60%
unavailability to nearly full availability) for high OLTP write loads, by
reordering and flushing checkpoint buffers in a sensible way.
but the point is that unless it is required, why use it?
This is really required to avoid predictable performance regressions, see
below.
I do not think that Heikki's version worked wrt balancing writes over
tablespaces,
I also think that it doesn't balance over tablespaces, but the question
is why we need to balance over tablespaces: can we reliably predict in
some way that balancing over tablespaces can help the workload?
The reason for the tablespace balancing is that in the current postgres
buffers are written more or less randomly, so writes are (probably)
implicitly and statistically balanced over tablespaces because of this
randomness, and indeed, AFAIK, people with multi-tablespace setups have
not complained that postgres was using the disks sequentially.
However, once the buffers are sorted per file, the order becomes
deterministic and there is no implicit balancing anymore, which means that
if someone has a pg setup with several disks it will write sequentially to
these instead of in parallel.
This regression was pointed out by Andres Freund; I agree that such a
regression for high-end systems must be avoided, hence the tablespace
balancing.
I think here we are doing more engineering than required for this patch.
I do not think so; I think that Andres' remark is justified to avoid a
performance regression on high-end systems which use tablespaces, which
would be really undesirable.
About the balancing code, it is not that difficult, even if it is not
trivial: the point is to select a tablespace for which the progress ratio
(written/to_write) is below the overall progress ratio, so that it catches
up, and to do so in a round-robin manner, so that all tablespaces get to
write things. I have also both written a proof and tested the logic (in
a separate script).
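To make the selection rule concrete, here is a small toy model of it (the
function name and driver loop are mine, not the patch's): a tablespace is
skipped while its written/to_write ratio is strictly ahead of the overall
ratio, the comparison being done with cross products so that no fractions
are needed; the patch does the same with 64-bit integers to avoid overflow.

# toy model of the tablespace selection rule, not the patch's code
def next_tablespace(status, space, num_written, num_to_write):
    """status: list of [written, to_write] pairs, one per tablespace."""
    # skip spaces in advance: space written/to_write > overall ratio
    while status[space][0] * num_to_write > num_written * status[space][1]:
        space = (space + 1) % len(status)  # round robin
    return space

# two tablespaces with a 3:1 buffer ratio: writes interleave roughly 3:1
status, space, written, total = [[0, 30], [0, 10]], 0, 0, 40
order = []
while written < total:
    space = next_tablespace(status, space, written, total)
    status[space][0] += 1
    written += 1
    order.append(space)
print(order)

Running it shows about three writes on the first tablespace for each write
on the second, which is the proportional progress argued for above.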
--
Fabien.
On Mon, Aug 24, 2015 at 4:15 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
[stuff]
Moved to next CF 2015-09.
--
Michael
On 2015-08-18 09:08:43 +0200, Fabien COELHO wrote:
Please find attached a v10, where I have reviewed comments for style &
contents, and also slightly extended the documentation about the flush
option to hint that it is essentially useful for high write loads. Without
further data, I think it is not obvious to give more definite advice.
v10b misses the checkpoint_sort part of the patch, and thus cannot be applied.
Andres
Hello Andres,
Please find attached a v10, where I have reviewed comments for style &
contents, and also slightly extended the documentation about the flush
option to hint that it is essentially useful for high write loads.
Without further data, I think it is not obvious to give more definite
advice.
v10b misses the checkpoint_sort part of the patch, and thus cannot be
applied.
Yes, indeed, the second part is expected to be applied on top of v10a.
Please find attached the cumulated version (v10a + v10b).
--
Fabien.
Attachments:
checkpoint-continuous-flush-10.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e3dc23b..927294b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting allows grouping together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSD backends
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
@@ -2475,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..1b658f2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,30 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for the final data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting to the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput on some OS. It should be beneficial for high write
+ loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+ write latency is small on such hardware, so it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8048,8 +8054,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8057,6 +8063,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or none */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..436ead2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,9 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +98,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -396,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -409,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1018,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1561,6 +1569,130 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* checkpoint buffers comparison */
+static int bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file": compare block numbers,
+ * which are mapped to different segments depending on the number.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/*
+ * Entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/*
+ * Return the next buffer to write, or -1.
+ * This function balances buffers over tablespaces; see the comment inside.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while ((int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /*
+ * Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index + 1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1706,14 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1744,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1768,105 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status & flush context arrays */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /* sort buffer ids to help find sequential writes */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over tablespaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, marking this tablespace's scan as done and
+ * decrementing the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1880,57 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so, the
+ * another active tablespace status is moved in place of the current
+ * one and the next round will start on this one, or maybe round about.
+ *
+ * Note: maybe an exchange could be made instead in order to keep
+ * informations about the closed table space, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
@@ -1939,7 +2177,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state =
+ SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2016,7 +2255,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2057,7 +2297,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2319,9 +2559,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter hints to the OS that a high priority write is meant,
+ * possibly because I/O throttling is already managed elsewhere.
+ * The last parameter holds the current flush context, which accumulates flush
+ * requests to be performed in one call, instead of being performed on a
+ * buffer-per-buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2410,7 +2657,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -2832,7 +3081,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2866,7 +3117,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2918,7 +3169,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..e880a9e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /*
+ * Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " of " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* other file: do flush previous file & reset flush accumulator */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it is actually done is chosen by the OS.
+ * Depending on other disk activity this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing, pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..9219330 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1013,6 +1014,28 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
@@ -9795,6 +9818,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 695a88f..01b1c96 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,9 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..32f2006 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,23 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible. Maybe the sort criterion could be compacted
+ * to reduce the memory requirement and allow faster comparisons?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* hm... enum with only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..4fd3ff5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -55,6 +55,15 @@ extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
+extern bool checkpoint_sort;
+
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c7b2a6d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,24 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FileFlushContext structure:
+ *
+ * This is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offsets)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext {
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +88,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
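For illustration, the technique can be demonstrated outside of pg with a
minimal standalone program (a Linux-only sketch; file name and all details
are invented, error handling is minimal; the real patch drives the same
logic through FileFlushContext and also flushes from CheckpointWriteDelay):

#define _GNU_SOURCE		/* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct { int fd; int ncalls; off_t offset; off_t nbytes; } FlushCtx;

/* push the accumulated range to the I/O layer, without waiting */
static void flush_ctx(FlushCtx *c)
{
	if (c->ncalls > 0 &&
		sync_file_range(c->fd, c->offset, c->nbytes,
						SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");
	c->ncalls = 0;
}

/* merge [offset, offset + nbytes) into the context; on a file change,
 * flush what was accumulated for the previous file first */
static void async_flush(FlushCtx *c, int fd, off_t offset, off_t nbytes)
{
	if (c->ncalls > 0 && c->fd == fd)
	{
		off_t lo = offset < c->offset ? offset : c->offset;
		off_t hi1 = c->offset + c->nbytes;
		off_t hi2 = offset + nbytes;

		c->nbytes = (hi1 > hi2 ? hi1 : hi2) - lo;
		c->offset = lo;
		c->ncalls++;
	}
	else
	{
		flush_ctx(c);
		c->fd = fd;
		c->ncalls = 1;
		c->offset = offset;
		c->nbytes = nbytes;
	}
}

int main(void)
{
	char page[BLCKSZ];
	FlushCtx ctx = { 0, 0, 0, 0 };
	int fd = open("demo.dat", O_CREAT | O_WRONLY, 0600);
	int i;

	memset(page, 'x', sizeof(page));
	/* sorted writes, so the flush ranges merge into one large range */
	for (i = 0; i < 128; i++)
	{
		if (pwrite(fd, page, BLCKSZ, (off_t) i * BLCKSZ) != BLCKSZ)
			perror("pwrite");
		async_flush(&ctx, fd, (off_t) i * BLCKSZ, BLCKSZ);
	}
	flush_ctx(&ctx);	/* flush the last accumulated range */
	close(fd);
	return 0;
}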
On 2015-08-27 14:32:39 +0200, Fabien COELHO wrote:
v10b misses the checkpoint_sort part of the patch, and thus cannot be applied.
Yes, indeed, the second part is expected to be applied on top of v10a.
Oh, sorry. I'd somehow assumed they were two variants of the same patch
(one with "slim" sorting and the other without).
The idea was that these two features could be committed separately.
However, experiments show that flushing is really efficient when sorting
is done first, and moreover the two features conflict, so I've made two
dependent patches.
--
Fabien.
On Mon, Aug 24, 2015 at 12:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Also check the file:
sh> file ./avg.py
./avg.py: Python script, UTF-8 Unicode text executable
There were some CRLF line terminators; after removing those, it worked
fine. Here are the results of some of the tests done for the sorting patch
(checkpoint-continuous-flush-10-a):
Config Used
----------------------
M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
Test details
------------------
warmup=60
scale=300
max_connections=150
shared_buffers=8GB
checkpoint_timeout=2min
time=7200
synchronous_commit=on
max_wal_size=5GB
parallelism - 128 clients, 128 threads
Sort - off
avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000,
10975.500000, 13105.950000, 21729.000000]
percent of values below 10.0: 19.5%
Sort - on
avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000,
10946.000000, 13084.000000, 20289.900000]
percent of values below 10.0: 18.6%
Before going to conclusion, let me try to explain above data (I am
explaining again even though Fabien has explained, to make it clear
if someone has not read his mail)
Let's try to understand with data for sorting - off option
avg over 7200: 8256.382528 ± 6218.769282
8256.382528 - average tps for 7200s pgbench run
6218.769282 - standard deviation on per second figures
[0.000000, 76.050000, 10975.500000, 13105.950000, 21729.000000]
These 5 values can be read as minimum TPS, q1, median TPS, q3,
maximum TPS over the 7200s pgbench run. As far as I understand, q1 and q3
are the medians of the lower and upper halves of the values, which I didn't
focus on much.
percent of values below 10.0: 19.5%
Above means percent of time the result is below 10 tps.
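(For reference, a minimal sketch in C of how such a summary can be computed
from the per-second tps values; avg.py itself is not shown in this thread
and may use a different quartile interpolation, and the sample values below
are invented:)

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *pa, const void *pb)
{
	double a = *(const double *) pa, b = *(const double *) pb;

	return (a > b) - (a < b);
}

/* print "avg over n: avg ± stddev [min, q1, median, q3, max]"
 * and the percentage of values below the given limit */
static void summarize(double *tps, int n, double limit)
{
	double sum = 0.0, sumsq = 0.0, avg, stddev;
	int below = 0, i;

	for (i = 0; i < n; i++)
	{
		sum += tps[i];
		sumsq += tps[i] * tps[i];
		if (tps[i] < limit)
			below++;
	}
	avg = sum / n;
	stddev = sqrt(sumsq / n - avg * avg);

	qsort(tps, n, sizeof(double), cmp_double);
	printf("avg over %d: %f ± %f [%f, %f, %f, %f, %f]\n",
		   n, avg, stddev,
		   tps[0], tps[n / 4], tps[n / 2], tps[3 * n / 4], tps[n - 1]);
	printf("percent of values below %.1f: %.1f%%\n",
		   limit, 100.0 * below / n);
}

int main(void)
{
	double sample[] = { 0.0, 3.5, 50.0, 8600.0, 9400.0, 9700.0,
						10200.0, 11000.0 };

	summarize(sample, 8, 10.0);
	return 0;
}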
Now about the test results: these tests are done for pgbench full speed
runs, and the above results indicate that there is approximately a 1.5%
improvement in avg. TPS and about 1 percentage point fewer per-second
values below 10 tps with sorting on, and there is almost no improvement in
median or maximum TPS values; instead they are slightly less when sorting
is on, which could be due to run-to-run variation.
I have done more tests as well by varying time and number of clients
keeping other configuration same as above, but the results are quite
similar.
The results of the sorting patch for the tests done indicate that the win
is not big enough with just doing sorting during checkpoints; we should
consider the flush patch along with sorting. I would like to perform some
tests with both patches together (sort + flush), unless somebody else
thinks that the sorting patch alone is beneficial and we should test some
other kinds of scenarios to see its benefit.
The reason for the tablespace balancing is that in the current postgres
buffers are written more or less randomly, so it is (probably) implicitly
and statistically balanced over tablespaces because of this randomness, and
indeed, AFAIK, people with multi-tablespace setups have not complained that
postgres was using the disks sequentially.
However, once the buffers are sorted per file, the order becomes
deterministic and there is no more implicit balancing, which means that if
someone has a pg setup with several disks it will write sequentially on
these instead of in parallel.
What if tablespaces are not on separate disks, or there is not enough
hardware support to make writes parallel? I think for such cases it might
be better to do it sequentially.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
Wow! Thanks for trying the patch on such high-end hardware!
About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
What is the OS? The FS?
warmup=60
Quite short, but probably ok.
scale=300
Means about 4-4.5 GB base.
time=7200
synchronous_commit=on
shared_buffers=8GB
This is small wrt hardware, but given the scale setup I think that it
should not matter much.
max_wal_size=5GB
Hmmm... Maybe quite small given the average performance?
checkpoint_timeout=2min
This seems rather small. Are the checkpoints xlog or time triggered?
You did not update checkpoint_completion_target, which means 0.5, so that
the checkpoint is scheduled to run in at most 1 minute, which suggests at
least 130 MB/s write performance for the checkpoint (up to 8 GB of shared
buffers written in about 60 seconds, i.e. 8192 MB / 60 s ≈ 136 MB/s).
parallelism - 128 clients, 128 threads
Given 192 hw threads, I would have tried using 128 clients & 64 threads, so
that each pgbench client has its own dedicated postgres backend, and
that postgres processes are not competing with pgbench. Now as pgbench is
mostly sleeping, probably that does not matter much... I may also be
totally wrong:-)
Sort - off
avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000,
10975.500000, 13105.950000, 21729.000000]
percent of values below 10.0: 19.5%
The max performance is consistent with 128 threads * 200 (random) writes
per second, i.e. about 25,600 tps, close to the observed ~21,700 maximum.
Sort - on
avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000,
10946.000000, 13084.000000, 20289.900000]
percent of values below 10.0: 18.6%
This is really a small improvement, probably within the measurement error.
I would not put much trust in 1.5% tps or 0.9% availability improvements.
I think that we could conclude that on your (great) setup, with these
configuration parameter, this patch does not harm performance. This is a
good thing, even if I would have hoped to see better performance.
Before going to conclusion, let me try to explain above data (I am
explaining again even though Fabien has explained, to make it clear
if someone has not read his mail)
Let's try to understand with data for sorting - off option
avg over 7200: 8256.382528 ± 6218.769282
8256.382528 - average tps for 7200s pgbench run
6218.769282 - standard deviation on per second figures
[0.000000, 76.050000, 10975.500000, 13105.950000, 21729.000000]
These 5 values can be read as minimum TPS, q1, median TPS, q3,
maximum TPS over 7200s pgbench run. As far as I understand q1
and q3 median of subset of values which I didn't focussed much.
q1 = 76 means that 25% of the time the performance was below 76 tps, about
1% of the average performance, which I would translate as "pg is pretty
unresponsive 25% of the time".
This is the kind of issue I really want to address, the eventual tps
improvements are just a side effect.
percent of values below 10.0: 19.5%
Above means percent of time the result is below 10 tps.
Which means "postgres is really unresponsive 19.5% of the time".
If you count zeros, you will get "postgres was totally unresponsive X% of
the time".
Now about the test results: these tests are done for pgbench full speed
runs, and the above results indicate that there is approximately a 1.5%
improvement in avg. TPS and about 1 percentage point fewer per-second
values below 10 tps with sorting on, and there is almost no improvement in
median or maximum TPS values; instead they are slightly less when sorting
is on, which could be due to run-to-run variation.
Yes, I agree.
I have done more tests as well by varying time and number of clients
keeping other configuration same as above, but the results are quite
similar.
Given the hardware, I would suggest raising checkpoint_timeout,
shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8.
I would expect that it should improve performance both with and without
sorting.
It would be interesting to have information from checkpoint logs
(especially how many buffers written in how long, whether checkpoints are
time or xlog triggered, ...).
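(For reference, with the patches applied the checkpoint completion line in
the server log has roughly this shape -- all values below are invented,
only the format comes from the patch:

LOG: checkpoint complete: wrote 12345 buffers (9.4%); 0 transaction log file(s) added, 0 removed, 64 recycled; sort=0.025 s, write=29.750 s, sync=1.500 s, total=31.500 s; sync files=32, longest=0.810 s, average=0.046 s; distance=123456 kB, estimate=123456 kB

It shows how many buffers were written in how long, and how the time was
split between sorting, writing, and the final fsyncs.)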
The results of sorting patch for the tests done indicate that the win is
not big enough with just doing sorting during checkpoints,
ISTM that you do too much generalization: the win is not big "under this
configuration and hardware".
I think that the patch may have very small influence under some
conditions, but should not degrade performance significantly, and on the
other hand it should provide great improvements under some (other)
conditions.
So having no performance degradation is a good result, even if I would
hope to get better results. It would be interesting to understand why
random disk writes do not perform too poorly on this box: size of I/O
queue, kind of (expensive:-) disks, write caches, file system, raid
level...
we should consider flush patch along with sorting.
I also think that it would be interesting.
I would like to perform some tests with both the patches together (sort
+ flush) unless somebody else thinks that sorting patch alone is
beneficial and we should test some other kinds of scenarios to see its
benefit.
Yep. Is it a Linux box? If not, does it support posix_fadvise()?
The reason for the tablespace balancing is [...]
What if tablespaces are not on separate disks
I would expect that it might very slightly degrade performance, but only
marginally.
or not enough hardware support to make Writes parallel?
I'm not sure that balancing or not writes over tablespaces would change
anything to an I/O bottleneck which is not the disk write performance, so
I would say "no impact" in that case.
I think for such cases it might be better to do it sequentially.
Writing sequentially to different disks would be a bug, and degrade
performance significantly on a setup with several disks, up to dividing
the performance by the number of disks... so I do not think that a patch
which predictably and significantly degrades performance on high-end
hardware is a reasonable option.
If you want to be able to deactivate balancing, it could be done with a
guc, but I cannot see good reasons to want to do that: it would complicate
the code and it does not make much sense to use many tablespaces on one
disk, while anyone who uses several tablespaces on several disks is
probably expecting to see her expensive disks actually used in parallel.
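(A worked example of the balancing: with two tablespaces holding t1 = 100
and t2 = 300 buffers to write, 400 overall, suppose w1 = 10 and w2 = 10
buffers have been written. The overall progress is 20/400 = 5%; tablespace
1 is at 10/100 = 10%, in advance, while tablespace 2 is at 10/300 ≈ 3.3%,
late, so the round robin skips tablespace 1 and picks the next buffer from
tablespace 2. In the long run each disk gets writes in proportion to its
share, here 1 write to disk 1 for every 3 to disk 2, concurrently.)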
--
Fabien.
On Mon, Aug 31, 2015 at 12:40 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
Wow! Thanks for trying the patch on such high-end hardware!
About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
Speed of Reads -
Timing cached reads: 27790 MB in 1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in 3.00 seconds = 1276.55 MB/sec
Copy speed -
dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s
What is the OS? The FS?
OS info -
Linux <m/c addr> 3.10.0-123.1.2.el7.ppc64 #1 SMP Wed Jun 4 15:23:17 EDT
2014 ppc64 ppc64 ppc64 GNU/Linux
FS - ext4
shared_buffers=8GB
This is small wrt hardware, but given the scale setup I think that it
should not matter much.
Yes, I was testing the case for Read-Write transactions when all the data
fits in shared_buffers, so this is okay.
max_wal_size=5GB
Hmmm... Maybe quite small given the average performance?
We can check with larger value, but do you expect some different
results and why?
checkpoint_timeout=2min
This seems rather small. Are the checkpoints xlog or time triggered?
I wanted to test by triggering more checkpoints, but I can test with
a larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?
You did not update checkpoint_completion_target, which means 0.5 so that
the checkpoint is scheduled to run in at most 1 minute, which suggest at
least 130 MB/s write performance for the checkpoint.
The value used in your script was 0.8 for checkpoint_completion_target
which I have not changed during tests.
parallelism - 128 clients, 128 threads
Given 192 hw threads, I would have tried using 128 clients & 64 threads,
so that each pgbench client has its own dedicated postgres in a thread, and
that postgres processes are not competing with pgbench. Now as pgbench is
mostly sleeping, probably that does not matter much... I may also be
totally wrong:-)
In the next run, I can use it with 64 threads; let's first settle on the
other parameters for which you expect there could be a clear win with the
first patch.
Given the hardware, I would suggest to raise checkpoint_timeout,
shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8.
I would expect that it should improve performance both with and without
sorting.
I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for the 300 scale factor data, and
checkpoint_completion_target was already 0.8 in my previous tests. Let's
try with checkpoint_timeout = 10 min and max_wal_size = 15GB; do you have
any other suggestion?
It would be interesting to have informations from checkpoint logs
(especially how many buffers written in how long, whether checkpoints are
time or xlog triggered, ...).
The results of sorting patch for the tests done indicate that the win is
not big enough with just doing sorting during checkpoints,
ISTM that you do too much generalization: The win is not big "under this
configuration and harware".
Hmm.. nothing like that, this was based on a couple of tests done by me,
and I am open to doing some more if you or anybody else feels that the
first patch (checkpoint-continuous-flush-10-a) alone can give a benefit;
in fact I have started these tests with the intention of seeing whether
the first patch gives a benefit, so that it could be evaluated and
eventually committed separately.
I think that the patch may have very small influence under some
conditions, but should not degrade performance significantly, and on the
other hand it should provide great improvements under some (other)
conditions.
True, let us try to find conditions/scenarios where you think it can give
a big boost; suggestions are welcome.
What if tablespaces are not on separate disks
I would expect that it might very slightly degrade performance, but only
marginally.
If you want to be able to disactivate balancing, it could be done with a
guc, but I cannot see good reasons to want to do that: it would complicate
the code and it does not make much sense to use many tablespaces on one
disk, while anyone who uses several tablespaces on several disks is
probably expecting to see her expensive disks actually used in parallel.
I think we can leave this for the committer to take a call, or for anybody
else who has an opinion; there is nothing wrong in what you have done, but
I am not sure there is a clear need for it.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
Speed of Reads -
Timing cached reads: 27790 MB in 1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in 3.00 seconds = 1276.55 MB/sec
Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
Copy speed -
dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s
Woops, 1.6 GB/s write... same questions, "rotating plates"?? Looks more
like several SSD... Or the file is kept in memory and not committed to
disk yet? Try a "sync" afterwards??
If these are SSD, or if there is some SSD cache on top of the HDD, I would
not expect the patch to do much, because the SSD random I/O writes are
pretty comparable to sequential I/O writes.
I would be curious whether flushing helps, though.
max_wal_size=5GB
Hmmm... Maybe quite small given the average performance?
We can check with larger value, but do you expect some different
results and why?
Because checkpoints are xlog triggered (which depends on max_wal_size) or
time triggered (which depends on checkpoint_timeout). Given the large tps,
I expect that the WAL is filled very quickly hence may trigger checkpoints
every ... that is the question.
checkpoint_timeout=2min
This seems rather small. Are the checkpoints xlog or time triggered?
I wanted to test by triggering more checkpoints, but I can test with
larger checkpoint interval as wel like 5 or 10 mins. Any suggestions?
For a +2 hours test, I would suggest 10 or 15 minutes.
It would be useful to know about checkpoint stats before suggesting values
for max_wal_size and checkpoint_timeout.
[...] The value used in your script was 0.8 for
checkpoint_completion_target which I have not changed during tests.
Ok.
parallelism - 128 clients, 128 threads [...]
In next run, I can use it with 64 threads, lets settle on other parameters
first for which you expect there could be a clear win with the first patch.
Ok.
Given the hardware, I would suggest to raise checkpoint_timeout,
shared_buffers and max_wal_size, [...]. I would expect that it should
improve performance both with and without sorting.
I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for 300 scale factor data,
It fits at the beginning, but when updates and inserts are performed
postgres adds new pages (update = delete + insert), and the deleted space
is eventually reclaimed by vacuum later on.
Now if space is available in the page it is reused, so what really happens
is not that simple...
At 8500 tps the disk space extension for tables may be up to 3 MB/s at the
beginning, and would evolve but should be at least about 0.6 MB/s (insert
in history, assuming updates are performed in page), on average.
So whether the database fits in 8 GB shared buffer during the 2 hours of
the pgbench run is an open question.
checkpoint_completion_target is already 0.8 in my previous tests. Lets
try with checkpoint_timeout = 10 min and max_wal_size = 15GB, do you
have any other suggestion?
Maybe shared_buffers = 32GB to ensure that it is a "in buffer" run ?
It would be interesting to have informations from checkpoint logs
(especially how many buffers written in how long, whether checkpoints
are time or xlog triggered, ...).
Information still welcome.
Hmm.. nothing like that, this was based on a couple of tests done by me,
and I am open to doing some more if you or anybody else feels that the
first patch (checkpoint-continuous-flush-10-a) alone can give a benefit;
in fact I have started these tests with the intention of seeing whether
the first patch gives a benefit, so that it could be evaluated and
eventually committed separately.
Ok.
My initial question remains: is the setup using HDDs? For SSD there should
be probably no significant benefit with sorting, although it should not
harm, and I'm not sure about flushing.
True, let us try to find conditions/scenarios where you think it can give
big boost, suggestions are welcome.
HDDs?
I think we can leave this for committer to take a call or if anybody
else has any opinion, because there is nothing wrong in what you
have done, but I am not clear if there is a clear need for the same.
I may have an old box available with two disks, so that I can run some
tests with tablespaces, but with very few cores.
--
Fabien.
On Tue, Sep 1, 2015 at 5:30 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
Speed of Reads -
Timing cached reads: 27790 MB in 1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in 3.00 seconds = 1276.55 MB/sec
Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
Yes, there is no SSD in system. I have confirmed the same. There are RAID
spinning drives.
Copy speed -
dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s
Woops, 1.6 GB/s write... same questions, "rotating plates"??
One thing to notice is that if I don't remove the output file (output.img)
the
speed is much slower, see the below output. I think this means in our case
we will get ~320 MB/s
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.28086 s, 1.7 GB/s
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.72301 s, 319 MB/s
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.73963 s, 319 MB/s
If I remove the file each time:
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.2855 s, 1.7 GB/s
rm /data/akapila/output.img
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27725 s, 1.7 GB/s
rm /data/akapila/output.img
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27417 s, 1.7 GB/s
rm /data/akapila/output.img
Looks more like several SSD... Or the file is kept in memory and not
committed to disk yet? Try a "sync" afterwards??
If these are SSD, or if there is some SSD cache on top of the HDD, I would
not expect the patch to do much, because the SSD random I/O writes are
pretty comparable to sequential I/O writes.
I would be curious whether flushing helps, though.
Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configuration where these patches can give a benefit, or
verify whether there is any regression, as I have access to this m/c for a
very limited time. This m/c might get formatted soon for some other purpose.
max_wal_size=5GB
Hmmm... Maybe quite small given the average performance?
We can check with a larger value, but do you expect some different results
and why?
Because checkpoints are xlog triggered (which depends on max_wal_size) or
time triggered (which depends on checkpoint_timeout). Given the large tps,
I expect that the WAL is filled very quickly hence may trigger checkpoints
every ... that is the question.
checkpoint_timeout=2min
This seems rather small. Are the checkpoints xlog or time triggered?
I wanted to test by triggering more checkpoints, but I can test with a
larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?
For a +2 hours test, I would suggest 10 or 15 minutes.
Okay, let's keep it at 10 minutes.
I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for 300 scale factor data,
It fits at the beginning, but when updates and inserts are performed
postgres adds new pages (update = delete + insert), and the deleted space
is eventually reclaimed by vacuum later on.
Now if space is available in the page it is reused, so what really happens
is not that simple...
At 8500 tps the disk space extension for tables may be up to 3 MB/s at the
beginning, and would evolve but should be at least about 0.6 MB/s (insert
in history, assuming updates are performed in page), on average.
So whether the database fits in 8 GB shared buffer during the 2 hours of
the pgbench run is an open question.
With this kind of configuration, I have noticed that more than 80%
of updates are HOT updates and there is not much bloat, so I think it won't
cross the 8GB limit, but still I can keep it at 32GB if you have any doubts.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
Yes, there is no SSD in system. I have confirmed the same. There are RAID
spinning drives.
Ok...
I guess that there is some kind of cache to explain these great tps
figures, probably on the RAID controller. What does "lspci" say? Does
"hdparm" suggest that the write cache is enabled? It would be fine if the
I/O system has a BBU, but that could also hide some of the patch
benefits...
A tentative explanation for the similar figures with and without sorting
could be that depending on the controller cache size (may be 1GB or more)
and firmware, the I/O system reorders disk writes so that they are
basically sequential and the fact that pg sorts them beforehand has little
or no impact. This may also be helped by the fact that buffers are not
really in random order to begin with as the warmup phase does an initial
"select stuff from table".
There could be other possible factors such as the file system details,
"WAFL" hacks... the tricks are endless:-)
Checking for the right explanation would involve removing the
unconditional select warmup to use only a long and random warmup, and
probably trying a much-larger-than-cache database, and/or disabling the
write cache, reading the hardware documentation in detail... But this is
also a lot of bother and time.
Maybe the simplest approach would be to disable the write cache for the
test (on Linux, something like "hdparm -W 0" on the underlying devices may
do it, if the controller allows it). Is that possible?
Woops, 1.6 GB/s write... same questions, "rotating plates"??
One thing to notice is that if I don't remove the output file
(output.img) the speed is much slower, see the below output. I think
this means in our case we will get ~320 MB/s
I would say that the OS was doing something here, and 320 MB/s looks more
like an actual HDD RAID system sequential write performance.
If these are SSD, or if there is some SSD cache on top of the HDD, I would
not expect the patch to do much, because the SSD random I/O writes are
pretty comparable to sequential I/O writes.
I would be curious whether flushing helps, though.
Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configuration where these patches can give a benefit, or
verify whether there is any regression, as I have access to this m/c
for a very limited time. This m/c might get formatted soon for
some other purpose.
Yep, it would be great if you have time for a flush test before it
disappears... I think it is advisable to disable the write cache as it may
also hide the impact of flushing.
So whether the database fits in 8 GB shared buffer during the 2 hours of
the pgbench run is an open question.
With this kind of configuration, I have noticed that more than 80%
of updates are HOT updates, not much bloat, so I think it won't
cross 8GB limit, but still I can keep it to 32GB if you have any doubts.
The problem with performance tests is that you want to test one thing, but
there are many factors that intervene and you may end up testing something
else, such as lock contention or process scheduler or whatever, rather
than what you were trying to put in evidence. So I would suggest being on
the safe side and using the larger value.
--
Fabien.
I would be curious whether flushing helps, though.
Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configuration where these patches can give a benefit, or
verify whether there is any regression, as I have access to this m/c for a
very limited time. This m/c might get formatted soon for some other purpose.
Yep, it would be great if you have time for a flush test before it
disappears... I think it is advisable to disable the write cache as it may
also hide the impact of flushing.
Still thinking... Depending on the results, it might be interesting to
have these tests run with the write cache enabled as well, to check how
much it interferes positively with performance.
I would guess "quite a lot".
--
Fabien.
Hi,
Here's a bunch of comments on this (hopefully the latest?) version of
the patch:
* I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
foresee other callsites needing similar logic? Wouldn't it be just as
easy to put this logic into the checkpointing code?
* We don't do one-line ifs; function parameters are always in the same
line as the function name
* Wouldn't a binary heap over the tablespaces + progress be nicer? If
you make the sorting criterion include the tablespace id you wouldn't
need the lookahead loop in NextBufferToWrite(). Isn't the current
approach O(NBuffers^2) in the worst case?
Greetings,
Andres Freund
Hello Andres,
Here's a bunch of comments on this (hopefully the latest?)
Who knows?! :-)
version of the patch:
* I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
forsee other callsites needing similar logic?
I foresee that the bgwriter should also do something more sensible than
generating random I/Os over HDDs, and this is also true for workers... But
this is for another time, maybe.
Wouldn't it be just as easy to put this logic into the checkpointing
code?
Not sure it would simplify anything, because the checkpointer currently
knows about buffers but flushing is about files, which are hidden from
view.
Doing it with this API change means that the code does not have to compute
twice which file a buffer belongs to: the buffer/file boundary has to be
broken somewhere anyway so that flushing can be done when needed, and the
solution I took seems the simplest way to do it, without having to make
the checkpointer too file-conscious.
* We don't do one-line ifs;
Ok, I'll return them.
function parameters are always in the same line as the function name
Ok, I'll try to improve.
* Wouldn't a binary heap over the tablespaces + progress be nicer?
I'm not sure where it would fit exactly.
Anyway, I think it would complicate the code significantly (compared to
the straightforward array), so I would not do anything like that without a
strong incentive, such as an actual failing case.
Moreover such a data structure would probably require some kind of pointer
(probably 8 bytes added per node, maybe more), and the amount of memory is
already a concern, at least to me; also it has to reside in shared memory,
which does not simplify allocation of tree data structures.
If you make the sorting criterion include the tablespace id you wouldn't
need the lookahead loop in NextBufferToWrite().
Yep, I thought of it. It would mean 4 more bytes per buffer, and bsearch
to find the boundaries, so significantly less simple code. I think that
the current approach is ok as the number of tablespace should be small.
It may be improved upon later if there is a motivation to do so.
Isn't the current approach O(NBuffers^2) in the worst case?
ISTM that the overall lookahead complexity is Nbuffers * Ntablespace:
buffers are scanned once for each tablespace. I assume that the number of
tablespaces is kept low, and having simpler code which uses less memory
seems a good idea.
ISTM that using a tablespace in the sorting would reduce the complexity
to ln(NBuffers) * Ntablespace for finding the boundaries, and then
Nbuffers * (Ntablespace/Ntablespace) = NBuffers for scanning, at the
expense of more memory and code complexity.
So this is a voluntary design decision.
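For the record, the alternative would look roughly like the sketch below,
assuming CheckpointSortItem gained a spcNode field (4 more bytes per
buffer); with the tablespace as leading sort key, each tablespace's buffers
would form one contiguous run in CheckpointBufferIds whose boundaries could
then be located by binary search:

static int
bufcmp_with_spc(const void *pa, const void *pb)
{
	const CheckpointSortItem *a = (const CheckpointSortItem *) pa;
	const CheckpointSortItem *b = (const CheckpointSortItem *) pb;

	/* hypothetical leading key: the tablespace */
	if (a->spcNode != b->spcNode)
		return a->spcNode < b->spcNode ? -1 : 1;
	if (a->relNode != b->relNode)
		return a->relNode < b->relNode ? -1 : 1;
	if (a->forkNum != b->forkNum)
		return a->forkNum < b->forkNum ? -1 : 1;
	/* same relation & fork: order by block number */
	return a->blockNum < b->blockNum ? -1 : 1;
}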
--
Fabien.
Here is a rebased two-part v11.
* We don't do one-line ifs;
I've found one instance.
function parameters are always in the same line as the function name
ISTM that I did that, or maybe I did not understand what I've done wrong.
--
Fabien.
Attachments:
checkpoint-continuous-flush-11a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e3dc23b..96c9a2f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting allows grouping together
+ neighboring pages written to disk, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSDs
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for the actual data storage
+ <xref linkend="guc-checkpoint-sort"> allows sorting pages
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages induces no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 127bc58..74412a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7999,11 +7999,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8034,6 +8036,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8052,8 +8058,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8061,6 +8067,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..cc951e1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,7 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +96,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1565,130 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* checkpoint buffers comparison */
+static int bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file": compare block numbers,
+ * which are mapped onto different segments depending on their value.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/*
+ * Entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
+/*
+ * Return the next buffer to write, or -1.
+ * This function balances buffers over tablespaces, see comment inside.
+ */
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while ((int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /*
+ * Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index + 1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1702,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1739,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1763,99 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found)
+ entry->count++;
+ else
+ entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /* sort buffer ids to help find sequential writes */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1869,46 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+		 * or all of the tablespace's buffers have been written out. If so,
+		 * another active tablespace status is moved in place of the current
+		 * one and the next round will start on it, or else wrap around.
+		 *
+		 * Note: maybe an exchange could be made instead in order to keep
+		 * information about the closed tablespace, but this is currently
+		 * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..cf1e505 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 695a88f..d4dfc25 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..32f2006 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,23 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible. Maybe the sort criterion could be compacted
+ * to reduce the memory requirement and allow faster comparisons?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* hm... enum with only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
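
As a side note on the sort part: the comparator orders checkpoint buffers
by relation, then fork, then block number, so that writes to the same
segmented file come out sequential and ascending. The ordering is easy to
check in isolation; here is a minimal standalone sketch, with simplified
stand-in types rather than the patch's structures:

/* Standalone sketch of the checkpoint sort order; plain C types stand in
 * for Oid, ForkNumber and BlockNumber. Compiles with any C compiler. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Item
{
    int      buf_id;
    unsigned relNode;
    int      forkNum;
    unsigned blockNum;
} Item;

/* same ordering as bufcmp: relation first, then fork, then block */
static int
itemcmp(const void *pa, const void *pb)
{
    const Item *a = pa;
    const Item *b = pb;

    if (a->relNode != b->relNode)
        return a->relNode < b->relNode ? -1 : 1;
    if (a->forkNum != b->forkNum)
        return a->forkNum < b->forkNum ? -1 : 1;
    /* should not be the same block anyway... */
    return a->blockNum < b->blockNum ? -1 : 1;
}

int
main(void)
{
    Item items[] = {
        {0, 16385, 0, 2}, {1, 16384, 1, 0}, {2, 16384, 0, 7}, {3, 16384, 0, 3}
    };
    int  n = sizeof(items) / sizeof(items[0]);
    int  i;

    qsort(items, n, sizeof(Item), itemcmp);
    for (i = 0; i < n; i++)
        printf("rel=%u fork=%d block=%u (buf %d)\n",
               items[i].relNode, items[i].forkNum, items[i].blockNum,
               items[i].buf_id);
    return 0;
}

It prints the two blocks of relation 16384's first fork first (blocks 3
then 7), then its other fork, then relation 16385: exactly the per-file
sequential pattern the patch aims for.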
Attachment: checkpoint-continuous-flush-11b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 96c9a2f..927294b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+        data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..1b658f2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,18 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+   allows hinting to the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+   at maximum throughput on some OSes. It should be beneficial for high write
+   loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+   write latency is small on such hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..2a8f645 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc951e1..9da996e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -80,6 +80,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
bool checkpoint_sort = true;
/*
@@ -400,7 +402,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -413,7 +416,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1022,7 +1026,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1709,6 +1713,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1796,10 +1801,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1816,6 +1823,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1871,7 +1884,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1881,7 +1895,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1898,6 +1913,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1908,6 +1930,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2155,7 +2179,7 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state = SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2232,7 +2256,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2273,7 +2298,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2535,9 +2560,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2626,7 +2658,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3048,7 +3082,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3082,7 +3118,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3134,7 +3170,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..e880a9e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /*
+ * Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+		 * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* other file: do flush previous file & reset flush accumulator */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+	 * but when that actually happens is chosen by the OS.
+	 * Depending on other disk activity this may be delayed significantly,
+	 * maybe up to an "fsync" call, which could induce an I/O write surge.
+	 * When checkpointing, pg is doing its own throttling and the result
+	 * should really be written to disk with high priority, so as to meet
+	 * the completion target.
+	 * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cf1e505..9219330 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,17 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9806,6 +9818,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+	/*
+	 * This test must be consistent with the one in FileWrite
+	 * (storage/file/fd.c).
+	 */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d4dfc25..01b1c96 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..4fd3ff5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,14 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c7b2a6d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,24 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FileFlushContext structure:
+ *
+ * This is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offsets)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext {
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +88,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
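
And on the flush part: FileAsynchronousFlush merges successive flush
requests on the same fd by taking the minimum offset and the maximum end.
That arithmetic can be checked in isolation too; below is a minimal
standalone sketch, with a simplified stand-in for FileFlushContext. Note
that merging distant blocks yields one extent which also covers the hole
between them; that is the intended trade-off of issuing one larger
sync_file_range call instead of many small ones:

/* Standalone sketch of the range merging in FileAsynchronousFlush. */
#include <stdio.h>

typedef struct Range
{
    long offset;                /* start of the accumulated range */
    long nbytes;                /* extent of the accumulated range */
} Range;

/* merge [offset, offset + nbytes) into the accumulated range */
static void
merge(Range *acc, long offset, long nbytes)
{
    long new_offset = offset < acc->offset ? offset : acc->offset;
    long acc_end = acc->offset + acc->nbytes;
    long end = offset + nbytes;

    acc->nbytes = (acc_end > end ? acc_end : end) - new_offset;
    acc->offset = new_offset;
}

int
main(void)
{
    Range acc = {8192, 8192};   /* one 8 kB block at offset 8192 */

    merge(&acc, 0, 8192);       /* earlier block: extends the start */
    merge(&acc, 24576, 8192);   /* later block: extends the end */
    printf("flush offset=%ld nbytes=%ld\n", acc.offset, acc.nbytes);
    /* prints: flush offset=0 nbytes=32768 -- the 8 kB hole is covered too */
    return 0;
}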
On 2015-09-06 19:05, Fabien COELHO wrote:
> Here is a rebased two-part v11.
>
>> function parameters are always in the same line as the function name
>
> ISTM that I did that, or maybe I did not understand what I've done wrong.
I see one instance of this issue
+static int
+NextBufferToWrite(
+ TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
Also
+static int bufcmp(const void * pa, const void * pb)
+{
should IMHO be formatted as
+static int
+bufcmp(const void * pa, const void * pb)
+{
And I think we generally put the struct typedefs at the top of the C
file and don't mix them with function definitions (I am talking about
the TableSpaceCheckpointStatus and TableSpaceCountEntry).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello Petr,
>>> function parameters are always in the same line as the function name
>>
>> ISTM that I did that, or maybe I did not understand what I've done wrong.
>
> I see one instance of this issue
>
> +static int
> +NextBufferToWrite(
> +    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
> +    int *pspace, int num_to_write, int num_written)

Ok, I was looking for function calls.

> should IMHO be formatted as
>
> +static int
> +bufcmp(const void * pa, const void * pb)
> +{

Indeed.

> And I think we generally put the struct typedefs at the top of the C
> file and don't mix them with function definitions (I am talking about
> the TableSpaceCheckpointStatus and TableSpaceCountEntry).

Ok, moved up.
Thanks for the hints! The attached two-part v12 fixes these.
--
Fabien.
Attachments:
checkpoint-continuous-flush-12a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e3dc23b..96c9a2f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+        Whether to sort buffers before writing them out to disk on checkpoint.
+        On HDD storage, this setting allows grouping together
+        neighboring pages written to disk, thus improving performance by
+        reducing random write activity.
+        This sorting should have limited performance effects on SSD backends
+        as such storage has good random write performance, but it may
+        help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+   When hard-disk drives (HDD) are used for data storage,
+   <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+   so that neighboring pages on disk will be flushed together by
+   checkpoints, reducing the random write load and improving performance.
+   If solid-state drives (SSD) are used, sorting pages brings no benefit
+   as their random write I/O performance is good: this feature could then
+   be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+   It is possible that sorting may help with SSD wear leveling, so it may
+   be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 127bc58..74412a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7999,11 +7999,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8034,6 +8036,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8052,8 +8058,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8061,6 +8067,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cd3aaad..dae5954 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -75,11 +75,36 @@ typedef struct PrivateRefCountEntry
/* 64 bytes, about the size of a cache line on common systems */
#define REFCOUNT_ARRAY_ENTRIES 8
+/*
+ * Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/*
+ * Entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -95,6 +120,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1561,6 +1589,106 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* checkpoint buffers comparison */
+static int
+bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+	/* same relation/fork, so same segmented "file"; compare block numbers,
+	 * which are mapped onto different segments depending on their value.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Return the next buffer to write, or -1.
+ * This function balances buffers over tablespaces; see the comments inside.
+ */
+static int
+NextBufferToWrite(TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+	 * The progress ratio of each unfinished tablespace is compared to
+	 * the overall progress ratio to find one which is not in advance
+	 * (i.e. tablespace written/to_write <= overall written/to_write).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+	 * If w1/t1 > (w1+w2)/(t1+t2)   # one tablespace is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while ((int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+	 * Find a valid buffer in the selected tablespace by continuing the
+	 * tablespace-specific buffer scan where it left off.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /*
+ * Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index + 1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1574,11 +1702,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1609,6 +1739,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1763,99 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found)
+ entry->count++;
+ else
+ entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /* sort buffer ids to help find sequential writes */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+	 * with some round robin over tablespaces to balance writes, so that
+	 * buffer writes move forward roughly proportionally for each
+	 * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+	 * Termination: if a tablespace is selected by the inner while loop
+	 * (see the argument there), its index is incremented and will eventually
+	 * reach num_to_write, marking this tablespace's scan as done and
+	 * decrementing the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1660,39 +1869,46 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+		 * or all of the tablespace's buffers have been written out. If so,
+		 * another active tablespace status is moved in place of the current
+		 * one and the next round will start on it, or else wrap around.
+		 *
+		 * Note: maybe an exchange could be made instead in order to keep
+		 * information about the closed tablespace, but this is currently
+		 * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..cf1e505 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1013,6 +1013,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 695a88f..d4dfc25 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6dacee2..dbd4757 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..32f2006 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,23 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible. Maybe the sort criterion could be compacted
+ * to reduce the memory requirement and allow faster comparisons?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* hm... enum with only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..c228f39 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
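
One more note, on the precision comment in NextBufferToWrite: the int32
overflow it warns about really does flip the comparison once a product
exceeds INT32_MAX, i.e. once counts pass sqrt(2^31) ~ 46341 buffers. A
standalone sketch with made-up counts, not patch code:

/* Standalone sketch of the overflow discussed in NextBufferToWrite:
 * comparing fractions w1/t1 > w/t by cross-multiplication. */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    int32_t w1 = 46341, t1 = 46341;     /* first fraction, w1/t1 = 1 */
    int32_t w = 46340, t = 46341;       /* second fraction, w/t < 1 */

    /*
     * Exact arithmetic: w1/t1 = 1 > w/t, so the comparison should be true.
     * But w1 * t = 46341 * 46341 = 2147488281 > INT32_MAX; signed overflow
     * is undefined behaviour in C, and on common two's-complement hardware
     * the product wraps negative, so the 32-bit comparison comes out wrong.
     */
    printf("int32: %d\n", w1 * t > w * t1);                     /* typically 0 */
    printf("int64: %d\n", (int64_t) w1 * t > (int64_t) w * t1); /* 1, correct */
    return 0;
}

This is why the patch casts both sides to int64 before multiplying.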
Attachment: checkpoint-continuous-flush-12b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 96c9a2f..927294b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2497,6 +2497,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+        data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..1b658f2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,18 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+   allows hinting to the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+   at maximum throughput on some OSes. It should be beneficial for high write
+   loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+   write latency is small on such hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..2a8f645 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dae5954..74f3914 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -104,6 +104,8 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
bool checkpoint_sort = true;
/*
@@ -424,7 +426,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -437,7 +440,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1046,7 +1050,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1709,6 +1713,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1796,10 +1801,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1816,6 +1823,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1871,7 +1884,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1881,7 +1895,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1898,6 +1913,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1908,6 +1930,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2155,7 +2179,7 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state = SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2232,7 +2256,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2273,7 +2298,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2535,9 +2560,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter hints to the OS that a high-priority write is intended,
+ * possibly because I/O throttling is already managed elsewhere.
+ * The last parameter holds the current flush context, which accumulates flush
+ * requests so that they can be performed in one call, instead of on a
+ * buffer-by-buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2626,7 +2658,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3048,7 +3082,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3082,7 +3118,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3134,7 +3170,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..e880a9e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /*
+ * Linux: tell the memory manager to move these blocks to the I/O
+ * layer so that they are considered for actually being written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the I/O layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* other file: do flush previous file & reset flush accumulator */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however when it actually happens is chosen by the OS.
+ * Depending on other disk activity this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an I/O write surge.
+ * When checkpointing, pg does its own throttling, and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cf1e505..9219330 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1025,6 +1026,17 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9806,6 +9818,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d4dfc25..01b1c96 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c228f39..4fd3ff5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,14 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c7b2a6d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,24 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FileFlushContext structure:
+ *
+ * This is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offsets)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext {
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +88,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
Wouldn't it be just as easy to put this logic into the checkpointing code?
Not sure it would simplify anything, because the checkpointer currently
knows about buffers but flushing is about files, which are hidden from
view.
It'd not really simplify things, but it'd keep it local.
* Wouldn't a binary heap over the tablespaces + progress be nicer?
I'm not sure where it would fit exactly.
Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.
Anyway, I think it would complicate the code significantly (compared to the
straightforward array)
I doubt it. I mean instead of your GetNext you'd just do:
next_tblspc = DatumGetPointer(binaryheap_first(heap));
if (next_tblspc == NULL)
    return 0;
next_tblspc->progress += next_tblspc->progress_slice;
binaryheap_replace_first(heap, PointerGetDatum(next_tblspc));
return next_tblspc->nextbuf++;
progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.
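To make the proposal concrete, here is a standalone toy model of that
heap-driven balancing. Everything in it is illustrative: in-tree code would
use src/include/lib/binaryheap.h, and progress here is simply the fraction of
the tablespace already written, so the slice is 1/nbuffers (the exact
bookkeeping in the proposal may differ).

#include <stdio.h>

typedef struct
{
    int     spcid;          /* fake tablespace id */
    double  progress;       /* fraction of this tablespace already written */
    double  progress_slice; /* 1.0 / number of buffers in the tablespace */
    int     nextbuf;        /* fake index of the next buffer to write */
} spc;

static spc  heap[8];
static int  heap_size;

/* restore the min-heap property on "progress", starting at slot i */
static void
sift_down(int i)
{
    for (;;)
    {
        int     l = 2 * i + 1, r = l + 1, min = i;

        if (l < heap_size && heap[l].progress < heap[min].progress)
            min = l;
        if (r < heap_size && heap[r].progress < heap[min].progress)
            min = r;
        if (min == i)
            return;
        spc     tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

/* return the next buffer of the least-advanced tablespace, or -1 if done */
static int
next_buffer(void)
{
    if (heap_size == 0)
        return -1;
    heap[0].progress += heap[0].progress_slice;
    if (heap[0].progress > 1.0 + 1e-9)  /* this tablespace is exhausted */
    {
        heap[0] = heap[--heap_size];
        sift_down(0);
        return next_buffer();
    }
    int     buf = heap[0].nextbuf++;
    sift_down(0);
    return buf;
}

int
main(void)
{
    /* two tablespaces: 4 buffers starting at 0, 2 buffers starting at 100 */
    heap[0] = (spc) {1, 0.0, 1.0 / 4, 0};
    heap[1] = (spc) {2, 0.0, 1.0 / 2, 100};
    heap_size = 2;

    for (int b; (b = next_buffer()) >= 0; )
        printf("%d ", b);       /* prints: 0 100 1 2 101 3 */
    printf("\n");
    return 0;
}

Compiled alone, it interleaves the two tablespaces in proportion to their
sizes, which is the property the balancing is after.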
Moreover such a data structure would probably require some kind of pointer
(probably 8 bytes added per node, maybe more), and the amount of memory is
already a concern, at least to me, and moreover it has to reside in shared
memory which does not simplify allocation of tree data structures.
I'm not seing where you'd need an extra pointer? Maybe the
misunderstanding is that I'm proposing to do a heap over the
*tablespaces* not the actual buffers.
If you make the sorting criterion include the tablespace id you wouldn't
need the lookahead loop in NextBufferToWrite().

Yep, I thought of it. It would mean 4 more bytes per buffer, and bsearch to
find the boundaries, so significantly less simple code.
What would you need the bsearch for?
I think that the current approach is ok as the number of tablespaces
should be small.
Right that's often the case.
Isn't the current approach O(NBuffers^2) in the worst case?
ISTM that the overall lookahead complexity is Nbuffers * Ntablespace:
buffers are scanned once for each tablespace.
Which in the worst case is NBuffers * 2...
ISTM that using a tablespace in the sorting would reduce the complexity to
ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
memory and code complexity.
Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?
Greetings,
Andres Freund
On Mon, Sep 7, 2015 at 3:09 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
Wouldn't it be just as easy to put this logic into the checkpointing
code?
Not sure it would simplify anything, because the checkpointer currently
knows about buffers but flushing is about files, which are hidden from
view.

It'd not really simplify things, but it'd keep it local.
How about using the value of guc (checkpoint_flush_to_disk) and
AmCheckpointerProcess to identify whether to do async flush in FileWrite?
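For concreteness, a hypothetical sketch of that alternative inside
FileWrite(), deciding from the GUC and the process type instead of extra
arguments; checkpointer_flush_context is an invented file-level global, and
this is not what the patch does:

/* hypothetical: decide locally instead of via new parameters */
if (checkpoint_flush_to_disk && AmCheckpointerProcess() &&
    returnCode == amount)
    FileAsynchronousFlush(&checkpointer_flush_context,
                          VfdCache[file].fd, VfdCache[file].seekPos,
                          amount, VfdCache[file].fileName);

The accumulated context would still have to be flushed and reset from the
checkpointer loop, so the flushing could not be entirely hidden inside fd.c
with this variant either.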
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
It'd not really simplify things, but it'd keep it local.
How about using the value of guc (checkpoint_flush_to_disk) and
AmCheckpointerProcess to identify whether to do async flush in
FileWrite?
ISTM that what you suggest would just replace the added function arguments
with global variables to communicate and keep the necessary data for
managing the asynchronous flushing, which is called per tablespace
(1) on file changes (2) when the checkpointer is going to sleep.
Although it can be done obviously, I prefer to have functions arguments
rather than global variables, on principle.
Also, because of (2) and of the dependency on the number of tablespaces
being flushed, the flushing stuff cannot be fully hidden from the
checkpointer anyway.
Also I think that probably the bgwriter should do something similar, so
function parameters would be useful to drive flushing from it, rather than
adding yet another set of global variables, or share the same variables
for somehow different purposes.
So having these added parameters look reasonable to me.
--
Fabien.
On Sat, Sep 5, 2015 at 12:26 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I would be curious whether flushing helps, though.
Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configurations where these patches can give a benefit, or
where we want to verify if there is any regression, as I have access to this
m/c for a very limited time. This m/c might get formatted soon for some other
purpose.

Yep, it would be great if you have time for a flush test before it
disappears... I think it is advisable to disable the write cache as it may
also hide the impact of flushing.

Still thinking... Depending on the results, it might be interesting to
have these tests run with the write cache enabled as well, to check how
much it interferes positively with performance.
I have done some tests with both the patches (sort+flush) and below
are the results:
M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
Test - 1 (Data Fits in shared_buffers)
--------------------------------------------------------
non-default settings used in script provided by Fabien upthread
used below options for pgbench and the same is used for rest
of tests as well.
fw) ## full speed parallel write pgbench
run="FW"
opts="-M prepared -P 1 -T $time $para"
;;
warmup=1000
scale=300
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=15GB
para="-j 64 -c 128"
checkpoint_completion_target=0.8
checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"
Flush - off and Sort - off
avg over 7203: 27480.350104 ± 12791.098857 [0.000000, 16009.400000,
32109.200000, 37629.000000, 51671.400000]
percent of values below 10.0: 2.8%
Flush - off and Sort - on
avg over 7200: 27482.501264 ± 12552.036065 [0.000000, 16587.250000,
31225.950000, 37516.450000, 51296.900000]
percent of values below 10.0: 2.8%
Flush - on and Sort - off
avg over 7200: 25214.757292 ± 11059.709509 [5268.000000, 14188.400000,
26472.450000, 35626.100000, 51479.000000]
percent of values below 10.0: 0.0%
Flush - on and Sort - on
avg over 7200: 26819.631722 ± 10589.745016 [5191.700000, 16825.450000,
29429.750000, 35707.950000, 51475.100000]
percent of values below 10.0: 0.0%
For this test run, the best results are when both the sort and flush options
are enabled: the lowest TPS value is increased substantially without
sacrificing much on the average or median TPS values (though there is a ~9%
dip in the median TPS value). When only sorting is enabled, there is neither
significant gain nor any loss. When only flush is enabled, there is
significant degradation in both the average and median TPS, ~8% and ~21%
respectively.
Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
warmup=1000
scale=3000
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=25GB
para="-j 64 -c 128"
checkpoint_completion_target=0.8
checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"
Flush - off and Sort - off
avg over 7200: 5050.059444 ± 4884.528702 [0.000000, 98.100000, 4699.100000,
10125.950000, 13631.000000]
percent of values below 10.0: 7.7%
Flush - off and Sort - on
avg over 7200: 6194.150264 ± 4913.525651 [0.000000, 98.100000, 8982.000000,
10558.000000, 14035.200000]
percent of values below 10.0: 11.0%
Flush - on and Sort - off
avg over 7200: 2771.327472 ± 1860.963043 [287.900000, 2038.850000,
2375.500000, 2679.000000, 12862.000000]
percent of values below 10.0: 0.0%
Flush - on and Sort - on
avg over 7200: 6110.617722 ± 1939.381029 [1652.200000, 5215.100000,
5724.000000, 6196.550000, 13828.000000]
percent of values below 10.0: 0.0%
For this test run, again the best results are when both the sort and flush
options are enabled: the lowest TPS value is increased substantially,
and the average and median TPS have also increased by
~21% and ~22% respectively. When only sorting is enabled, there is a
significant gain in the average and median TPS values, but there is also
an increase in the number of times the TPS falls below 10, which is bad.
When only flush is enabled, there is significant degradation in both the
average and median TPS: the no-flush baseline is ~82% and ~97% higher,
respectively. Now I am not sure if such a big degradation could be expected
for this case or it's just a problem in this run; I have not repeated this
test.
Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
Same configuration and settings as above, but this time, I have enforced
Flush to use posix_fadvise() rather than sync_file_range() (basically,
changed the code to comment out sync_file_range() and enable
posix_fadvise()).
Flush - on and Sort - on
avg over 7200: 3400.915069 ± 739.626478 [1642.100000, 2965.550000,
3271.900000, 3558.800000, 6763.000000]
percent of values below 10.0: 0.0%
On using posix_fadvise(), the results for the best case (both flush and sort
on) show significant degradation in the average and median TPS values,
by ~48% and ~43%, which indicates that using posix_fadvise()
with the current options is probably not the best way to achieve the flush.
Overall, I think this patch (sort+flush) brings a lot of value to the table
in terms of stabilizing the TPS during checkpoints; however, some of the
cases, like the use of posix_fadvise() and the case where all data fits in
shared_buffers and the median TPS regresses, could be investigated
to see what can be done to improve them. I think more tests can be done
to assess the benefits or regressions of this patch, but for now this is the
best I can do.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
I have done some tests with both the patches (sort+flush) and below
are the results:
Thanks a lot for these runs on this great hardware!
Test - 1 (Data Fits in shared_buffers)
Rounded for easier comparison:
flush/sort
off off: 27480.4 ± 12791.1 [ 0, 16009, 32109, 37629, 51671] (2.8%)
off on : 27482.5 ± 12552.0 [ 0, 16587, 31226, 37516, 51297] (2.8%)
The two cases above are pretty indistinguishable; sorting has no impact.
The 2.8% means more than 1 minute offline per hour (2.8% of 3600 s is about
100 s), not necessarily as one contiguous minute: it may be distributed over
the whole hour.
on off: 25214.8 ± 11059.7 [5268, 14188, 26472, 35626, 51479] (0.0%)
on on : 26819.6 ± 10589.7 [5192, 16825, 29430, 35708, 51475] (0.0%)
For this test run, the best results are when both the sort and flush
options are enabled: the lowest TPS value is increased substantially
without sacrificing much on the average or median TPS values (though there
is a ~9% dip in the median TPS value). When only sorting is enabled, there is
neither significant gain nor any loss. When only flush is enabled,
there is significant degradation in both the average and median TPS,
~8% and ~21% respectively.
I interpret the five numbers in brackets as an indicator of performance
stability: they would all be equal under perfect stability. Once they show
some stability, the next point for me is to focus on the average performance. I
do not see a median decrease as a big issue if the average is reasonably
good.
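Assuming the bracketed numbers are a [min, q1, median, q3, max] summary of
the per-second tps samples (an assumption suggested by their ordering), a
minimal standalone sketch of how such a summary and the below-threshold
percentage can be computed:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double  x = *(const double *) a;
    double  y = *(const double *) b;

    return (x > y) - (x < y);
}

/* nearest-rank value at fraction f (0..1) of the sorted sample */
static double
quantile(const double *s, int n, double f)
{
    return s[(int) (f * (n - 1) + 0.5)];
}

int
main(void)
{
    /* illustrative per-second tps samples, not real measurements */
    double  tps[] = {12.0, 3.5, 80.0, 55.0, 41.0, 0.0, 66.0, 72.0};
    int     n = sizeof(tps) / sizeof(tps[0]), below = 0;

    qsort(tps, n, sizeof(double), cmp_double);
    for (int i = 0; i < n; i++)
        if (tps[i] < 10.0)
            below++;

    printf("[%g, %g, %g, %g, %g]  below 10.0: %.1f%%\n",
           tps[0], quantile(tps, n, 0.25), quantile(tps, n, 0.50),
           quantile(tps, n, 0.75), tps[n - 1], 100.0 * below / n);
    return 0;
}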
Thus I essentially note the -2.5% dip in average of on-on vs off-on. I
would say that it is probably significant, although it might be within the
measurement error. Not sure whether the small stddev reduction
is really significant. Anyway, the benefit is clear: 100% availability.
Flushing without sorting is a bad idea (tm), not a surprise.
Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
flush/sort
off off: 5050.1 ± 4884.5 [ 0, 98, 4699, 10126, 13631] ( 7.7%)
off on : 6194.2 ± 4913.5 [ 0, 98, 8982, 10558, 14035] (11.0%)
on off: 2771.3 ± 1861.0 [ 288, 2039, 2375, 2679, 12862] ( 0.0%)
on on : 6110.6 ± 1939.3 [1652, 5215, 5724, 6196, 13828] ( 0.0%)
I'm not sure that the off-on vs on-on -1.3% avg tps dip is significant,
but it may be. With both flushing and sorting pg becomes fully available,
and the standard deviation is divided by more than 2, so the benefit is
clear.
For this test run, again the best results are when both the sort and flush
options are enabled: the lowest TPS value is increased substantially,
and the average and median TPS have also increased by
~21% and ~22% respectively. When only sorting is enabled, there is a
significant gain in the average and median TPS values, but there is also
an increase in the number of times the TPS falls below 10, which is bad.
When only flush is enabled, there is significant degradation in both the
average and median TPS: the no-flush baseline is ~82% and ~97% higher,
respectively. Now I am not sure if such a big degradation could be expected
for this case or it's just a problem in this run; I have not repeated this
test.
Yes, I agree that it is strange that sorting without flushing on its own
improves performance (+20% tps) but seems to degrade availability at
the same time. A rerun would have helped to check whether it is a fluke or
it is reproducible.
Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
Same configuration and settings as above, but this time, I have enforced
Flush to use posix_fadvise() rather than sync_file_range() (basically
changed the code to comment out sync_file_range() and enable posix_fadvise()).

On using posix_fadvise(), the results for the best case (both flush and sort
on) show significant degradation in the average and median TPS values,
by ~48% and ~43%, which indicates that using posix_fadvise()
with the current options is probably not the best way to achieve the flush.
Yes, indeed.
The way posix_fadvise is implemented on Linux is between no effect and bad
effect (the buffer is erased). You hit the latter quite strongly... As you
are doing a "does not fit in shared buffers" test, it is essential that
buffers are kept in RAM, but posix_fadvise on Linux just instructs to erase
the buffer from memory if it was already passed to the I/O subsystem, which,
given the probably large I/O device cache on your host, should happen
pretty quickly, so that later reads must fetch the data back from the device
(either cache or disk), which means a drop in performance.
Note that the FreeBSD implementation seems more convincing, although less
good than Linux's sync_file_range function. I've no idea about other systems.
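The two system calls at issue, in a minimal standalone form; error handling
is trimmed, and the file name, offset and size are arbitrary:

/* minimal illustration of the two flush hints discussed in this thread */
#define _GNU_SOURCE             /* for sync_file_range() on Linux */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char    buf[8192];
    int     fd = open("/tmp/flushdemo", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        return 1;

#if defined(__linux__)
    /* start writeback of the range now, but keep the pages cached */
    sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE);
#else
    /* portable alternative: may also evict the pages from the OS cache,
     * which is exactly the drawback observed in the tests above */
    posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED);
#endif

    close(fd);
    return 0;
}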
Overall, I think this patch (sort+flush) brings a lot of value to the table
in terms of stabilizing the TPS during checkpoints; however, some of the
cases, like the use of posix_fadvise() and the case where all data fits in
shared_buffers and the median TPS regresses, could be
investigated to see what can be done to improve them. I think more
tests can be done to assess the benefits or regressions of this patch, but
for now this is the best I can do.
Thanks a lot, again, for these tests!
I think that we may conclude, from these runs:
(1) sorting seems not to harm performance, and may help a lot.
(2) Linux flushing with sync_file_range may degrade the raw tps
average a little in some cases, but definitely improves performance stability
(always 100% availability when on!).
(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other systems
is an open question...
These results are consistent with the current default values in the patch:
sorting is on by default, flushing is on with Linux and off otherwise
(posix_fadvise).
Also, as the effect on other systems is unclear, I think it is best to
keep both settings as GUCs for now.
--
Fabien.
On Tue, Sep 8, 2015 at 8:09 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Thanks a lot, again, for these tests!
I think that we may conclude, from these runs:
(1) sorting seems not to harm performance, and may help a lot.
I agree with the first part, but about helping a lot, I am not sure: based on
the tests conducted by me, among all the runs it has shown an improvement
in average TPS in one case, and that too with the TPS dipping below 10 a
number of times.
(2) Linux flushing with sync_file_range may degrade the raw tps
average a little in some cases, but definitely improves performance stability
(always 100% availability when on!).
Agreed, I think the benefit is quite clear, but it would be better if we try
to do some more tests for the case (data fits in shared_buffers) where
we saw a small regression, just to make sure that the regression is small.
(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other systems
is an open question...
I don't know what the best way to verify that is; if somebody else has
access to such a m/c, please help to get that verified.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Amit,
I think that we may conclude, from these runs:
(1) sorting seems not to harm performance, and may help a lot.
I agree with the first part, but about helping a lot, I am not sure
I'm focusing on the "sort" dimension alone, that is, I'm comparing the
average tps performance with sorting against the same test without sorting:
There are 4 cases from your tests, if I'm not mistaken:
- T1 flush=off 27480 -> 27482 : +0.0%
- T1 flush=on 25214 -> 26819 : +6.3%
- T2 flush=off 5050 -> 6194 : +22.6%
- T2 flush=on 2771 -> 6110 : +120.4%
The average improvement induced by sort=on is +50%; if you do not agree on
"a lot", maybe we can agree on "significantly" :-)
based on the tests conducted by me, among all the runs it has shown an
improvement in average TPS in one case, and that too with the TPS dipping
below 10 a number of times.
(2) Linux flushing with sync_file_range may degrade the raw tps
average a little in some cases, but definitely improves performance stability
(always 100% availability when on!).

Agreed, I think the benefit is quite clear, but it would be better if we try
to do some more tests for the case (data fits in shared_buffers) where
we saw a small regression, just to make sure that the regression is small.
I've already reported a lot of tests (several hundred hours on two
different hosts), and I did not see such a dip, but the hardware was more
"usual" or "casual", really different from your runs.
If you can run more tests, great!
I think that the main safeguard to handle the (small) uncertainty is to
keep GUCs to control these features.
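With the GUCs kept, the behaviour stays under the admin's control from
postgresql.conf, e.g. (values illustrative, names as in the patch):

checkpoint_sort = on
checkpoint_flush_to_disk = on   # patch default: on on Linux, off elsewhere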
(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other systems
is an open question...

I don't know what the best way to verify that is; if somebody else has
access to such a m/c, please help to get that verified.
Yep. There have been such calls on this thread, which have not been very
effective up to now.
--
Fabien.
On Tue, Sep 8, 2015 at 11:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other systems
is an open question...

I don't know what the best way to verify that is; if somebody else has
access to such a m/c, please help to get that verified.
Why wouldn't we just leave it out then? Putting it in when the one
platform we've tried it on shows a regression makes no sense. We
shouldn't include it and then remove it if someone can prove it's bad;
we should only include it in the first place if we have good
benchmarks showing that it is good.
Does anyone have a big Windows box they can try this on?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(3) posix_fadvise on Linux is a bad idea... the good news is that it
is not needed there:-) How good or bad an idea it is on other systems
is an open question...

I don't know what the best way to verify that is; if somebody else has
access to such a m/c, please help to get that verified.

Why wouldn't we just leave it out then? Putting it in when the one
platform we've tried it on shows a regression makes no sense. We
shouldn't include it and then remove it if someone can prove it's bad;
we should only include it in the first place if we have good benchmarks
showing that it is good.

Does anyone have a big Windows box they can try this on?
Just a box with a disk would be enough; it does not need to be big!
As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise seems much more reasonable than on Linux, and should be
profitable, so it would be a pity to remove it.
--
Fabien.
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise seems much more reasonable than on Linux, and should be
profitable, so it would be a pity to remove it.
Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?
Greetings,
Andres Freund
Hello Andres,
Wouldn't it be just as easy to put this logic into the checkpointing
code?Not sure it would simplify anything, because the checkpointer currently
knows about buffers but flushing is about files, which are hidden from
view.

It'd not really simplify things, but it'd keep it local.
Ok, it would be local, but it would also mean that the checkpointer would
have to deal explicitly with files, whereas it currently does not have
to.
I think that the current buffer/file boundary is, on engineering principle,
a good one, so I tried to break it as little as possible to enable the
feature, and I wanted to avoid having to do a buffer-to-file translation
twice, once in the checkpointer and once when writing the buffer.
* Wouldn't a binary heap over the tablespaces + progress be nicer?
I'm not sure where it would fit exactly.
Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.
It would replace what is currently an array. The balancing code needs to
enumerate all tablespaces in a round-robin way so as to ensure that all
tablespaces are given some attention; otherwise you could balance
two tablespaces while others are left out. The array makes this
property straightforward.
Anyway, I think it would complicate the code significantly (compared to the
straightforward array)

I doubt it. I mean instead of your GetNext you'd just do:
next_tblspc = DatumGetPointer(binaryheap_first(heap));
if (next_tblspc == NULL)
    return 0;
next_tblspc->progress += next_tblspc->progress_slice;
binaryheap_replace_first(heap, PointerGetDatum(next_tblspc));
return next_tblspc->nextbuf++;
Compared to the array, this tree approach would require ln(Ntablespace)
work to extract and reinsert the tablespace under progress, so there is no
complexity advantage.
Moreover, given that in most cases there are 1 or 2 tablespaces, a tree
structure is really on the heavy side.
progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.
If there are many buffers, I'm not too sure about rounding issues and the
like, so the current approach with a rational seems more secure.
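For what it's worth, a tiny standalone sketch of the exact rational
comparison, assuming progress is tracked as num_written/num_to_write per
tablespace (as in the patch's TableSpaceCheckpointStatus):

#include <stdint.h>
#include <stdbool.h>

/*
 * Exact comparison of two progress ratios wa/ta and wb/tb by
 * cross-multiplication: a/b < c/d iff a*d < c*b. With buffer counts
 * bounded by INT32_MAX, the 64-bit products cannot overflow, and no
 * floating-point rounding is involved.
 */
static bool
progress_lt(int64_t wa, int64_t ta, int64_t wb, int64_t tb)
{
    return wa * tb < wb * ta;
}

int
main(void)
{
    /* 3/10 of the first tablespace written vs 2/5 of the second:
     * 3*5 = 15 < 2*10 = 20, so the first one lags and goes next */
    return progress_lt(3, 10, 2, 5) ? 0 : 1;
}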
[...] I'm not seeing where you'd need an extra pointer?
Indeed, I misunderstood.
[...] What would you need the bsearch for?
To find the tablespace boundaries in the sorted buffer array in
log(NBuffers) * Ntablespace, instead of NBuffers.
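For illustration, a standalone sketch of that boundary search over a buffer
array sorted with the tablespace as the leading key (toy types, not the
patch's actual structures):

#include <stdio.h>

/* toy buffer tag: only the fields that matter for the boundary search */
typedef struct
{
    int     spcid;      /* leading sort key: tablespace */
    int     blocknum;
} buftag;

/* lower bound: first index in the sorted array with spcid >= key,
 * i.e. the start of that tablespace's run -- O(log n) per lookup */
static int
spc_lower_bound(const buftag *a, int n, int key)
{
    int     lo = 0, hi = n;

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (a[mid].spcid < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    buftag  sorted[] = {{1, 0}, {1, 3}, {1, 7}, {4, 1}, {4, 2}, {9, 5}};
    int     n = 6;

    /* boundaries: tablespace 4 spans [3, 5) in the sorted array */
    printf("spc 4: [%d, %d)\n",
           spc_lower_bound(sorted, n, 4),
           spc_lower_bound(sorted, n, 5));
    return 0;
}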
I think that the current approach is ok as the number of tablespaces
should be small.Right that's often the case.
Yep.
ISTM that using a tablespace in the sorting would reduce the complexity to
ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
memory and code complexity.Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?
Hmmm. You are proposing to replace proved and heavily tested code with a
more complex tree data structure distributed quite differently around the
source, with no very clear benefit.
So I would prefer to keep the code as is, which is pretty straightforward,
and wait for a strong incentive before doing anything fancier. ISTM that
there are other places in pg that need attention more than further tweaking
of this patch.
--
Fabien.
On 2015-09-09 21:29:12 +0200, Fabien COELHO wrote:
Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.

It would replace what is currently an array.
It'd still be one afterwards.
The balancing code needs to enumerate all tablespaces in a round-robin
way so as to ensure that all tablespaces are given some attention,
otherwise you could have a balance on two tablespaces but others could
be left out. The array makes this property straightforward.
Why would a heap as I've described it require that?
Anyway, I think it would complicate the code significantly (compared to the
straightforward array)

I doubt it. I mean instead of your GetNext you'd just do:
next_tblspc = DatumGetPointer(binaryheap_first(heap));
if (next_tblspc == NULL)
    return 0;
next_tblspc->progress += next_tblspc->progress_slice;
binaryheap_replace_first(heap, PointerGetDatum(next_tblspc));
return next_tblspc->nextbuf++;
Compared to the array, this tree approach would require ln(Ntablespace) work
to extract and reinsert the tablespace under progress, so there is no
complexity advantage.
extract/reinsert is actually O(1).
progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.

If there are many buffers, I'm not too sure about rounding issues and the
like, so the current approach with a rational seems more secure.
Meh. The amount of imbalance introduced by rounding won't matter.
ISTM that using a tablespace in the sorting would reduce the complexity to
ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
memory and code complexity.Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?

Hmmm. You are proposing to replace proved and heavily tested code with a
more complex tree data structure distributed quite differently around the
source, with no very clear benefit.
There's no "proved and heavily tested code" touched here.
So I would prefer to keep the code as is, that is pretty straightforward,
and wait for a strong incentive before doing anything fancier.
I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.
Greetings,
Andres Freund
As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise seems much more reasonable than on Linux, and should be
profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?
Yep, I agree that this part is a bad point, obviously, but at least there
is also some advantage: I understood that buffers are actually pushed
towards the disk when calling posix_fadvise with DONTNEED on FreeBSD, so
in-buffer tests should see better performance, and out-of-buffer in-memory
tests would probably be degraded, as Amit's tests showed on Linux. As an
admin, I can choose if I know whether I run in-buffer or not.
On Linux either the call is ignored (if the page is not written yet) or
the page is coldly removed, so it has either no effect or a bad effect,
basically.
So I think that the current off default when running with posix_fadvise is
reasonable, and in some cases turning it on can probably provide better
performance stability, especially for in-buffer runs.
Now, frankly I do not care much about FreeBSD or Windows, so I'm fine
with dropping posix_fadvise if this is a blocker.
--
Fabien.
It would replace what is currently an array.
It'd still be one afterwards.
[...]
extract/reinsert is actually O(1).
Hm, strange. I probably did not understand at all the heap structure
you're suggesting. No big deal.
[...] Why would a heap as I've described it require that?
Hmmm... The heap does *not* require anything, the *balancing* requires
this property.
[...] There's no "proved and heavily tested code" touched here.
I've proved and heavily tested the submitted patch based on an array,
which you want to replace with some heap, so I think that my point stands.
Moreover, I do not see a clear benefit in changing the data structure.
So I would prefer to keep the code as is, which is pretty straightforward,
and wait for a strong incentive before doing anything fancier.

I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.
No big deal. From my point of view, the data structure change you're
suggesting does not bring significant value, so there is no good reason to
do it.
If you want to submit another patch, this is free software; please
proceed.
--
Fabien.
On Wed, Sep 9, 2015 at 2:31 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Amit,
I think that we may conclude, from these runs:
(1) sorting seems not to harm performance, and may help a lot.
I agree with the first part, but about helping a lot, I am not sure
I'm focusing on the "sort" dimension alone, that is, I'm comparing the
average tps performance with sorting against the same test without sorting:
There are 4 cases from your tests, if I'm not mistaken:
- T1 flush=off 27480 -> 27482 : +0.0%
- T1 flush=on 25214 -> 26819 : +6.3%
- T2 flush=off 5050 -> 6194 : +22.6%
- T2 flush=on 2771 -> 6110 : +120.4%
There is a clear win only in cases when sort is used with flush; apart
from that, using sort alone doesn't show any clear advantage.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 9, 2015 at 12:12 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise seems much more reasonable than on Linux, and should be
profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?
It seems kind of silly that it means "No one, even people I am not aware of
and have no right to speak for, needs this" as opposed to "I don't need
this, don't keep it around on my behalf."
Cheers,
Jeff
Hello Amit,
- T1 flush=off 27480 -> 27482 : +0.0%
- T1 flush=on 25214 -> 26819 : +6.3%
- T2 flush=off 5050 -> 6194 : +22.6%
- T2 flush=on 2771 -> 6110 : +120.4%
There is a clear win only in cases when sort is used with flush, apart
from that using sort alone doesn't have any clear advantage.
Indeed, I agree that the improvement is much smaller without flushing,
although it is there somehow (+0.0 & +22.6 => +11.3% on average).
Well, at least we may agree that it is "somehow significantly better" ?:-)
--
Fabien.
Thanks for the hints! Two-part v12 attached fixes these.
Here is a v13, which is just a rebase after 1aba62ec.
--
Fabien.
Attachments:
checkpoint-continuous-flush-13a.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9e7bcf5..2ef21fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2457,6 +2457,28 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+ <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_sort</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to sort buffers before writing them out to disk on checkpoint.
+ For HDD storage, this setting allows neighboring pages written to disk
+ to be grouped together, thus improving performance by
+ reducing random write activity.
+ This sorting should have limited performance effects on SSD backends
+ as such storage has good random write performance, but it may
+ help with wear-leveling, so it may be worth keeping anyway.
+ The default is <literal>on</>.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ When hard-disk drives (HDD) are used for the data storage,
+ <xref linkend="guc-checkpoint-sort"> allows pages to be sorted
+ so that neighboring pages on disk will be flushed together by
+ checkpoints, reducing the random write load and improving performance.
+ If solid-state drives (SSD) are used, sorting pages brings no benefit
+ as their random write I/O performance is good: this feature could then
+ be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+ It is possible that sorting may help with SSD wear leveling, so it may
+ be kept on that account.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 152d4ed..7291447 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7999,11 +7999,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8034,6 +8036,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8052,8 +8058,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8061,6 +8067,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundCpid;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ CheckpointBufferIds = (CheckpointSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+ if (foundDescs || foundBufs || foundCpid)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundCpid);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CheckpointSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8c0358e..09af13b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -75,12 +75,37 @@ typedef struct PrivateRefCountEntry
/* 64 bytes, about the size of a cache line on common systems */
#define REFCOUNT_ARRAY_ENTRIES 8
+/*
+ * Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ * - index: scanning position in CheckpointBufferIds for this tablespace
+ */
+typedef struct TableSpaceCheckpointStatus {
+ Oid space;
+ int num_to_write;
+ int num_written;
+ int index;
+} TableSpaceCheckpointStatus;
+
+/*
+ * Entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
+typedef struct TableSpaceCountEntry {
+ Oid space;
+ int count;
+} TableSpaceCountEntry;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
int effective_io_concurrency = 0;
+bool checkpoint_sort = true;
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -98,6 +123,9 @@ static bool IsForInput;
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+/* array of buffer ids & sort criterion of all buffers to checkpoint */
+CheckpointSortItem *CheckpointBufferIds = NULL;
+
/*
* Backend-Private refcount management:
*
@@ -1622,6 +1650,106 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
}
}
+/* checkpoint buffers comparison */
+static int
+bufcmp(const void * pa, const void * pb)
+{
+ CheckpointSortItem
+ *a = (CheckpointSortItem *) pa,
+ *b = (CheckpointSortItem *) pb;
+
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* same relation, compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* same relation/fork, so same segmented "file"; compare block numbers,
+ * which are mapped to different segments depending on the number.
+ */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Return the next buffer to write, or -1.
+ * This function balances buffers over tablespaces, see comment inside.
+ */
+static int
+NextBufferToWrite(TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+ int *pspace, int num_to_write, int num_written)
+{
+ int space = *pspace, buf_id = -1, index;
+
+ /*
+ * Select a tablespace depending on the current overall progress.
+ *
+ * The progress ratio of each unfinished tablespace is compared to
+ * the overall progress ratio to find one which is not in advance
+ * (i.e. overall ratio >= tablespace ratio,
+ * i.e. tablespace written/to_write <= overall written/to_write).
+ *
+ * Existence: it is bound to exist otherwise the overall progress
+ * ratio would be inconsistent: with positive buffers to write (t1 & t2)
+ * and already written buffers (w1 & w2), we have:
+ *
+ * If w1/t1 > (w1+w2)/(t1+t2) # one table space is in advance
+ * => w1t1+w1t2 > w1t1+w2t1 => w1t2 > w2t1 => w1t2+w2t2 > w2t1+w2t2
+ * => (w1+w2) / (t1+t2) > w2 / t2 # the other one is late
+ *
+ * The round robin ensures that each space is given some attention
+ * till it is over the current ratio, before going to the next.
+ *
+ * Precision: using int32 computations for comparing fractions
+ * (w1 / t1 > w / t <=> w1 t > w t1) seems a bad idea as the values
+ * can overflow 32-bit integers: the limit would be sqrt(2**31) ~
+ * 46340 buffers, i.e. a 362 MB checkpoint. So ensure that 64-bit
+ * integers are used in the comparison.
+ */
+ while ((int64) spcStatus[space].num_written * num_to_write >
+ (int64) num_written * spcStatus[space].num_to_write)
+ space = (space + 1) % nb_spaces; /* round robin */
+
+ /*
+ * Find a valid buffer in the selected tablespace,
+ * by continuing the tablespace specific buffer scan
+ * where it was left.
+ */
+ index = spcStatus[space].index;
+
+ while (index < num_to_write && buf_id == -1)
+ {
+ volatile BufferDesc *bufHdr;
+
+ buf_id = CheckpointBufferIds[index].buf_id;
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ /*
+ * Skip if in another tablespace or not in checkpoint anymore.
+ * No lock is acquired, see comments below.
+ */
+ if (spcStatus[space].space != bufHdr->tag.rnode.spcNode ||
+ ! (bufHdr->flags & BM_CHECKPOINT_NEEDED))
+ {
+ index ++;
+ buf_id = -1;
+ }
+ }
+
+ /* update tablespace writing status, will start over at next index */
+ spcStatus[space].index = index + 1;
+
+ *pspace = space;
+
+ return buf_id;
+}
+
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
@@ -1635,11 +1763,13 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
static void
BufferSync(int flags)
{
- int buf_id;
- int num_to_scan;
+ int buf_id = -1;
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ HTAB *spcBuffers;
+ TableSpaceCheckpointStatus *spcStatus = NULL;
+ int nb_spaces, space;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1670,6 +1800,18 @@ BufferSync(int flags)
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0;
+
+ /* initialize oid -> int buffer count hash table */
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(HASHCTL));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(TableSpaceCountEntry);
+ spcBuffers = hash_create("Number of buffers to write per tablespace",
+ 16, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1682,32 +1824,99 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write].buf_id = buf_id;
+ CheckpointBufferIds[num_to_write].relNode = bufHdr->tag.rnode.relNode;
+ CheckpointBufferIds[num_to_write].forkNum = bufHdr->tag.forkNum;
+ CheckpointBufferIds[num_to_write].blockNum = bufHdr->tag.blockNum;
num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found)
+ entry->count++;
+ else
+ entry->count = 1;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
+ {
+ hash_destroy(spcBuffers);
return; /* nothing to do */
+ }
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ /* Build checkpoint tablespace buffer status */
+ nb_spaces = hash_get_num_entries(spcBuffers);
+ spcStatus = (TableSpaceCheckpointStatus *)
+ palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+
+ {
+ int index = 0;
+ HASH_SEQ_STATUS hseq;
+ TableSpaceCountEntry * entry;
+
+ hash_seq_init(&hseq, spcBuffers);
+ while ((entry = (TableSpaceCountEntry *) hash_seq_search(&hseq)))
+ {
+ Assert(index < nb_spaces);
+ spcStatus[index].space = entry->space;
+ spcStatus[index].num_to_write = entry->count;
+ spcStatus[index].num_written = 0;
+ /* should it be randomized? chosen with some criterion? */
+ spcStatus[index].index = 0;
+
+ index ++;
+ }
+ }
+
+ hash_destroy(spcBuffers);
+ spcBuffers = NULL;
+
+ /* sort buffer ids to help find sequential writes */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+
+ if (checkpoint_sort)
+ {
+ qsort(CheckpointBufferIds, num_to_write, sizeof(CheckpointSortItem),
+ bufcmp);
+ }
+
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
+ * Loop over buffers to write through CheckpointBufferIds,
+ * and write the ones (still) marked with BM_CHECKPOINT_NEEDED,
+ * with some round robin over table spaces so as to balance writes,
+ * so that buffer writes move forward roughly proportionally for each
+ * tablespace.
*
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Termination: if a tablespace is selected by the inner while loop
+ * (see argument there), its index is incremented and will eventually
+ * reach num_to_write, mark this table space scanning as done and
+ * decrement the number of (active) spaces, which will thus reach 0.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ space = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (nb_spaces != 0)
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ buf_id = NextBufferToWrite(spcStatus, nb_spaces, &space,
+ num_to_write, num_written);
+ if (buf_id != -1)
+ bufHdr = GetBufferDescriptor(buf_id);
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1721,39 +1930,46 @@ BufferSync(int flags)
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
- if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+ if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
+ spcStatus[space].num_written++;
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write);
}
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Detect checkpoint end for a tablespace: either the scan is done
+ * or all tablespace buffers have been written out. If so, another
+ * active tablespace status is moved into the place of the current
+ * one and the next round will start on this one, or maybe round about.
+ *
+ * Note: maybe an exchange could be made instead in order to keep
+ * information about the closed tablespace, but this is currently
+ * not used afterwards.
+ */
+ if (spcStatus[space].index >= num_to_write ||
+ spcStatus[space].num_written >= spcStatus[space].num_to_write)
+ {
+ nb_spaces--;
+ if (space != nb_spaces)
+ spcStatus[space] = spcStatus[nb_spaces];
+ else
+ space = 0;
+ }
}
+ pfree(spcStatus);
+ spcStatus = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8ebf424..1cd2aa0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1012,6 +1012,17 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Whether disk-page buffers are sorted on checkpoints."),
+ NULL
+ },
+ &checkpoint_sort,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c65287..8020c1c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -201,6 +201,7 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_sort = on # sort buffers on checkpoint
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 790ca66..11815a8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..32f2006 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,23 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as little as possible. Maybe the sort criterion could be compacted
+ * to reduce memory requirement and for faster comparison?
+ */
+typedef struct CheckpointSortItem {
+ int buf_id;
+ Oid relNode;
+ ForkNumber forkNum; /* hm... enum with only 4 values */
+ BlockNumber blockNum;
+} CheckpointSortItem;
+
+extern CheckpointSortItem *CheckpointBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0f59201..b56802b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,7 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern bool checkpoint_sort;
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
checkpoint-continuous-flush-13b.patch (text/x-diff)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2ef21fb..356aed4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2500,6 +2500,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help smooth
+ disk I/O writes and avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-min-wal-size" xreflabel="min_wal_size">
<term><varname>min_wal_size</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f538698..1b658f2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -558,6 +558,18 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting to the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput on some OSes. It should be beneficial for high write
+ loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+ write latency is small on such hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..2a8f645 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, false, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..4b5e9cd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..ea7a45d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, false, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..b700efb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, false, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3b3a09e..e361907 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -665,7 +665,8 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -700,6 +701,26 @@ CheckpointWriteDelay(int flags, double progress)
*/
pgstat_send_bgwriter();
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Before sleeping, flush written blocks for each tablespace.
+ */
+ if (checkpoint_flush_to_disk)
+ {
+ int i;
+
+ for (i = 0; i < ctx_size; i++)
+ {
+ if (context[i].ncalls != 0)
+ {
+ PerformFileFlush(&context[i]);
+ ResetFileFlushContext(&context[i]);
+ }
+ }
+ }
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 09af13b..deacec1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,8 @@ int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
int effective_io_concurrency = 0;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
bool checkpoint_sort = true;
/*
@@ -427,7 +429,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ bool flush_to_disk, FileFlushContext *context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -440,7 +443,8 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ bool flush_to_disk, FileFlushContext *context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
@@ -1107,7 +1111,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
- FlushBuffer(buf, NULL);
+ FlushBuffer(buf, NULL, false, NULL);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1770,6 +1774,7 @@ BufferSync(int flags)
HTAB *spcBuffers;
TableSpaceCheckpointStatus *spcStatus = NULL;
int nb_spaces, space;
+ FileFlushContext * spcContext = NULL;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1857,10 +1862,12 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
- /* Build checkpoint tablespace buffer status */
+ /* Build checkpoint tablespace buffer status & flush context arrays */
nb_spaces = hash_get_num_entries(spcBuffers);
spcStatus = (TableSpaceCheckpointStatus *)
palloc(sizeof(TableSpaceCheckpointStatus) * nb_spaces);
+ spcContext = (FileFlushContext *)
+ palloc(sizeof(FileFlushContext) * nb_spaces);
{
int index = 0;
@@ -1877,6 +1884,12 @@ BufferSync(int flags)
/* should it be randomized? chosen with some criterion? */
spcStatus[index].index = 0;
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ ResetFileFlushContext(&spcContext[index]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
index ++;
}
}
@@ -1932,7 +1945,8 @@ BufferSync(int flags)
*/
if (bufHdr != NULL && bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, checkpoint_flush_to_disk,
+ &spcContext[space]) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1942,7 +1956,8 @@ BufferSync(int flags)
/*
* Sleep to throttle our I/O rate.
*/
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+ CheckpointWriteDelay(flags, (double) num_written / num_to_write,
+ spcContext, nb_spaces);
}
}
@@ -1959,6 +1974,13 @@ BufferSync(int flags)
if (spcStatus[space].index >= num_to_write ||
spcStatus[space].num_written >= spcStatus[space].num_to_write)
{
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ PerformFileFlush(&spcContext[space]);
+ ResetFileFlushContext(&spcContext[space]);
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
nb_spaces--;
if (space != nb_spaces)
spcStatus[space] = spcStatus[nb_spaces];
@@ -1969,6 +1991,8 @@ BufferSync(int flags)
pfree(spcStatus);
spcStatus = NULL;
+ pfree(spcContext);
+ spcContext = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -2216,7 +2240,7 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state = SyncOneBuffer(next_to_clean, true, false, NULL);
if (++next_to_clean >= NBuffers)
{
@@ -2293,7 +2317,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk,
+ FileFlushContext * context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2334,7 +2359,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_to_disk, context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2596,9 +2621,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter tries to hint the OS that a high priority write is meant,
+ * possibly because io-throttling is already managed elsewhere.
+ * The last parameter holds the current flush context that accumulates flush
+ * requests to be performed in one call, instead of being performed on a buffer
+ * per buffer basis.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk,
+ FileFlushContext * context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2687,7 +2719,9 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_to_disk,
+ context);
if (track_io_timing)
{
@@ -3109,7 +3143,9 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -3143,7 +3179,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3195,7 +3231,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, false, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..114a0a6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..fb3b383 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,7 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite, false, NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..e880a9e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1344,8 +1344,97 @@ retry:
return returnCode;
}
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+void
+ResetFileFlushContext(FileFlushContext * context)
+{
+ context->fd = 0;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+ context->filename = NULL;
+}
+
+void
+PerformFileFlush(FileFlushContext * context)
+{
+ if (context->ncalls != 0)
+ {
+ int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+ /*
+ * Linux: tell the memory manager to move these blocks to io so
+ * that they are considered for being actually written to disk.
+ */
+ rc = sync_file_range(context->fd, context->offset, context->nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Others: say that data should not be kept in memory...
+ * This is not exactly what we want to say, because we want to write
+ * the data for durability but we may need it later nevertheless.
+ * It seems that Linux would free the memory *if* the data has
+ * already been written to disk, else the "dontneed" call is ignored.
+ * For FreeBSD this may have the desired effect of moving the
+ * data to the io layer, although the system does not seem to
+ * take into account the provided offset & size, so it is rather
+ * rough...
+ */
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+#endif
+
+ if (rc < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not flush block " INT64_FORMAT
+ " on " INT64_FORMAT " blocks in file \"%s\": %m",
+ context->offset / BLCKSZ,
+ context->nbytes / BLCKSZ,
+ context->filename)));
+ }
+}
+
+void
+FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename)
+{
+ if (context->ncalls != 0 && context->fd == fd)
+ {
+ /* same file: merge current flush with previous ones */
+ off_t new_offset = offset < context->offset? offset: context->offset;
+
+ context->nbytes =
+ (context->offset + context->nbytes > offset + nbytes ?
+ context->offset + context->nbytes : offset + nbytes) -
+ new_offset;
+ context->offset = new_offset;
+ context->ncalls ++;
+ }
+ else
+ {
+ /* other file: do flush previous file & reset flush accumulator */
+ PerformFileFlush(context);
+
+ context->fd = fd;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ context->filename = filename;
+ }
+}
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context)
{
int returnCode;
@@ -1395,6 +1484,28 @@ retry:
if (returnCode >= 0)
{
+
+#if defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE)
+
+ /*
+ * Calling "write" tells the OS that pg wants to write some page to disk,
+ * however, when it is actually done is chosen by the OS.
+ * Depending on other disk activities this may be delayed significantly,
+ * maybe up to an "fsync" call, which could induce an IO write surge.
+ * When checkpointing pg is doing its own throttling and the result
+ * should really be written to disk with high priority, so as to meet
+ * the completion target.
+ * This call hints that such writes have a higher priority.
+ */
+ if (flush_to_disk && returnCode == amount && errno == 0)
+ {
+ FileAsynchronousFlush(context,
+ VfdCache[file].fd, VfdCache[file].seekPos,
+ amount, VfdCache[file].fileName);
+ }
+
+#endif /* HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE */
+
VfdCache[file].seekPos += returnCode;
/* maintain fileSize and temporary_files_size if it's a temp file */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..dbf057f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, false, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +768,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_to_disk, context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..2db3cd3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext *context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync, flush_to_disk, context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1cd2aa0..95deb71 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -158,6 +158,7 @@ static bool check_bonjour(bool *newval, void **extra, GucSource source);
static bool check_ssl(bool *newval, void **extra, GucSource source);
static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
static bool check_log_stats(bool *newval, void **extra, GucSource source);
+static bool check_flush_to_disk(bool *newval, void **extra, GucSource source);
static bool check_canonical_path(char **newval, void **extra, GucSource source);
static bool check_timezone_abbreviations(char **newval, void **extra, GucSource source);
static void assign_timezone_abbreviations(const char *newval, void *extra);
@@ -1024,6 +1025,17 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ check_flush_to_disk, NULL, NULL
+ },
+
+ {
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
NULL
@@ -9805,6 +9817,21 @@ check_log_stats(bool *newval, void **extra, GucSource source)
}
static bool
+check_flush_to_disk(bool *newval, void **extra, GucSource source)
+{
+/* This test must be consistent with the one in FileWrite (storage/file/fd.c)
+ */
+#if ! (defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE))
+ /* just warn if it has no effect */
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Setting \"checkpoint_flush_to_disk\" has no effect "
+ "on this platform.")));
+#endif /* ! (HAVE_SYNC_FILE_RANGE || HAVE_POSIX_FADVISE) */
+ return true;
+}
+
+static bool
check_canonical_path(char **newval, void **extra, GucSource source)
{
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8020c1c..e4cf2a1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_sort = on # sort buffers on checkpoint
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index a49c208..f9c8ca1 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -16,6 +16,7 @@
#define _BGWRITER_H
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -29,7 +30,8 @@ extern void BackgroundWriterMain(void) pg_attribute_noreturn();
extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointWriteDelay(int flags, double progress,
+ FileFlushContext * context, int ctx_size);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b56802b..cd9d130 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -54,6 +54,14 @@ extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
extern bool checkpoint_sort;
/* in buf_init.c */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..c7b2a6d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,24 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FileFlushContext structure:
+ *
+ * This is used to accumulate several flush requests on a file
+ * into a larger flush request.
+ * - fd: file descriptor of the file
+ * - ncalls: number of flushes merged together
+ * - offset: starting offset (minimum of all offsets)
+ * - nbytes: size (minimum extent to cover all flushed data)
+ * - filename: filename of fd for error messages
+ */
+typedef struct FileFlushContext {
+ int fd;
+ int ncalls;
+ off_t offset;
+ off_t nbytes;
+ char * filename;
+} FileFlushContext;
/*
* prototypes for functions in fd.c
@@ -70,7 +88,12 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern void ResetFileFlushContext(FileFlushContext * context);
+extern void PerformFileFlush(FileFlushContext * context);
+extern void FileAsynchronousFlush(FileFlushContext * context,
+ int fd, off_t offset, off_t nbytes, char * filename);
+extern int FileWrite(File file, char *buffer, int amount, bool flush_to_disk,
+ FileFlushContext * context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..a46a70c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ bool flush_to_disk, FileFlushContext * context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -120,8 +122,9 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer, bool skipFsync, bool flush_to_disk,
+ FileFlushContext * context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
Hello,
[...] If you make the sorting criterion include the tablespace id you
wouldn't need the lookahead loop in NextBufferToWrite().
I'm considering this precise point, i.e. including the tablespace as
a sorting criterion.
Currently the array used for sorting is 16 bytes per buffer (although I
wrote 12 in another mail, I was wrong...). The data include the buffer id (4
bytes), the relation & fork number (8 bytes, of which really only 4 bytes + 2
bits are used), and the block number (4 bytes), which is the offset within the
relation. These 3 combined allow finding the file, and the offset
within that file, for the given buffer id.
I'm concerned that these 16 bytes are already significant and I do not
want to extend them any more. I was already pretty happy with the previous
version with 4 bytes per buffer.
Now, as the number of tablespaces is expected to be very small (1, 2, maybe
3), there is no problem packing it within the unused 30 bits of forkNum.
That would mean some masking and casts here and there, so it would not be
very beautiful, but it would make it easy to find the buffers for a given
tablespace, and indeed remove the lookahead stuff in the next-buffer
function, as you suggest.
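To illustrate, a minimal sketch of what such packing could look like; the type
and macro names are hypothetical, and the layout (forkNum in the low 2 bits,
a small per-checkpoint tablespace index above it) is an assumption for
illustration, not the patch's actual representation:

/* Hypothetical packed field: the fork number fits in 2 bits (4 values),
 * and a small tablespace index is stored above it.  Sketch only; uses
 * PostgreSQL's uint32 and ForkNumber types. */
typedef uint32 PackedForkSpc;

#define PACK_FORK_SPC(forkNum, spcIndex) \
	((PackedForkSpc) ((((uint32) (spcIndex)) << 2) | (((uint32) (forkNum)) & 0x3)))
#define PACKED_FORK(p)	((ForkNumber) ((p) & 0x3))
#define PACKED_SPC(p)	((int) ((p) >> 2))

Using the tablespace index as the leading sort criterion (before relNode,
forkNum and blockNum) would make each tablespace's buffers contiguous in
CheckpointBufferIds, so a per-tablespace scan would need no lookahead loop.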
My question is: would that be acceptable, or would someone object to the
use of masks and things like that? The benefit would be a simpler/more
direct next buffer function, but some more tinkering around the sorting
criterion to use a packed representation.
Note that I do not think that it would have any actual impact on
performance... it would only make a difference if there were really many
tablespaces (the scanning complexity would be Nbuffer instead of
Nbuffer*Ntablespace, but as Ntablespace is small...). My motivation is
rather to help the patch get through, so I'm fine if this is not needed.
--
Fabien.
Hi,
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
Thanks for the hints! Two-part v12 attached fixes these.
Here is a v13, which is just a rebase after 1aba62ec.
I'm working on this patch, to get it into a state I think it'd be
committable.
In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spiky IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it is only triggered while on schedule, and not when
behind. My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
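As a rough sketch of how such a cap could combine with the v13 helpers (the
wrapper name and the exact constant are illustrative, not the actual change):

/* Sketch only: issue the flush early once the accumulated range reaches
 * a fixed cap, instead of waiting for a segment boundary or for
 * CheckpointWriteDelay().  32 buffers * 8 kB pages = 256 kB. */
#define FLUSH_LIMIT_BUFFERS 32

static void
FileAsynchronousFlushCapped(FileFlushContext *context,
							int fd, off_t offset, off_t nbytes, char *filename)
{
	/* merge this range into the context, as in the v13 patch */
	FileAsynchronousFlush(context, fd, offset, nbytes, filename);

	/* cap reached: push the accumulated range to the IO layer right away */
	if (context->nbytes >= (off_t) FLUSH_LIMIT_BUFFERS * BLCKSZ)
	{
		PerformFileFlush(context);
		ResetFileFlushContext(context);
	}
}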
I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().
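For concreteness, a hedged sketch of that approach (minimal error handling,
assumes a page-aligned offset; not code from any posted patch, and whether
MS_ASYNC actually starts writeback is platform-dependent):

#include <sys/mman.h>

/* Sketch: ask the kernel to start writeback of a dirty file range via
 * mmap + msync(MS_ASYNC), as a portable stand-in for
 * sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE). */
static int
flush_range_msync(int fd, off_t offset, size_t nbytes)
{
	void	   *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

	if (p == MAP_FAILED)
		return -1;
	if (msync(p, nbytes, MS_ASYNC) < 0)
	{
		(void) munmap(p, nbytes);
		return -1;
	}
	return munmap(p, nbytes);
}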
As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic
doesn't work correctly if tablespaces aren't actually sorted. I'm
actually inclined to fix this by simply removing the flag to
enable/disable sorting.
Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.
I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.
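A sketch of how lib/binaryheap.h could drive the tablespace balancing; the
comparator and loop shape are a guess at the idea, not Andres's actual code:

#include "postgres.h"
#include "lib/binaryheap.h"

/* Keyed on per-tablespace progress; the comparator result is inverted so
 * the max-heap always yields the least-advanced tablespace first.
 * Cross products avoid divisions and cannot overflow in 64 bits. */
static int
ts_progress_cmp(Datum a, Datum b, void *arg)
{
	TableSpaceCheckpointStatus *sa = (TableSpaceCheckpointStatus *) DatumGetPointer(a);
	TableSpaceCheckpointStatus *sb = (TableSpaceCheckpointStatus *) DatumGetPointer(b);
	int64		pa = (int64) sa->num_written * sb->num_to_write;
	int64		pb = (int64) sb->num_written * sa->num_to_write;

	return pa < pb ? 1 : (pa > pb ? -1 : 0);
}

static void
write_balanced(TableSpaceCheckpointStatus *spcStatus, int nb_spaces)
{
	binaryheap *heap = binaryheap_allocate(nb_spaces, ts_progress_cmp, NULL);
	int			i;

	for (i = 0; i < nb_spaces; i++)
		binaryheap_add_unordered(heap, PointerGetDatum(&spcStatus[i]));
	binaryheap_build(heap);

	while (!binaryheap_empty(heap))
	{
		TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
			DatumGetPointer(binaryheap_first(heap));

		/* ... write ts's next buffer here, then account for it ... */
		ts->num_written++;

		if (ts->num_written >= ts->num_to_write)
			binaryheap_remove_first(heap);	/* tablespace done */
		else
			binaryheap_replace_first(heap, PointerGetDatum(ts));	/* re-sift */
	}
}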
I'll post this after some more cleanup & testing.
I've also noticed that the sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers are written out a
second, 100ms is a rather long sleep. For another, the fact that we only
sleep 100ms when the write rate is low makes the checkpoint finish rather
quickly - on a slow disk (say microsd) that can cause unnecessary
slowdowns for concurrent activity. ISTM we should calculate the sleep
time in a better way. The SIGHUP behaviour is also weird. Anyway, this
probably belongs on a new thread.
Greetings,
Andres Freund
Hello Andres,
Here is a v13, which is just a rebase after 1aba62ec.
I'm working on this patch, to get it into a state I think it'd be
committable.
I'll review it carefully. Also, if you can include some performance
figures it would help, even if I'll do some more runs.
In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it only is triggered while on schedule, and not when
behind.
When behind, the PerformFileFlush should be called on segment boundaries.
The idea was not to go to sleep without flushing, and to do it as little
as possible.
My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
Hmmm. 32 buffers means 256 KB, which is quite small. Not sure what a good
"limit" would be. It could depend whether pages are close or not.
I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().
I'm not sure that mmap/msync can be used for this purpose, because there
seems to be no real control over where the file is mmapped.
As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic
doesn't work correctly if tablespaces aren't actually sorted. I'm
actually inclined to fix this by simply removing the flag to
enable/disable sorting.
I do not think that there is a significant downside to having sort always
on, but showing that requires being able to test it, hence the guc. The
point of the guc is to demonstrate that the feature is harmless :-)
Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.
I agree that it is pretty ugly, but I do not think that you can remove
them all. You need at least one for checking the guc and one for enabling
the feature. Maybe their number could be reduced if the functions are
switched to do-nothing stubs which are called nevertheless, but I was not
keen on leaving unused code around when there is neither sync_file_range
nor posix_fadvise.
I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.
Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3
element set is an improvement in most cases.
I'll post this after some more cleanup & testing.
I'll have a look when it is ready.
I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers a written out a
second 100ms is a rather long sleep. For another that we only sleep
100ms when the write rate is low makes the checkpoint finish rather
quickly - on a slow disk (say microsd) that can cause unneccesary
slowdowns for concurrent activity. ISTM we should calculate the sleep
time in a better way.
I also noted this point, but I'm not sure how to find a better approach,
so I left it as it is. I tried 50 ms & 200 ms on some runs, without
significant effect on performance for the test I ran then. The point of
having not too small a value is that it provides some significant work to
the IO subsystem without overflowing it. On average it does not matter.
I'm unsure how it would interact with flushing, so I decided not to do
anything about it. Maybe it should be a guc, but I would not know how to
choose it.
The SIGHUP behaviour is also weird. Anyway, this probably belongs on a
new thread.
Probably. I did not try to look at that.
--
Fabien.
On Mon, Oct 19, 2015 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().
I think this idea is worth exploring, especially because we can have a
Windows equivalent for this optimisation. Could this option by any
chance lead to an increase in memory usage, as mmap has to
map the file(s)?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-10-19 21:14:55 +0200, Fabien COELHO wrote:
In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it only is triggered while on schedule, and not when
behind.
When behind, the PerformFileFlush should be called on segment
boundaries.
That means it's flushing up to a gigabyte of data at once. Far too
much. The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.
My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
Hmmm. 32 buffers means 256 KB, which is quite small.
Why? The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small. If you flush much more
sync_file_range starts to do work in the foreground.
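(For illustration, a minimal sketch of such a cap - names and state are
made up, not the patch's actual FileFlushContext API: coalesce contiguous
flush requests and issue sync_file_range() once the pending range reaches
32 buffers.)

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>              /* sync_file_range(), Linux-specific */

    #define BLCKSZ 8192             /* PostgreSQL's default page size */
    #define MAX_FLUSH_BUFFERS 32    /* ~256 KB of coalesced writes */

    static int   pending_fd = -1;
    static off_t pending_offset;
    static off_t pending_nbytes;

    static void
    flush_issue(void)
    {
        if (pending_fd >= 0 && pending_nbytes > 0)
            (void) sync_file_range(pending_fd, pending_offset,
                                   pending_nbytes, SYNC_FILE_RANGE_WRITE);
        pending_fd = -1;
        pending_nbytes = 0;
    }

    static void
    flush_schedule(int fd, off_t offset, off_t nbytes)
    {
        /* extend the pending range if the new write is contiguous */
        if (fd == pending_fd && offset == pending_offset + pending_nbytes)
            pending_nbytes += nbytes;
        else
        {
            flush_issue();
            pending_fd = fd;
            pending_offset = offset;
            pending_nbytes = nbytes;
        }

        /* cap the coalesced range so the request queue isn't overwhelmed */
        if (pending_nbytes >= (off_t) MAX_FLUSH_BUFFERS * BLCKSZ)
            flush_issue();
    }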
I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().
I'm not sure that mmap/msync can be used for this purpose, because there
is no real control, it seems, over where the file is mmapped.
I'm not following? Why does it matter where a file is mapped?
I have had a friend (Christian Kruse, thanks!) confirm that at least on
OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
that should be the case on freebsd too.
Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.
I agree that it is pretty ugly, but I do not think that you can remove
them all.
Sure, never said all. But most.
I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.
Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3
element set in most cases is an improvement.
Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walks the array
multiple times. And I did see setups with ~15 tablespaces.
I've also noticed that the sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers are written out a
second, 100ms is a rather long sleep. For another, because we only sleep
up to 100ms even when the write rate is low, the checkpoint finishes rather
quickly - on a slow disk (say microsd) that can cause unnecessary
slowdowns for concurrent activity. ISTM we should calculate the sleep
time in a better way.
I also noted this point, but I'm not sure how to have a better approach,
so I left it as it is. I tried 50 ms & 200 ms on some runs, without
significant effect on performance for the test I ran then. The point of
having not too small a value is that it provides some significant work to
the IO subsystem without overflowing it.
I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.
Greetings,
Andres Freund
Hello Andres,
In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spiky IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it only is triggered while on schedule, and not when
behind.
When behind, the PerformFileFlush should be called on segment boundaries.
That means it's flushing up to a gigabyte of data at once. Far too
much.
Hmmm. I do not get it. There would not be gigabytes, there would be as
much as was written since the last sleep, about 100 ms ago, which is not
likely to be gigabytes?
The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.
For me it is important to avoid accumulating too large flushes, and that
is the point of the call before sleeping.
My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
Hmmm. 32 buffers means 256 KB, which is quite small.
Why?
Because the point of sorting is to generate sequential writes so that the
HDD has a lot of aligned stuff to write without moving the head, and 32 is
rather small for that.
The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small.
That is an argument. How small, though? It seems to be 128 by default, so
I'd rather have 128? Also, it can be changed, so maybe it should really be
a guc?
If you flush much more sync_file_range starts to do work in the
foreground.
Argh, too bad. I would have hoped that the OS would just deal with it in
an asynchronous way; this is not a "fsync" call, just a flush advice.
I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().
I'm not sure that mmap/msync can be used for this purpose, because there
is no real control, it seems, over where the file is mmapped.
I'm not following? Why does it matter where a file is mapped?
Because it should be in shared buffers where pg needs it? You probably
do not want to mmap all pg data files in user space for a large
database? Currently the OS keeps the data in memory if it has
enough space, but if you go through mmap, this cache management would
become pg's responsibility, if I understand mmap and your intentions
correctly.
I have had a friend (Christian Kruse, thanks!) confirm that at least on
OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
that should be the case on freebsd too.
Good. My concern is how mmap could be used, though, not the flushing part.
Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3
element set in most cases is an improvement.
Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walks the array
multiple times. And I did see setups with ~15 tablespaces.
ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.
I also noted this point, but I'm not sure how to have a better approach,
so I left it as it is. I tried 50 ms & 200 ms on some runs, without
significant effect on performance for the test I ran then. The point of
having not too small a value is that it provides some significant work to
the IO subsystem without overflowing it.
I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.
It sounds reasonable, but what would be the criterion?
--
Fabien.
On 2015-10-21 07:49:23 +0200, Fabien COELHO wrote:
Hello Andres,
In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spiky IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it only is triggered while on schedule, and not when
behind.
When behind, the PerformFileFlush should be called on segment boundaries.
That means it's flushing up to a gigabyte of data at once. Far too much.
Hmmm. I do not get it. There would not be gigabytes,
I said 'up to a gigabyte' not gigabytes. But it actually can be more
than one if you're unlucky.
there would be as much as was written since the last sleep, about 100
ms ago, which is not likely to be gigabytes?
In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot. And even so, it's pretty easy to get into
checkpoint scenarios with ~500 mbyte/s as a writeout rate. Only issuing
a sync_file_range() 10 times for that is obviously problematic.
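(To put numbers on it: at a 500 MB/s writeout rate, each 100 ms sleep
interval accumulates about 50 MB of dirty data, and a 1 GB segment-boundary
flush at that rate covers roughly two full seconds of writes in a single
call.)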
The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.
For me it is important to avoid accumulating too large flushes, and that
is the point of the call before sleeping.
I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?
My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
Hmmm. 32 buffers means 256 KB, which is quite small.
Why?
Because the point of sorting is to generate sequential writes so that the
HDD has a lot of aligned stuff to write without moving the head, and 32 is
rather small for that.
A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue. You can have merging
between IOs from either side. But more importantly you can't merge that
many requests together anyway.
The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small.
That is an argument. How small, though? It seems to be 128 by default, so
I'd rather have 128? Also, it can be changed, so maybe it should really be
a guc?
I couldn't see any benefits above (and below) 32 on a 20 drive system,
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...
If you flush much more sync_file_range starts to do work in the
foreground.
Argh, too bad. I would have hoped that the OS would just deal with it in
an asynchronous way,
It's even in the man page:
"Note that even this may block if you attempt to write more than
request queue size."
this is not a "fsync" call, just a flush advice.
sync_file_range isn't fadvise().
Because it should be in shared buffers where pg needs it?
Huh? I'm just suggesting p = mmap(fd, offset, bytes); msync(p, bytes);
munmap(p); instead of sync_file_range().
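(A self-contained sketch of that sequence, for illustration; note that
the real mmap() takes the length, protection and flags before the fd,
and that the offset must be page-aligned - which BLCKSZ-aligned file
positions are.)

    #include <sys/mman.h>
    #include <sys/types.h>

    /* Ask the kernel to start writeback of [offset, offset + nbytes) of
     * fd through a transient mapping, instead of sync_file_range(). */
    static int
    flush_via_msync(int fd, off_t offset, size_t nbytes)
    {
        void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, offset);

        if (p == MAP_FAILED)
            return -1;
        if (msync(p, nbytes, MS_ASYNC) != 0)   /* asynchronous writeback */
        {
            (void) munmap(p, nbytes);
            return -1;
        }
        return munmap(p, nbytes);
    }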
Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3
element set in most cases is an improvement.
Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walks the array
multiple times. And I did see setups with ~15 tablespaces.
ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.
I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?
    ts_heap = binaryheap_allocate(nb_spaces,
                                  ts_progress_cmp,
                                  NULL);

    spcContext = (FileFlushContext *)
        palloc(sizeof(FileFlushContext) * nb_spaces);

    for (i = 0; i < nb_spaces; i++)
    {
        TableSpaceCheckpointStatus *spc = &spcStatus[i];

        spc->progress_slice = ((float8) num_to_write) / (float8) spc->num_to_write;

        ResetFileFlushContext(&spcContext[i]);
        spc->flushContext = &spcContext[i];

        binaryheap_add_unordered(ts_heap, PointerGetDatum(&spcStatus[i]));
    }
    binaryheap_build(ts_heap);

and then

    while (!binaryheap_empty(ts_heap))
    {
        TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
            DatumGetPointer(binaryheap_first(ts_heap));
        ...
        ts->progress += ts->progress_slice;
        ts->num_written++;
        ...
        if (ts->num_written == ts->num_to_write)
        {
            ...
            binaryheap_remove_first(ts_heap);
        }
        else
        {
            /* update heap with the new progress */
            binaryheap_replace_first(ts_heap, PointerGetDatum(ts));
        }
I also noted this point, but I'm not sure how to have a better approach,
so I left it as it is. I tried 50 ms & 200 ms on some runs, without
significant effect on performance for the test I ran then. The point of
having not too small a value is that it provides some significant work to
the IO subsystem without overflowing it.
I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.
It sounds reasonable, but what would be the criterion?
What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
(computed via max_wal_size these days). I.e. is the percentage of
used up WAL bigger than the percentage of written out buffers.
2) Are we on schedule based on checkpoint_timeout. I.e. is the
percentage of checkpoint_timeout already passed bigger than the
percentage of buffers written out.
So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.
It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.
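(A rough sketch of that computation, with made-up names, covering only
the timeout-based schedule:)

    /* Illustrative only: sleep just long enough to stay on the
     * checkpoint_timeout schedule, instead of a fixed 100ms nap. */
    static long
    adaptive_sleep_usecs(long usecs_until_deadline, int buffers_remaining)
    {
        long sleep;

        /* behind schedule, or nothing left to pace: don't sleep */
        if (buffers_remaining <= 0 || usecs_until_deadline <= 0)
            return 0;

        /* remaining time budget divided by remaining work items */
        sleep = usecs_until_deadline / buffers_remaining;

        /* never sleep longer than the current fixed 100ms */
        return sleep > 100000 ? 100000 : sleep;
    }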
I'm running benchmarks right now, they'll take a bit to run to
completion.
Greetings,
Andres Freund
Hello Andres,
there would be as much as was written since the last sleep, about 100
ms ago, which is not likely to be gigabytes?
In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot.
I think that "being behind" is not a problem as such, it is really the way
the scheduler has been designed and works, by keeping pace with time &
wall progress by little bursts of writes. If you reduce the sleep time a
lot then it would end up having writes interleaved with small sleeps, but
then this would be bad for performance has the OS would loose the ability
to write much data sequentially on the disk.
It does not mean that the default 100 ms is a good figure, but the "being
behind" is a feature, not an issue as such.
And even so, it's pretty easy to get into checkpoint scenarios with ~500
mbyte/s as a writeout rate.
Hmmmm. Not with my hardware:-)
Only issuing a sync_file_range() 10 times for that is obviously
problematic.
Hmmm. Then it should depend on the expected write capacity of the
underlying disks...
The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.
For me it is important to avoid accumulating too large flushes, and that
is the point of the call before sleeping.
I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?
On my simple test hardware the flushes are not large, I think, so the
problem does not arise. Maybe I should check.
My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
Hmmm. 32 buffers means 256 KB, which is quite small.
Why?
Because the point of sorting is to generate sequential writes so that the
HDD has a lot of aligned stuff to write without moving the head, and 32 is
rather small for that.
A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue.
Yes.
You can have merging between IOs from either side. But more importantly
you can't merge that many requests together anyway.
Probably.
The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small.
That is an argument. How small, though? It seems to be 128 by default, so
I'd rather have 128? Also, it can be changed, so maybe it should really be
a guc?
I couldn't see any benefits above (and below) 32 on a 20 drive system,
So it is one kind of (big) hardware. Assuming that pages are contiguous,
how much is written on each disk depends on the RAID type, the stripe
size, and when it is really written depends on the various caches (in the
RAID HW card if any, on the disk, ...), so whether 32 at the OS level is
the right size is pretty unclear to me. I would have said the larger the
better, but indeed you should avoid blocking.
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...
Sure. Now on my tests, with my (old & little) hardware it seemed quite
smooth. What I'm driving at is that what is good may be relative and
depend on the underlying hardware, which makes it not obvious to choose
the right parameter.
If you flush much more sync_file_range starts to do work in the
foreground.
Argh, too bad. I would have hoped that the OS would just deal with it in
an asynchronous way,
It's even in the man page:
"Note that even this may block if you attempt to write more than
request queue size."
Hmmm. What about choosing "request queue size * 0.5", then?
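(On Linux the per-device queue depth is exposed in sysfs, so such a
heuristic could hypothetically probe it; the device name below is a
placeholder, since mapping the data directory to a device is its own
problem.)

    #include <stdio.h>

    /* Illustrative only: read the block-layer request queue depth,
     * e.g. /sys/block/sda/queue/nr_requests, defaulting to 128. */
    static int
    request_queue_size(const char *dev)
    {
        char  path[256];
        FILE *f;
        int   nr = 128;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/nr_requests", dev);
        if ((f = fopen(path, "r")) != NULL)
        {
            if (fscanf(f, "%d", &nr) != 1)
                nr = 128;
            fclose(f);
        }
        return nr;
    }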
Because it should be in shared buffers where pg needs it?
Huh? I'm just suggesting p = mmap(fd, offset, bytes); msync(p, bytes);
munmap(p); instead of sync_file_range().
I think that I do not really understand how it may work, but possibly it
could.
ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.
I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?
You have not yet posted the updated version of the patch.
The complexity of the round robin scan on the array is O(1) and very few
instructions, plus some stop condition which is mostly true, I think, if
the writes are balanced between tablespaces; there is no dynamic
allocation in the data structure (it is an array). The binary heap is
O(log(n)); probably there are dynamic allocations and frees when
extracting/inserting something, there are function calls to rebalance the
tree, and so on. Ok, "n" is expected to be small.
So basically, for me it is not obviously superior to the previous version.
Now I'm also tired, so if it works reasonably I'll be fine with it.
[... code extract ...]
I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.
It sounds reasonable, but what would be the criterion?
What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
(computed via max_wal_size these days). I.e. is the percentage of
used up WAL bigger than the percentage of written out buffers.
2) Are we on schedule based on checkpoint_timeout. I.e. is the
percentage of checkpoint_timeout already passed bigger than the
percentage of buffers written out.
So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.
See discussion above. ISTM that the "bursts" are a useful feature of the
checkpoint scheduler, especially with sorted buffers & flushes. You want
to provide grouped writes that will be easily written to disk together.
You do not want page writes issued one by one and interleaved with
small sleeps.
It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.
The scheduler stops writing as soon as it has overtaken the progress, so
it should be a very small time, but if you do that you would end up
writing pages one by one, which is not desirable at all.
I'm running benchmarks right now, they'll take a bit to run to
completion.
Good.
I'm looking forward to having a look at the updated version of the patch.
--
Fabien.
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
Here is a v13, which is just a rebase after 1aba62ec.
And here's v14. It's not something entirely ready. A lot of details have
changed, I unfortunately don't remember them all. But there are more
important things than the details of the patch.
I've played *a lot* with this patch. I found a bunch of issues:
1) The FileFlushContext context infrastructure isn't actually
correct. There are two problems: First, using the actual 'fd' number to
reference a to-be-flushed file isn't meaningful. If there are lots
of files open, fds get reused within fd.c. That part is easily enough
fixed by referencing the File instead of the fd. The bigger problem is
that the infrastructure doesn't deal with files being closed. There can,
and it isn't that hard to trigger, be smgr invalidations causing the smgr
handle and thus the file to be closed.
I think this means that the entire flushing infrastructure actually
needs to be hoisted up, onto the smgr/md level.
2) I noticed that sync_file_range() blocked far more often than I'd
expected. Reading the kernel code that turned out to be caused by a
pessimization in the kernel introduced years ago - in many situations
SFR_WRITE waited for the writes. A fix for this will be in the 4.4
kernel.
3) I found that latency wasn't improved much for workloads that are
significantly bigger than shared buffers. The problem here is that
neither bgwriter nor the backends have, so far, done
sync_file_range() calls. That meant that the old problem of having
gigabytes of dirty data that periodically get flushed out still
existed. Having these do flushes mostly attacks that problem.
Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers, flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. That holds even
without the kernel fix from 2), although that fix improves things further.
What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput, up to
30%. The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.
By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there, the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(
My benchmarking suggests that that effect is larger, the shorter the
checkpoint timeout is. That seems to make intuitive sense, given the
above explanation attempt. If the checkpoint takes longer, the clock hand
will almost certainly soon overtake the checkpoint's 'implicit' hand.
I'm not sure if we can really do anything about this problem. While I'm
pretty jet lagged, I still spent a fair amount of time thinking about
it. Seems to suggest that we need to bring back the setting to
enable/disable sorting :(
What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
can later contain multiple elements like checkpoint, bgwriter,
backends, ddl, bulk-writes. That seems better than adding GUCs for
these separately. Then make the flush locations in the patch
configurable using that (a possible form is sketched after this list).
3) I think we should remove the sort timing from the checkpoint logging
before commit. It'll always be pretty short.
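(For point 2, such a list GUC might hypothetically look like this in
postgresql.conf; the name and the accepted values are not settled:)

    # which writers hint the OS to start writeback of their writes early
    #flush_dirty_data = 'checkpoint, bgwriter'  # later: backends, ddl, bulk-writes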
Greetings,
Andres Freund
Attachments:
0001-ckpt-14-andres.patch (text/x-patch; charset=us-ascii)
From dd0868d2c714bf18d34f82db40669b435d4b2ba2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 23 Oct 2015 15:22:04 +0200
Subject: [PATCH] ckpt-14-andres
---
doc/src/sgml/config.sgml | 18 ++
doc/src/sgml/wal.sgml | 12 +
src/backend/access/heap/rewriteheap.c | 2 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 6 +-
src/backend/access/transam/xlog.c | 11 +-
src/backend/storage/buffer/README | 5 -
src/backend/storage/buffer/buf_init.c | 24 +-
src/backend/storage/buffer/bufmgr.c | 365 ++++++++++++++++++++++----
src/backend/storage/buffer/freelist.c | 6 +-
src/backend/storage/buffer/localbuf.c | 3 +-
src/backend/storage/file/buffile.c | 3 +-
src/backend/storage/file/copydir.c | 4 +-
src/backend/storage/file/fd.c | 242 +++++++++++++++--
src/backend/storage/smgr/md.c | 6 +-
src/backend/storage/smgr/smgr.c | 8 +-
src/backend/utils/misc/guc.c | 12 +
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/include/access/xlog.h | 2 +
src/include/storage/buf_internals.h | 18 ++
src/include/storage/bufmgr.h | 8 +
src/include/storage/fd.h | 37 ++-
src/include/storage/smgr.h | 7 +-
src/tools/pgindent/typedefs.list | 1 +
25 files changed, 705 insertions(+), 101 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..7db7ae7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2452,6 +2452,24 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-to-disk" xreflabel="checkpoint_flush_to_disk">
+ <term><varname>checkpoint_flush_to_disk</varname> (<type>bool</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_to_disk</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ When writing data for a checkpoint, hint the underlying OS that the
+ data must be sent to disk as soon as possible. This may help to smooth
+ disk I/O writes and to avoid a stall when fsync is issued at the end of
+ the checkpoint, but it may also reduce average performance.
+ This setting may have no effect on some platforms.
+ The default is <literal>on</> on Linux, <literal>off</> otherwise.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..a4b8d91 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
</para>
<para>
+ On Linux and POSIX platforms, <xref linkend="guc-checkpoint-flush-to-disk">
+ allows hinting the OS that pages written on checkpoints must be flushed
+ to disk quickly. Otherwise, these pages may be kept in cache for some time,
+ inducing a stall later when <literal>fsync</> is called to actually
+ complete the checkpoint. This setting helps to reduce transaction latency,
+ but it may also have a small adverse effect on the average transaction rate
+ at maximum throughput on some OSes. It should be beneficial for high write
+ loads on HDD. This feature probably brings no benefit on SSD, as the I/O
+ write latency is small on such hardware, so it may be disabled there.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..95f086d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -918,7 +918,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
* Note that we deviate from the usual WAL coding practices here,
* check the above "Logical rewrite support" comment for reasoning.
*/
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len, NULL);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..efb3338 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -203,7 +203,7 @@ btbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage, BTREE_METAPAGE);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
+ (char *) metapage, true, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..f8976f1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -315,7 +315,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* overwriting a block we zero-filled before */
smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
- (char *) page, true);
+ (char *) page, true, NULL);
}
pfree(page);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..149c1c4 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -170,7 +170,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
/* Write the page. If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
+ (char *) page, true, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_METAPAGE_BLKNO, page, false);
@@ -180,7 +180,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
+ (char *) page, true, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_ROOT_BLKNO, page, true);
@@ -190,7 +190,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
+ (char *) page, true, NULL);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
SPGIST_NULL_BLKNO, page, true);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08d1682..a40f7d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7980,11 +7980,13 @@ LogCheckpointEnd(bool restartpoint)
sync_secs,
total_secs,
longest_secs,
+ sort_secs,
average_secs;
int write_usecs,
sync_usecs,
total_usecs,
longest_usecs,
+ sort_usecs,
average_usecs;
uint64 average_sync_time;
@@ -8015,6 +8017,10 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_end_t,
&total_secs, &total_usecs);
+ TimestampDifference(CheckpointStats.ckpt_sort_t,
+ CheckpointStats.ckpt_sort_end_t,
+ &sort_secs, &sort_usecs);
+
/*
* Timing values returned from CheckpointStats are in microseconds.
* Convert to the second plus microsecond form that TimestampDifference
@@ -8033,8 +8039,8 @@ LogCheckpointEnd(bool restartpoint)
elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
- "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
- "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+ "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+ " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d kB",
restartpoint ? "restartpoint" : "checkpoint",
CheckpointStats.ckpt_bufs_written,
@@ -8042,6 +8048,7 @@ LogCheckpointEnd(bool restartpoint)
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled,
+ sort_secs, sort_usecs / 1000,
write_secs, write_usecs / 1000,
sync_secs, sync_usecs / 1000,
total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 45c5c83..e33e2ba 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -265,11 +265,6 @@ only needs to take the lock long enough to read the variable value, not
while scanning the buffers. (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)
-During a checkpoint, the writer's strategy must be to write every dirty
-buffer (pinned or not!). We may as well make it start this scan from
-nextVictimBuffer, however, so that the first-to-be-written pages are the
-ones that backends might otherwise have to write for themselves soon.
-
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..c6a3be8 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -20,6 +20,7 @@
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
+CkptSortItem *CkptBufferIds;
/*
@@ -65,7 +66,8 @@ void
InitBufferPool(void)
{
bool foundBufs,
- foundDescs;
+ foundDescs,
+ foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +79,21 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- if (foundDescs || foundBufs)
+ /*
+ * The array used to sort to-be-checkpointed buffer ids is located in
+ * shared memory, to avoid having to allocate significant amounts of
+ * memory at runtime. As that'd be in the middle of a checkpoint, or when
+ * the checkpointer is restarted, memory allocation failures would be
+ * painful.
+ */
+ CkptBufferIds = (CkptSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+
+ if (foundDescs || foundBufs || foundBufCkpt)
{
- /* both should be present or neither */
- Assert(foundDescs && foundBufs);
+ /* all should be present or neither */
+ Assert(foundDescs && foundBufs && foundBufCkpt);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -144,5 +157,8 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8c0358e..5fb09c8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "executor/instrument.h"
+#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -47,6 +48,7 @@
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
+#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
@@ -75,6 +77,36 @@ typedef struct PrivateRefCountEntry
/* 64 bytes, about the size of a cache line on common systems */
#define REFCOUNT_ARRAY_ENTRIES 8
+/*
+ * Status of buffers to checkpoint for a particular tablespace, used
+ * internally in BufferSync.
+ */
+typedef struct CkptTsStatus
+{
+ /* oid of the tablespace */
+ Oid tsId;
+
+ /*
+ * Checkpoint progress for this tablespace. To make progress comparable
+ * between tablespaces the progress is, for each tablespace, measured as a
+ * number between 0 and the total number of to-be-checkpointed pages. Each
+ * page checkpointed in this tablespace increments this space's progress
+ * by progress_slice.
+ */
+ float8 progress;
+ float8 progress_slice;
+
+ /* number of to-be checkpointed pages in this tablespace */
+ int num_to_scan;
+ /* already processed pages in this tablespace */
+ int num_scanned;
+
+ /* current offset in CkptBufferIds for this tablespace */
+ int index;
+
+ FileFlushContext flushContext;
+} CkptTsStatus;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -82,6 +114,9 @@ double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
int effective_io_concurrency = 0;
+/* hint to move writes to high priority */
+bool checkpoint_flush_to_disk = DEFAULT_CHECKPOINT_FLUSH_TO_DISK;
+
/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
* ReadBuffer calls by. This is maintained by the assign hook for
@@ -399,7 +434,8 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used,
+ FileFlushContext *flush_context);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -412,10 +448,13 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ FileFlushContext *flush_context);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
+static int ckpt_buforder_comparator(const void *pa, const void *pb);
+static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
@@ -943,6 +982,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
int buf_id;
volatile BufferDesc *buf;
bool valid;
+ static FileFlushContext *context = NULL;
+
+ /* XXX: Should probably rather be in buf_init() */
+ if (context == NULL)
+ {
+ context = MemoryContextAlloc(TopMemoryContext, sizeof(*context));
+ FlushContextInit(context, FLUSH_CONTEXT_DEFAULT_MAX_COALESCE);
+ }
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1078,8 +1125,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
-
- FlushBuffer(buf, NULL);
+ /* FIXME: configurable */
+ FlushBuffer(buf, NULL, context);
LWLockRelease(buf->content_lock);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1637,10 +1684,16 @@ BufferSync(int flags)
{
int buf_id;
int num_to_scan;
- int num_to_write;
+ int num_spaces;
+ int num_processed;
int num_written;
+ CkptTsStatus *per_ts_stat = NULL;
+ Oid last_tsid;
+ binaryheap *ts_heap;
+ int i;
int mask = BM_DIRTY;
+
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1655,7 +1708,7 @@ BufferSync(int flags)
/*
* Loop over all buffers, and mark the ones that need to be written with
- * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
+ * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_scan), so that we
* can estimate how much work needs to be done.
*
* This allows us to write only those pages that were dirty when the
@@ -1669,7 +1722,7 @@ BufferSync(int flags)
* BM_CHECKPOINT_NEEDED still set. This is OK since any such buffer would
* certainly need to be written for the next checkpoint attempt, too.
*/
- num_to_write = 0;
+ num_to_scan = 0;
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1682,32 +1735,144 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ CkptSortItem *item;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
- num_to_write++;
+
+ item = &CkptBufferIds[num_to_scan++];
+ item->buf_id = buf_id;
+ item->tsId = bufHdr->tag.rnode.spcNode;
+ item->relNode = bufHdr->tag.rnode.relNode;
+ item->forkNum = bufHdr->tag.forkNum;
+ item->blockNum = bufHdr->tag.blockNum;
}
UnlockBufHdr(bufHdr);
}
- if (num_to_write == 0)
+ if (num_to_scan == 0)
return; /* nothing to do */
- TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
- *
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Sort buffers that need to be written to reduce the likelihood of random
+ * IO. The sorting is also important for the implementation of balancing
+ * writes between tablespaces. Without balancing writes we'd potentially
+ * end up writing to the tablespaces one-by-one; possibly overloading the
+ * underlying system.
+ */
+ CheckpointStats.ckpt_sort_t = GetCurrentTimestamp();
+ qsort(CkptBufferIds, num_to_scan, sizeof(CkptSortItem),
+ ckpt_buforder_comparator);
+ CheckpointStats.ckpt_sort_end_t = GetCurrentTimestamp();
+
+ num_spaces = 0;
+
+ /*
+ * Allocate progress status for each tablespace with buffers that need to
+ * be flushed. This requires the to-be-flushed array to be sorted.
+ */
+ last_tsid = InvalidOid;
+ for (i = 0; i < num_to_scan; i++)
+ {
+ CkptTsStatus *s;
+ Oid cur_tsid;
+
+ cur_tsid = CkptBufferIds[i].tsId;
+
+ /*
+ * Grow the array of per-tablespace status structs every time a new
+ * tablespace is found.
+ */
+ if (last_tsid == InvalidOid || last_tsid != cur_tsid)
+ {
+ Size sz;
+
+ num_spaces++;
+
+ /*
+ * Not worth adding grow-by-power-of-2 logic here - even with a
+ * few hundred tablespaces this will be fine.
+ */
+ sz = sizeof(CkptTsStatus) * num_spaces;
+
+ if (per_ts_stat == NULL)
+ per_ts_stat = (CkptTsStatus *) palloc(sz);
+ else
+ per_ts_stat = (CkptTsStatus *) repalloc(per_ts_stat, sz);
+
+ s = &per_ts_stat[num_spaces - 1];
+ memset(s, 0, sizeof(*s));
+ s->tsId = cur_tsid;
+
+ /*
+ * The first buffer in this tablespace. As CkptBufferIds is sorted
+ * by tablespace all (s->num_to_scan) buffers in this tablespace
+ * will follow afterwards.
+ */
+ s->index = i;
+
+ /*
+ * The progress_slice will be computed once we know how many
+ * buffers are in this tablespace, i.e. after this loop.
+ */
+
+ last_tsid = cur_tsid;
+ }
+ else
+ {
+ s = &per_ts_stat[num_spaces - 1];
+ }
+
+ s->num_to_scan++;
+ }
+
+ Assert(num_spaces > 0);
+
+ /*
+ * Build a min-heap over the write-progress in the individual tablespaces,
+ * and compute how large a portion of the total progress a single
+ * processed buffer is.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ ts_heap = binaryheap_allocate(num_spaces,
+ ts_ckpt_progress_comparator,
+ NULL);
+
+ for (i = 0; i < num_spaces; i++)
+ {
+ CkptTsStatus *ts_stat = &per_ts_stat[i];
+
+ ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
+
+ FlushContextInit(&ts_stat->flushContext,
+ FLUSH_CONTEXT_DEFAULT_MAX_COALESCE);
+
+ binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
+ }
+
+ binaryheap_build(ts_heap);
+
+ /*
+ * Iterate through to-be-checkpointed buffers and write the ones (still)
+ * marked with BM_CHECKPOINT_NEEDED. The writes are balanced between
+ * tablespaces.
+ */
+ num_processed = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+ while (!binaryheap_empty(ts_heap))
{
- volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ volatile BufferDesc *bufHdr = NULL;
+ CkptTsStatus *ts_stat = (CkptTsStatus *)
+ DatumGetPointer(binaryheap_first(ts_heap));
+
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+ Assert(bufHdr->tag.rnode.spcNode == ts_stat->tsId);
+
+ num_processed++;
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1723,44 +1888,69 @@ BufferSync(int flags)
*/
if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ FileFlushContext *context;
+
+ if (checkpoint_flush_to_disk)
+ context = &ts_stat->flushContext;
+ else
+ context = NULL;
+
+ if (SyncOneBuffer(buf_id, false, context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
num_written++;
+ }
+ }
- /*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
+ /*
+ * Measure progress independent of actually having to flush the buffer
+ * - otherwise writes become unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
- /*
- * Sleep to throttle our I/O rate.
- */
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
- }
+ /* Have all the buffers from the tablespace been processed? */
+ if (ts_stat->num_scanned == ts_stat->num_to_scan)
+ {
+ /*
+ * If there's a pending flush, perform that now, we're finished
+ * with the tablespace.
+ */
+ FlushContextIssuePending(&ts_stat->flushContext);
+
+ binaryheap_remove_first(ts_heap);
+ }
+ else
+ {
+ /* update heap with the new progress */
+ binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Sleep to throttle our I/O rate.
+ */
+ CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+#ifdef CHECKPOINTER_DEBUG
+ /* delete current content of the line, print progress */
+ fprintf(stderr, "\33[2K\rto_scan: %d, scanned: %d, %%processed: %.2f, %%writeouts: %.2f",
+ num_to_scan, num_processed,
+ (((double) num_processed) / num_to_scan) * 100,
+ ((double) num_written / num_processed) * 100);
+#endif
}
+ pfree(per_ts_stat);
+ per_ts_stat = NULL;
+
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
*/
CheckpointStats.ckpt_bufs_written += num_written;
- TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);
+ TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
}
/*
@@ -1818,6 +2008,10 @@ BgBufferSync(void)
long new_strategy_delta;
uint32 new_recent_alloc;
+ FileFlushContext context;
+
+ FlushContextInit(&context, FLUSH_CONTEXT_DEFAULT_MAX_COALESCE);
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -2000,7 +2194,15 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state;
+
+ /*
+ * FIXME: flushing should be configurable.
+ *
+ * Flushing here is important for latency, but also not unproblematic,
+ * because the buffers are written out entirely unsorted.
+ */
+ buffer_state = SyncOneBuffer(next_to_clean, true, &context);
if (++next_to_clean >= NBuffers)
{
@@ -2077,7 +2279,8 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used,
+ FileFlushContext *flush_context)
{
volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
@@ -2118,7 +2321,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, flush_context);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
@@ -2380,9 +2583,16 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The third parameter holds the current flush context, which accumulates
+ * flush requests to be performed in one call instead of on a buffer by
+ * buffer basis; passing it also hints the OS that a high priority write is
+ * meant, possibly because io-throttling is already managed elsewhere.
*/
static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln,
+ FileFlushContext *flush_context)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
@@ -2471,7 +2681,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
- false);
+ false,
+ flush_context);
if (track_io_timing)
{
@@ -2893,7 +3104,8 @@ FlushRelationBuffers(Relation rel)
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ NULL);
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
@@ -2927,7 +3139,7 @@ FlushRelationBuffers(Relation rel)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, rel->rd_smgr);
+ FlushBuffer(bufHdr, rel->rd_smgr, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -2979,7 +3191,7 @@ FlushDatabaseBuffers(Oid dbid)
{
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
- FlushBuffer(bufHdr, NULL);
+ FlushBuffer(bufHdr, NULL, NULL);
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
}
@@ -3701,3 +3913,56 @@ rnode_comparator(const void *p1, const void *p2)
else
return 0;
}
+
+/*
+ * Comparator determining the writeout order in a checkpoint.
+ *
+ * It is important that tablespaces are compared first as the logic balancing
+ * writes between tablespaces relies on it.
+ */
+static int
+ckpt_buforder_comparator(const void *pa, const void *pb)
+{
+ const CkptSortItem *a = (CkptSortItem *) pa;
+ const CkptSortItem *b = (CkptSortItem *) pb;
+
+ /* compare tablespace */
+ if (a->tsId < b->tsId)
+ return -1;
+ else if (a->tsId > b->tsId)
+ return 1;
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* compare block number */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Comparator for a Min-Heap over the, per-tablespace, checkpoint completion
+ * progress.
+ */
+static int
+ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
+{
+ CkptTsStatus *sa = (CkptTsStatus *) a;
+ CkptTsStatus *sb = (CkptTsStatus *) b;
+
+ /* we want a min-heap, so return 1 when a < b */
+ if (sa->progress < sb->progress)
+ return 1;
+ else if (sa->progress == sb->progress)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bc2c773..18e4397 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -358,10 +358,10 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
}
/*
- * StrategySyncStart -- tell BufferSync where to start syncing
+ * StrategySyncStart -- tell BgBufferSync where to start syncing
*
- * The result is the buffer index of the best buffer to sync first.
- * BufferSync() will proceed circularly around the buffer array from there.
+ * The result is the buffer index below the current clock-hand. BgBufferSync()
+ * will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
* the higher-order bits of nextVictimBuffer) and the count of recent buffer
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3144afe..c508fc6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,7 +208,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bufHdr->tag.forkNum,
bufHdr->tag.blockNum,
localpage,
- false);
+ false,
+ NULL);
/* Mark not-dirty now in case we error out below */
bufHdr->flags &= ~BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ea4d689..f2913df 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -317,7 +317,8 @@ BufFileDumpBuffer(BufFile *file)
return; /* seek failed, give up */
file->offsets[file->curFile] = file->curOffset;
}
- bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite);
+ bytestowrite = FileWrite(thisfile, file->buffer + wpos, bytestowrite,
+ NULL);
if (bytestowrite <= 0)
return; /* failed to write */
file->offsets[file->curFile] += bytestowrite;
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 41b2c62..81c9754 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -190,9 +190,9 @@ copy_file(char *fromfile, char *tofile)
/*
* We fsync the files later but first flush them to avoid spamming the
* cache and hopefully get the kernel to start writing them out before
- * the fsync comes. Ignore any error, since it's only a hint.
+ * the fsync comes.
*/
- (void) pg_flush_data(dstfd, offset, nbytes);
+ pg_flush_data(dstfd, offset, nbytes);
}
if (CloseTransientFile(dstfd))
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1ba4946..2974c2b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -61,6 +61,9 @@
#include <sys/file.h>
#include <sys/param.h>
#include <sys/stat.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
#include <unistd.h>
#include <fcntl.h>
#ifdef HAVE_SYS_RESOURCE_H
@@ -82,6 +85,8 @@
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
#define PG_FLUSH_DATA_WORKS 1
+#elif !defined(WIN32) && defined(MS_ASYNC)
+#define PG_FLUSH_DATA_WORKS 1
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define PG_FLUSH_DATA_WORKS 1
#endif
@@ -380,29 +385,128 @@ pg_fdatasync(int fd)
}
/*
- * pg_flush_data --- advise OS that the data described won't be needed soon
+ * pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * Not all platforms have sync_file_range or posix_fadvise; treat as no-op
- * if not available. Also, treat as no-op if enableFsync is off; this is
- * because the call isn't free, and some platforms such as Linux will actually
- * block the requestor until the write is scheduled.
+ * An offset of 0 with an nbytes of 0 means that the entire file should be
+ * flushed.
*/
-int
-pg_flush_data(int fd, off_t offset, off_t amount)
+void
+pg_flush_data(int fd, off_t offset, off_t nbytes)
{
#ifdef PG_FLUSH_DATA_WORKS
- if (enableFsync)
- {
+
+ /*
+ * Right now file flushing is primarily used to avoid making later
+ * fsync()/fdatasync() calls have a significant impact. Thus don't trigger
+ * flushes if fsyncs are disabled - that's a decision we might want to
+ * make configurable at some point.
+ */
+ if (!enableFsync)
+ return;
+
#if defined(HAVE_SYNC_FILE_RANGE)
- return sync_file_range(fd, offset, amount, SYNC_FILE_RANGE_WRITE);
+ {
+ int rc = 0;
+
+ /*
+ * sync_file_range(2), currently linux specific, with
+ * SYNC_FILE_RANGE_WRITE as a parameter tells the OS that writeback
+ * for the passed in blocks should be started, but that we don't want
+ * to wait for completion. Note that this call might block if too
+ * much dirty data exists in the range. This is the preferable
+ * method on OSs supporting it, as it works reliably when available
+ * (in contrast to msync()) and doesn't flush out clean data (like
+ * FADV_DONTNEED).
+ */
+ rc = sync_file_range(fd, offset, nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+ /* don't error out, this is just a performance optimization */
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data: %m")));
+ }
+ }
+#elif !defined(WIN32) && defined(MS_ASYNC)
+ {
+ int rc = 0;
+ void *p;
+
+ /*
+ * On many OSs msync() on a mmap'ed file triggers writeback. On linux
+ * it only does so when MS_SYNC is specified, but then it does the
+ * writeback in the foreground. Luckily all common linux systems have
+ * sync_file_range(). This is preferable over FADV_DONTNEED because
+ * it doesn't flush out clean data.
+ *
+ * We map the file (mmap()), tell the kernel to sync back the contents
+ * (msync()), and then remove the mapping again (munmap()).
+ */
+
+ p = mmap(NULL, nbytes,
+ PROT_READ | PROT_WRITE, MAP_SHARED,
+ fd, offset);
+ if (p == MAP_FAILED)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not mmap while flushing dirty data: %m")));
+ return;
+ }
+
+ rc = msync(p, nbytes, MS_ASYNC);
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data: %m")));
+ /* NB: need to fall through to munmap()! */
+ }
+
+ rc = munmap(p, nbytes);
+ if (rc != 0)
+ {
+ /* FATAL error because mapping would remain */
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not munmap while flushing dirty data: %m")));
+ }
+ }
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
+ {
+ int rc = 0;
+
+ /*
+ * Signal the kernel that the passed in range should not be cached
+ * anymore. This has the desired side effect of writing out dirty
+ * data, and the undesired side effect of likely discarding useful
+ * clean cached blocks. For the latter reason this is the least
+ * preferable method.
+ */
+
+ rc = posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
+
+ /* don't error out, this is just a performance optimization */
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data: %m")));
+ }
+ }
#else
#error PG_FLUSH_DATA_WORKS should not have been defined
#endif
- }
-#endif
- return 0;
+
+#endif /* PG_FLUSH_DATA_WORKS */
}
@@ -1345,7 +1449,8 @@ retry:
}
int
-FileWrite(File file, char *buffer, int amount)
+FileWrite(File file, char *buffer, int amount,
+ FileFlushContext *flush_context)
{
int returnCode;
@@ -1408,6 +1513,11 @@ retry:
VfdCache[file].fileSize = newPos;
}
}
+
+ /* update bulk flush state */
+ if (flush_context != NULL)
+ FlushContextSchedule(flush_context, file,
+ VfdCache[file].seekPos, amount);
}
else
{
@@ -1579,6 +1689,103 @@ FilePathName(File file)
/*
+ * Initialize a FileFlushContext, discarding potential previous state in
+ * context.
+ *
+ * max_coalesce is the maximum number of flush requests that will be coalesced
+ * into a bigger one. 0 means there is no limit.
+ */
+void
+FlushContextInit(FileFlushContext *context, int max_coalesce)
+{
+ context->max_coalesce = max_coalesce;
+ context->file = -1;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+}
+
+/*
+ * Schedule writeout of a range of bytes in a file.
+ */
+void
+FlushContextSchedule(FileFlushContext *context,
+ File file, off_t offset, off_t nbytes)
+{
+ /*
+ * If the new range of blocks is in the same file as a previous request
+ * try to coalesce with previous requests. That increases the chance that
+ * these writeouts can be coalesced in the OSs IO layer and decreases the
+ * number of syscalls. If there are a lot of outstanding flush requests,
+ * immediately trigger writeout of the previously accumulated blocks to avoid overflowing
+ * request queues and the like, thereby causing latency spikes.
+ */
+ if (context->file == file && context->ncalls != 0)
+ {
+ int64 startoff;
+ int64 endoff;
+
+ /* merge current flush with previous ones */
+ startoff = Min(context->offset, offset);
+ endoff = Max(context->offset + context->nbytes, offset + nbytes);
+
+ context->offset = startoff;
+ context->nbytes = endoff - startoff;
+ context->ncalls++;
+
+ /*
+ * Accumulated enough dirty ranges - flush now. XXX: It might be
+ * worthwhile to count actual bytes that we've been asked to flush,
+ * and to have additional limits; but that's for another day.
+ */
+ if (context->max_coalesce > 0 &&
+ context->ncalls >= context->max_coalesce)
+ FlushContextIssuePending(context);
+ }
+ else
+ {
+ /* flush previous file & reset flush accumulator */
+ FlushContextIssuePending(context);
+
+ context->file = file;
+ context->ncalls = 1;
+ context->offset = offset;
+ context->nbytes = nbytes;
+ }
+}
+
+/*
+ * Issue all pending flush requests previously scheduled with
+ * FlushContextSchedule to the OS.
+ *
+ * Because this is, currently, only used to improve the OSs IO scheduling we
+ * try hard to never error out - it's just a hint.
+ */
+void
+FlushContextIssuePending(FileFlushContext *context)
+{
+ int rc;
+
+ if (context->ncalls == 0)
+ return;
+
+ rc = FileAccess(context->file);
+ if (rc < 0)
+ return;
+
+ pg_flush_data(VfdCache[context->file].fd,
+ context->offset, context->nbytes);
+
+ context->file = -1;
+ context->ncalls = 0;
+ context->offset = 0;
+ context->nbytes = 0;
+}
+
+
+/*
* Make room for another allocatedDescs[] array entry if needed and possible.
* Returns true if an array element is available.
*/
@@ -2655,9 +2862,10 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
}
/*
- * We ignore errors from pg_flush_data() because this is only a hint.
+ * pg_flush_data() ignores errors, which is ok because this is only a
+ * hint.
*/
- (void) pg_flush_data(fd, 0, 0);
+ pg_flush_data(fd, 0, 0);
(void) CloseTransientFile(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..eeaac07 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -531,7 +531,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ)) != BLCKSZ)
+ if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, NULL)) != BLCKSZ)
{
if (nbytes < 0)
ereport(ERROR,
@@ -738,7 +738,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, FileFlushContext *flush_context)
{
off_t seekpos;
int nbytes;
@@ -767,7 +767,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
errmsg("could not seek to block %u in file \"%s\": %m",
blocknum, FilePathName(v->mdfd_vfd))));
- nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
+ nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, flush_context);
TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..31c15a6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -52,7 +52,8 @@ typedef struct f_smgr
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ FileFlushContext *flush_context);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -643,10 +644,11 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer, bool skipFsync)
+ char *buffer, bool skipFsync, FileFlushContext *flush_context)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
- buffer, skipFsync);
+ buffer, skipFsync,
+ flush_context);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fda0fb9..b72f782 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1004,6 +1004,18 @@ static struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+
+ {
+ {"checkpoint_flush_to_disk", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Hint that checkpoint's writes are high priority."),
+ NULL
+ },
+ &checkpoint_flush_to_disk,
+ /* see bufmgr.h: true on Linux, false otherwise */
+ DEFAULT_CHECKPOINT_FLUSH_TO_DISK,
+ NULL, NULL, NULL
+ },
+
{
{"log_connections", PGC_SU_BACKEND, LOGGING_WHAT,
gettext_noop("Logs each successful connection."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dcf929f..20726dc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -202,6 +202,8 @@
#max_wal_size = 1GB
#min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_flush_to_disk = ? # send buffers to disk on checkpoint
+ # default is on if Linux, off otherwise
#checkpoint_warning = 30s # 0 disables
# - Archiving -
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 790ca66..11815a8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,8 @@ extern bool XLOG_DEBUG;
typedef struct CheckpointStatsData
{
TimestampTz ckpt_start_t; /* start of checkpoint */
+ TimestampTz ckpt_sort_t; /* start buffer sorting */
+ TimestampTz ckpt_sort_end_t; /* end of sorting */
TimestampTz ckpt_write_t; /* start of flushing buffers */
TimestampTz ckpt_sync_t; /* start of fsyncs */
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..1628154 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -210,6 +210,24 @@ extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible.
+ */
+typedef struct CkptSortItem
+{
+ Oid tsId;
+ Oid relNode;
+ ForkNumber forkNum;
+ BlockNumber blockNum;
+ int buf_id;
+} CkptSortItem;
+
+extern CkptSortItem *CkptBufferIds;
/*
* Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0f59201..28a3deb 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -55,6 +55,14 @@ extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
+#endif /* HAVE_SYNC_FILE_RANGE */
+
+extern bool checkpoint_flush_to_disk;
+
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 7eabe09..a05500a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -59,6 +59,34 @@ extern int max_files_per_process;
*/
extern int max_safe_fds;
+/*
+ * FlushContext structure - This is used to accumulate several flush requests
+ * made by one callsite into a larger flush request.
+ */
+typedef struct FileFlushContext
+{
+ /* max number of flush requests to coalesce */
+ int max_coalesce;
+ /* VFD of the last file processed or -1 */
+ File file;
+ /* number of flush requests merged together */
+ int ncalls;
+ /* offset to start flushing (minimum of all offsets) */
+ int64 offset;
+
+ /*
+ * Size (minimum extent to cover all flushed data). If 0 but ncalls > 0,
+ * the whole file should be flushed.
+ */
+ int64 nbytes;
+} FileFlushContext;
+
+/*
+ * By default coalesce up to 64 flush requests to the same file. As flush
+ * requests usually are BLCKSZ large, that amounts to about the size of common
+ * IO request queues.
+ */
+#define FLUSH_CONTEXT_DEFAULT_MAX_COALESCE 64
/*
* prototypes for functions in fd.c
@@ -70,11 +98,16 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount);
extern int FileRead(File file, char *buffer, int amount);
-extern int FileWrite(File file, char *buffer, int amount);
+extern int FileWrite(File file, char *buffer, int amount,
+ FileFlushContext *flush_context);
extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
extern char *FilePathName(File file);
+extern void FlushContextInit(FileFlushContext *context, int max_coalesce);
+extern void FlushContextIssuePending(FileFlushContext *context);
+extern void FlushContextSchedule(FileFlushContext *context, File file,
+ off_t offset, off_t nbytes);
/* Operations that allow use of regular stdio --- USE WITH CAUTION */
extern FILE *AllocateFile(const char *name, const char *mode);
@@ -112,7 +145,7 @@ extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
-extern int pg_flush_data(int fd, off_t offset, off_t amount);
+extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern void fsync_fname(char *fname, bool isdir);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..e95b859 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
#include "fmgr.h"
#include "storage/block.h"
+#include "storage/fd.h"
#include "storage/relfilenode.h"
@@ -95,7 +96,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ FileFlushContext *flush_context);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -121,7 +123,8 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
+ BlockNumber blocknum, char *buffer, bool skipFsync,
+ FileFlushContext *flush_context);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 03e1d2c..2f00050 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -576,6 +576,7 @@ FileNameMap
FindSplitData
FixedParallelState
FixedParamState
+FileFlushContext
FmgrBuiltin
FmgrHookEventType
FmgrInfo
--
2.6.0.rc3
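For reviewers, here is roughly how a write loop is expected to drive the
new flush API above. This is only a sketch: the buffer iteration and the
names num_to_write, reln, forknum, blocknum and buffer are illustrative,
not part of the patch.

/*
 * Sketch of a checkpointer-style write loop using the FileFlushContext
 * API added by the patch.  The iteration details are illustrative.
 */
FileFlushContext flush_context;
int         i;

FlushContextInit(&flush_context, FLUSH_CONTEXT_DEFAULT_MAX_COALESCE);

for (i = 0; i < num_to_write; i++)
{
    /*
     * smgrwrite() hands the context down to FileWrite(), which calls
     * FlushContextSchedule() for each written block.  Requests against
     * the same file are merged into one range; once max_coalesce
     * requests have accumulated, the pending range is issued via
     * pg_flush_data().
     */
    smgrwrite(reln, forknum, blocknum[i], buffer[i],
              false,    /* skipFsync */
              &flush_context);
}

/* issue whatever is still pending before moving on to the fsync phase */
FlushContextIssuePending(&flush_context);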
Hello Andres,
And here's v14. It's not something entirely ready.
I'm going to have a careful look at it.
A lot of details have changed, I unfortunately don't remember them all.
But there are more important things than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:
1) The FileFlushContext context infrastructure isn't actually
correct. There's two problems: First, using the actual 'fd' number to
reference a to-be-flushed file isn't meaningful. If there are lots
of files open, fds get reused within fd.c.
Hmm.
My assumption is that a file being used (i.e. with modified pages, being
used for writes...) would not be closed before everything is cleared...
After some poking in the code, I think that this issue may indeed be
there, although the probability of hitting it is close to 0, but alas not
0:-)
To fix it, ISTM that it is enough to hold a "do not close lock" on the
file while a flush is in progress (a short time), which would prevent
mdclose from doing its stuff.
That part is easily enough fixed by referencing the File instead of the
fd. The bigger problem is that the infrastructure doesn't deal with files
being closed. There can, which isn't that hard to trigger, be smgr
invalidations causing the smgr handle and thus the file to be closed.

I think this means that the entire flushing infrastructure actually
needs to be hoisted up, onto the smgr/md level.
Hmmm. I'm not sure that it is necessary, see above my suggestion.
2) I noticed that sync_file_range() blocked far more often than I'd
expected. Reading the kernel code that turned out to be caused by a
pessimization in the kernel introduced years ago - in many situation
SFR_WRITE waited for the writes. A fix for this will be in the 4.4
kernel.
Alas, Pg cannot help issues in the kernel.
3) I found that latency wasn't improved much for workloads that are
significantly bigger than shared buffers. The problem here is that
neither bgwriter nor the backends have, so far, done
sync_file_range() calls. That meant that the old problem of having
gigabytes of dirty data that periodically get flushed out, still
exists. Having these do flushes mostly attacks that problem.
I'm conscious that the patch only addresses *checkpointer* writes, not
those from bgwrither or backends writes. I agree that these should need to
be addressed at some point as well, but given the time to get a patch
through, the more complex the slower (sort propositions are 10 years old),
I think this should be postponed for later.
Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. Even without
the patch from 2), although fixing that improves things further.
This is consistent with my experiments: sorting improves things, and
flushing on top of sorting improves things further.
What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.
I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance
regression? (available memory, shared buffers, db size, ...).
The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(
I'm quite surprised at such a large effect on throughput, though.
This explanation seems to suggest that if bgwriter/workders write are
sorted and/or coordinated with the checkpointer somehow then all would be
well?
ISTM that this explanation could be checked by looking whether
bgwriter/workers writes are especially large compared to checkpointer
writes in those cases with reduced throughput? The data is in the log.
My benchmarking suggests that that effect is the larger, the shorter the
checkpoint timeout is.
Hmmm. The shorter the timeout, the more likely the sorting NOT to be
effective, and the more likely to go back to random I/Os, and maybe to
see some effect of the sync strategy stuff.
That seems to intuitively make sense, given the above explanation
attempt. If the checkpoint takes longer the clock hand will almost
certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet lagged, I still spent a fair amount of time thinking about
it. Seems to suggest that we need to bring back the setting to
enable/disable sorting :(

What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
handling the issue of smgr invalidations.
Not sure that much is necessary, see above.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
later can contain multiple elements like checkpoint, bgwriter,
backends, ddl, bulk-writes. That seems better than adding GUCs for
these separately. Then make the flush locations in the patch
configurable using that.
My 0,02€ on this point: I have not seen much of this style of guc
elsewhere. The only one I found while scanning the postgres file are
*_path and *_libraries. It seems to me that this would depart
significantly from the usual style, so one guc per case, or one shared guc
but with only on/off, would blend in more cleanly with the usual style.
3) I think we should remove the sort timing from the checkpoint logging
before commit. It'll always be pretty short.
I added it to show that it was really short, in response to concerns that
my approach of just sorting through indexes to reduce the memory needed
instead of copying the data to be sorted did not induce significant
performance issues. I proved my point, but peer pressure made me switch
to larger memory anyway.
I think it should be kept while the features are under testing. I do not
think that it harms in anyway.
--
Fabien.
Hi,
On 2015-11-12 15:31:41 +0100, Fabien COELHO wrote:
A lot of details have changed, I unfortunately don't remember them all.
But there are more important things than the details of the patch.I've played *a lot* with this patch. I found a bunch of issues:
1) The FileFlushContext context infrastructure isn't actually
correct. There's two problems: First, using the actual 'fd' number to
reference a to-be-flushed file isn't meaningful. If there are lots
of files open, fds get reused within fd.c.

Hmm.
My assumption is that a file being used (i.e. with modified pages, being used
for writes...) would not be closed before everything is cleared...
That's likely, but far from guaranteed.
After some poking in the code, I think that this issue may indeed be there,
although the probability of hitting it is close to 0, but alas not 0:-)
I did hit it...
To fix it, ITSM that it is enough to hold a "do not close lock" on the file
while a flush is in progress (a short time) that would prevent mdclose to do
its stuff.
Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct? If that, how would you deal with FileClose()
being called?
3) I found that latency wasn't improved much for workloads that are
significantly bigger than shared buffers. The problem here is that
neither bgwriter nor the backends have, so far, done
sync_file_range() calls. That meant that the old problem of having
gigabytes of dirty data that periodically get flushed out, still
exists. Having these do flushes mostly attacks that problem.

I'm conscious that the patch only addresses *checkpointer* writes, not those
from bgwrither or backends writes. I agree that these should need to be
addressed at some point as well, but given the time to get a patch through,
the more complex the slower (sort propositions are 10 years old), I think
this should be postponed for later.
I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.
What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.

I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).
I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue is described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from occurring.
I'll show actual numbers at some point yes. I tried three different systems:
* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s. I could
see issues in workloads ranging from scale 300 to 5000. Throughput
regressions are visible for both sync_commit on/off workloads. Here
the largest regressions were visible.
* my workstation: 24GB Ram, 2x E5520, a) Raid 10 of 4 4TB, 7.2krpm
devices b) Raid 1 of 2 m4 512GB SSDs. One of the latter was killed
during the test. Both showed regressions, but smaller.
* EC2 d2.8xlarge, 244 GB RAM, 24 x 2000 HDD, 64GB shared_buffers. I
tried scale 3000,8000,15000. Here sorting, without flushing, didn't
lead much to regressions.
I think generally the regressions were visible with a) noticeable shared
buffers, b) workload not fitting into shared buffers, c) significant
throughput, leading to high cache replacement ratios.
Another thing that's worthwhile to mention, while not surprising, is
that the benefits of this patch are massively smaller when WAL and data
are separated onto different disks. For workloads fitting into
shared_buffers I saw no performance difference - not particularly
surprising. I guess if you'd construct a case where the data, not WAL,
is the bottleneck that'd be different. Also worthwhile to mention that
the separate-disk setup was noticeably faster.
The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(

I'm quite surprised at such a large effect on throughput, though.
Me too.
This explanation seems to suggest that if bgwriter/workers writes are
and/or coordinated with the checkpointer somehow then all would be well?
Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order. We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.
ISTM that this explanation could be checked by looking whether
bgwriter/workers writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the log.
What do you mean with "large"? Numerous?
My benchmarking suggests that that effect is the larger, the shorter the
checkpoint timeout is.

Hmmm. The shorter the timeout, the more likely the sorting NOT to be
effective
You mean, as evidenced by the results, or is that what you'd actually
expect?
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
later can contain multiple elements like checkpoint, bgwriter,
backends, ddl, bulk-writes. That seems better than adding GUCs for
these separately. Then make the flush locations in the patch
configurable using that.My 0,02€ on this point: I have not seen much of this style of guc elsewhere.
The only one I found while scanning the postgres file are *_path and
*_libraries. It seems to me that this would depart significantly from the
usual style, so one guc per case, or one shared guc but with only on/off,
would blend in more cleanly with the usual style.
Such a guc would allow one 'on' and 'off' setting, and either would
hopefully be the norm. That seems advantageous to me.
3) I think we should remove the sort timing from the checkpoint logging
before commit. It'll always be pretty short.

I added it to show that it was really short, in response to concerns that my
approach of just sorting through indexes to reduce the memory needed instead
of copying the data to be sorted did not induce significant performance
issues. I proved my point, but peer pressure made me switch to larger
memory anyway.
Grumble. I'm getting a bit tired about this topic. This wasn't even
remotely primarily about sorting speed, and you damn well know it.
I think it should be kept while the features are under testing. I do not
think that it harms in anyway.
That's why I said we should remove it *before commit*.
Greetings,
Andres Freund
To fix it, ISTM that it is enough to hold a "do not close lock" on the file
while a flush is in progress (a short time) that would prevent mdclose from
doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?
Basically yes, I'm suggesting a mutex in the vfd struct.
If that, how would you deal with FileClose() being called?
Just wait for the mutex, which would be held while flushes are accumulated
into the flush context and released after the flush is performed and the
fd is not necessary anymore for this purpose, which is expected to be
short (at worst between the wake & sleep of the checkpointer, and just one
file at a time).
I'm conscious that the patch only addresses *checkpointer* writes, not those
from bgwrither or backends writes. I agree that these should need to be
addressed at some point as well, but given the time to get a patch through,
the more complex the slower (sort propositions are 10 years old), I think
this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.
Hmmm. See below.
What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.

I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).

I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue is described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from occurring.

I'll show actual numbers at some point yes. I tried three different systems:
* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s.
Hmmm. This is quite short. I tend to do tests with much larger timeouts. I
would advise against a short timeout esp. in a high throughput system, the
whole point of the checkpointer is to accumulate as many changes as
possible.
I'll look into that.
This explanation seems to suggest that if bgwriter/workers writes are sorted
and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order. We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.
Maybe the sorting could be shared with others so that everybody uses the
same order?
That would suggest to have one global sorting of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to
scan the buffers (in file order), instead of scanning them in memory
order.
For this purpose, I think that the initial index-based sorting would
suffice. Could be resorted periodically with some delay maintained in a
guc, or when significant buffer changes have occurred (read & writes).
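For concreteness, the file-order sorting discussed here boils down to a
comparator over the patch's CkptSortItem entries, along the following
lines (a sketch; the patch's actual comparator may differ in detail):

/*
 * Sketch of a file-order comparator over CkptSortItem, usable with
 * qsort().  Ordering by tablespace, relation, fork and block groups
 * writes to the same file together, in ascending block order.
 */
static int
ckpt_buforder_comparator(const void *pa, const void *pb)
{
    const CkptSortItem *a = (const CkptSortItem *) pa;
    const CkptSortItem *b = (const CkptSortItem *) pb;

    if (a->tsId != b->tsId)
        return (a->tsId < b->tsId) ? -1 : 1;
    if (a->relNode != b->relNode)
        return (a->relNode < b->relNode) ? -1 : 1;
    if (a->forkNum != b->forkNum)
        return (a->forkNum < b->forkNum) ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return (a->blockNum < b->blockNum) ? -1 : 1;
    return 0;
}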
ISTM that this explanation could be checked by looking whether
bgwriter/workers writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the log.

What do you mean with "large"? Numerous?
I mean the amount of buffers written by bgwriter/worker is greater than
what is written by the checkpointer. If all fits in shared buffers,
bgwriter/worker mostly do not need to write anything and the checkpointer
does all the writes.
The larger the memory needed, the more likely workers/bgwriter will have
to kick in and generate random I/Os because nothing sensible is currently
done, so this is consistent with your findings, although I'm surprised
that it would have a large effect on throughput, as already said.
Hmmm. The shorter the timeout, the more likely the sorting NOT to be
effective

You mean, as evidenced by the results, or is that what you'd actually
expect?
What I would expect...
--
Fabien.
On 2015-11-12 17:44:40 +0100, Fabien COELHO wrote:
To fix it, ISTM that it is enough to hold a "do not close lock" on the file
while a flush is in progress (a short time) that would prevent mdclose from
doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.
I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.
* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short.
Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.
Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order. We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.

Maybe the sorting could be shared with others so that everybody uses the
same order?

That would suggest to have one global sorting of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to scan
the buffers (in file order), instead of scanning them in memory order.
Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.
Hmmm. The shorter the timeout, the more likely the sorting NOT to be
effective

You mean, as evidenced by the results, or is that what you'd actually
expect?

What I would expect...
I don't see why then? If you very quickly write lots of data the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?
Greetings,
Andres Freund
Hello,
Basically yes, I'm suggesting a mutex in the vfd struct.
I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.
Probably something is eluding my comprehension:-)
My basic assumption is that the fopen & fd is per process, so we just have
to deal with the one in the checkpointer process, so it is enough that the
checkpointer does not close the file while it is flushing things to it?
* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short.
Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.
I would say that it would render sorting ineffective because all the
rewriting is done by bgwriter or workers, which does not totally explain
why the throughput would be worse than before; I would expect it to be as
bad as before...
Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order. We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.

Maybe the sorting could be shared with others so that everybody uses the
same order?

That would suggest to have one global sorting of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to scan
the buffers (in file order), instead of scanning them in memory order.

Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.
I understand that, but there is a balance to find. Generating random I/Os
is very bad for performance, so the decision process must combine LRU/LFU
heuristics with considering things in some order as well.
Hmmm. The shorter the timeout, the more likely the sorting NOT to be
effective

You mean, as evidenced by the results, or is that what you'd actually
expect?

What I would expect...
I don't see why then? If you very quickly write lots of data the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?
What I have in mind is: the shorter the timeout the less neighboring
buffers will be touched, so the less nice sequential writes will be found
by sorting them, so the smaller the positive impact on performance...
--
Fabien.
Basically yes, I'm suggesting a mutex in the vfd struct.
I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.

Probably something is eluding my comprehension :-)
My basic assumption is that the fopen & fd is per process, so we just have to
deal with the one in the checkpointer process, so it is enough that the
checkpointer does not close the file while it is flushing things to it?
Hmmm...
Maybe I'm a little bit too optimistic here, because it seems that I'm
suggesting to create a deadlock if the checkpointer has buffers waiting
to be flushed and wishes to close the very same file that holds them.
So on wanting to close the file the checkpointer should rather flush the
outstanding flushes in wait and then close the fd, which suggests some
global variable to hold the flush context so that this can be done.
Hmmm.
--
Fabien.
On Wed, Nov 11, 2015 at 1:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
Here is a v13, which is just a rebase after 1aba62ec.
3) I found that latency wasn't improved much for workloads that are
significantly bigger than shared buffers. The problem here is that
neither bgwriter nor the backends have, so far, done
sync_file_range() calls. That meant that the old problem of having
gigabytes of dirty data that periodically get flushed out, still
exists. Having these do flushes mostly attacks that problem.

Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. Even without
the patch from 2), although fixing that improves things further.

What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%. The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(
That sounds like a tricky problem. I think the way to improve the current
situation is to change the buffer allocation algorithm such that, instead
of the backend issuing the write for a dirty buffer, it just continues to
the next free buffer when it finds that the selected buffer is dirty, and
if it cannot find a non-dirty buffer after a certain number of attempts,
it signals bgwriter to write out some buffers. The writing algorithm of
bgwriter then has to be such that it picks buffers in chunks from a
checkpoint-list, sorts them, and then writes them (see the sketch after
this paragraph). Checkpoint also uses the same checkpoint-list to flush
the dirty buffers. This will ensure that the writes are always sorted,
irrespective of which process does them. There could be multiple ways to
form this checkpoint-list; one of them could be for MarkBufferDirty() to
add the buffer to such a list. I think following such a mechanism could
solve the problem of unsorted writes in the system, but it raises a
question: what kind of latency could such a mechanism introduce for a
backend that signals bgwriter after not finding a non-dirty buffer for a
certain number of attempts? If we sense this could be a problematic case,
then we can make both bgwriter and checkpoint always start from the next
victim buffer and then traverse the checkpoint-list.
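If I read the proposal correctly, the bgwriter side would then look
roughly as below. This is pure sketch: CheckpointListPopChunk(),
WriteOneBuffer() and CHUNK_SIZE are hypothetical names, and the comparator
is the file-order one sketched earlier in the thread.

/*
 * Hypothetical sketch of the proposal: bgwriter drains dirty buffers in
 * chunks from a shared checkpoint-list, sorts each chunk into file order,
 * and only then writes, so that writes end up sorted no matter which
 * process issues them.  CheckpointListPopChunk() and WriteOneBuffer()
 * are illustrative names, not existing routines.
 */
#define CHUNK_SIZE 128

static void
BgWriterWriteSortedChunk(void)
{
    CkptSortItem chunk[CHUNK_SIZE];
    int          n;
    int          i;

    /* take up to CHUNK_SIZE entries off the shared dirty-buffer list */
    n = CheckpointListPopChunk(chunk, CHUNK_SIZE);

    /* sort into (tablespace, relation, fork, block) order */
    qsort(chunk, n, sizeof(CkptSortItem), ckpt_buforder_comparator);

    for (i = 0; i < n; i++)
        WriteOneBuffer(chunk[i].buf_id);
}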
My benchmarking suggests that that effect is the larger, the shorter the
checkpoint timeout is. That seems to intuitively make sense, given the
above explanation attempt. If the checkpoint takes longer the clock hand
will almost certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet lagged, I still spent a fair amount of time thinking about
it. Seems to suggest that we need to bring back the setting to
enable/disable sorting :(

What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
later can contain multiple elements like checkpoint, bgwriter,
backends, ddl, bulk-writes. That seems better than adding GUCs for
these separately. Then make the flush locations in the patch
configurable using that.
3) I think we should remove the sort timing from the checkpoint logging
before commit. It'll always be pretty short.
It seems that for now you have left out the Windows-specific implementation
in pg_flush_data().
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hmmm...
Maybe I'm a little bit too optimistic here, because it seems that I'm
suggesting to create a deadlock if the checkpointer has buffers waiting
to be flushed and wishes to close the very same file that holds them.

So on wanting to close the file the checkpointer should rather flush the
outstanding flushes in wait and then close the fd, which suggests some
global variable to hold the flush context so that this can be done.

Hmmm.
On third (fourth, fifth:-) thoughts:
The vfd (virtual file descriptor?) structure in the checkpointer could
keep a pointer to the current flush if it concerns this fd, so that if it
decides to close it while there is a write in progress (I'm still baffled
at why and when the checkpointer process would take such a decision, maybe
while responding to some signals, because it seems that there is no such
event in the checkpointer loop itself...) then on close the process could
flush before close, or just close which probably would induce flushing,
but at least clean up the structure so that the closed fd would not be
flushed after being closed and result in an error.
--
Fabien.
Hi,
I'm planning to do some thorough benchmarking of the patches proposed in
this thread, on various types of hardware (10k SAS drives and SSDs). But
is that actually needed? I see Andres did some testing, as he posted
summary of the results on 11/12, but I don't see any actual results or
even info about what benchmarks were done (pgbench?).
If yes, do we only want to compare 0001-ckpt-14-andres.patch against
master, or do we need to test one of the previous Fabien's patches?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Tomas,
I'm planning to do some thorough benchmarking of the patches proposed in this
thread, on various types of hardware (10k SAS drives and SSDs). But is that
actually needed? I see Andres did some testing, as he posted summary of the
results on 11/12, but I don't see any actual results or even info about what
benchmarks were done (pgbench?).

If yes, do we only want to compare 0001-ckpt-14-andres.patch against master,
or do we need to test one of the previous Fabien's patches?
My 0.02€,
Although I disagree with some aspects of Andres patch, I'm not a committer
and I'm tired of arguing. I'm just planning to do minor changes to Andres'
version to fix a potential issue if the file is closed while a flush is
in progress, but that will not change the overall shape of it.
So testing on Andres' version seems relevant to me.
For SSD the performance impact should be limited. For disk it should be
significant if there is no big cache in front of it. There were some
concerns raised for some loads in the thread (shared memory smaller than
needed I think?), if you can include such cases that would be great. My
guess is that it should not be very beneficial in this case, because the
writing is mostly done by bgwriter & workers, and these writes are
still random.
--
Fabien.
On Thu, Dec 17, 2015 at 4:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Tomas,
I'm planning to do some thorough benchmarking of the patches proposed in
this thread, on various types of hardware (10k SAS drives and SSDs). But is
that actually needed? I see Andres did some testing, as he posted summary of
the results on 11/12, but I don't see any actual results or even info about
what benchmarks were done (pgbench?).If yes, do we only want to compare 0001-ckpt-14-andres.patch against
master, or do we need to test one of the previous Fabien's patches?

My 0.02€,
Although I disagree with some aspects of Andres patch, I'm not a committer
and I'm tired of arguing. I'm just planning to do minor changes to Andres'
version to fix a potential issue if the file is closed while a flush is in
progress, but that will not change the overall shape of it.

So testing on Andres' version seems relevant to me.
For SSD the performance impact should be limited. For disk it should be
significant if there is no big cache in front of it. There were some
concerns raised for some loads in the thread (shared memory smaller than
needed I think?), if you can include such cases that would be great. My
guess is that it should not be very beneficial in this case, because the
writing is mostly done by bgwriter & workers, and these writes are
still random.
As there are still plans to move on regarding tests (and because this
patch makes a difference), this is moved to next CF.
--
Michael
Hi,
On 12/16/2015 08:27 PM, Fabien COELHO wrote:
Hello Tomas,
I'm planning to do some thorough benchmarking of the patches proposed
in this thread, on various types of hardware (10k SAS drives and
SSDs). But is that actually needed? I see Andres did some testing, as
he posted summary of the results on 11/12, but I don't see any actual
results or even info about what benchmarks were done (pgbench?).If yes, do we only want to compare 0001-ckpt-14-andres.patch against
master, or do we need to test one of the previous Fabien's patches?

My 0.02€,
Although I disagree with some aspects of Andres patch, I'm not a
committer and I'm tired of arguing. I'm just planning to do minor changes
to Andres' version to fix a potential issue if the file is closed while
a flush is in progress, but that will not change the overall shape of it.

So testing on Andres' version seems relevant to me.
The patch no longer applies to master. Can someone rebase it?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-06 21:01:47 +0100, Tomas Vondra wrote:
Although I disagree with some aspects of Andres patch, I'm not a
committer and I'm tired of arguing. I'm just planning to do minor changes
to Andres' version to fix a potential issue if the file is closed while
a flush is in progress, but that will not change the overall shape of it.
Are you working on that aspect?
So testing on Andres' version seems relevant to me.
The patch no longer applies to master. Can someone rebase it?
I'm working on an updated version, trying to mitigate the performance
regressions I observed.
Andres
<Ooops, wrong from address, resent, sorry for the noise>
Hello Andres,
Although I disagree with some aspects of Andres patch, I'm not a
committer and I'm tired of arguing. I'm just planning to do minor changes
to Andres' version to fix a potential issue if the file is closed while
a flush is in progress, but that will not change the overall shape of it.

Are you working on that aspect?
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.
Also, I gave some thought to what should be done for bgwriter random IOs.
The idea is to implement some per-file sorting there and then do some LRU/LFU
combining. It would not interact much with the checkpointer, so for me the two
issues should be kept separate and this should not preclude changing the
checkpointer, esp. given the significant performance benefit of the patch.
However, all this is still in my stack of things to do, and I did not have much
time in the Fall for that. I may have more time in the coming weeks. I'm fine
if things are updated and performance figures are collected in between, I'll
take it from where it is when I have time, if something remains to be done.
--
Fabien.
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.
There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.
But more importantly: You designed the API to be generic because you
wanted it to be usable for other purposes as well. And for that it
certainly needs to deal with that.
Also, I gave some thought to what should be done for bgwriter random
IOs. The idea is to implement some per-file sorting there and then do some
LRU/LFU combining. It would not interact much with the checkpointer, so for me
the two issues should be kept separate and this should not preclude changing
the checkpointer, esp. given the significant performance benefit of the
patch.
Well, the problem is that the patch significantly regresses some cases
right now. So keeping them separate isn't particularly feasible.
Greetings,
Andres Freund
Hello,
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.

There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.
Indeed. Thanks for this explanation!
But more importantly: You designed the API to be generic because you
wanted it to be usable for other purposes as well. And for that it
certainly needs to deal with that.
Yes, I'm planning to try to do the minimum possible damage to the current
API to fix the issue.
Also, I gave some thought to what should be done for bgwriter random
IOs. The idea is to implement some per-file sorting there and then do some
LRU/LFU combining. It would not interact much with the checkpointer, so for me
the two issues should be kept separate and this should not preclude changing
the checkpointer, esp. given the significant performance benefit of the
patch.

Well, the problem is that the patch significantly regresses some cases
right now. So keeping them separate isn't particularly feasible.
I have not seen significant regressions on my many test runs. In
particular, I would not consider that having a tps dip in cases where
postgresql is doing 0 tps most of the time anyway (ie pg is offline)
because of random IO issues should be a blocker.
As I understood it, the regressions occur when the checkpointer is less
used, i.e. bgwriter is doing most of the writes, but this does not change
much whether the checkpointer sorts buffers or not, and the overall
behavior of pg is very bad anyway in these cases.
Also I think that coupling the two issues is a recipe for never having
anything done in the end and keeping the current awful behavior :-(
The solution on the bgwriter front is somewhat similar to the checkpointer's,
but from a code point of view there is minimal interaction, so I would
really separate them, esp. as the bgwriter part will require extensive
testing and discussions as well.
--
Fabien.
On 2016-01-07 12:50:07 +0100, Fabien COELHO wrote:
But more importantly: You designed the API to be generic because you
wanted it to be usable for other purposes as well. And for that it
certainly needs to deal with that.
Yes, I'm planning to try to do the minimum possible damage to the current
API to fix the issue.
What's your thought there? Afaics it's infeasible to do the flushing at
the fd.c level.
Andres
Yes, I'm planning to try to do the minimum possible damage to the current
API to fix the issue.
What's your thought there? Afaics it's infeasible to do the flushing at
the fd.c level.
I thought of adding a pointer to the current flush structure at the vfd
level, so that on closing a file with a flush in progress the flush can be
done and the structure properly cleaned up, hence later the checkpointer
would see a clean thing and be able to skip it instead of generating
flushes on a closed file or on a different file...
Maybe I'm missing something, but that is the plan I had in mind.
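For concreteness, here is a minimal sketch of that plan, under the assumption
that the flush state lives in a small per-file structure; FlushContext,
FileCloseWithFlush and the field names are hypothetical, not the actual fd.c
API. The vfd keeps a back-pointer to any in-progress flush, and closing the
file first completes and detaches it, so the checkpointer later sees a clean
context and can skip it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical sketch -- not the actual fd.c structures. */
typedef struct FlushContext
{
	int		fd;			/* kernel fd the pending range belongs to */
	off_t	offset;		/* start of the accumulated dirty range */
	off_t	nbytes;		/* length of the accumulated dirty range */
	bool	pending;	/* is there anything left to flush? */
} FlushContext;

typedef struct Vfd
{
	int				fd;			/* kernel fd, or -1 if closed */
	FlushContext   *flush;		/* in-progress flush on this file, or NULL */
	/* ... the real vfd has many more fields ... */
} Vfd;

/* Close a virtual fd, completing any pending flush first. */
static void
FileCloseWithFlush(Vfd *vfdP)
{
	if (vfdP->flush && vfdP->flush->pending)
	{
		/* issue the advisory flush while the fd is still valid */
		sync_file_range(vfdP->fd, vfdP->flush->offset,
						vfdP->flush->nbytes, SYNC_FILE_RANGE_WRITE);
		vfdP->flush->pending = false;
		vfdP->flush->fd = -1;	/* checkpointer will see a clean context */
	}
	vfdP->flush = NULL;
	close(vfdP->fd);
	vfdP->fd = -1;
}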
--
Fabien.
On 2016-01-07 13:07:33 +0100, Fabien COELHO wrote:
Yes, I'm planning to try to do the minimum possible damage to the current
API to fix the issue.
What's your thought there? Afaics it's infeasible to do the flushing at
the fd.c level.
I thought of adding a pointer to the current flush structure at the vfd
level, so that on closing a file with a flush in progress the flush can be
done and the structure properly cleaned up, hence later the checkpointer
would see a clean thing and be able to skip it instead of generating flushes
on a closed file or on a different file...
Maybe I'm missing something, but that is the plan I had in mind.
That might work, although it'd not be pretty (not fatally so
though). But I'm inclined to go a different way: I think it's a mistake
to do flushing based on a single file. It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes. Imo
that means that we'd better track writes on a relfilenode + block number
level.
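To illustrate the scheme (a sketch only: the budget, pending_cmp and the
FlushRelationRange helper are invented for the example, while RelFileNode and
BlockNumber are the usual PostgreSQL types), writes are noted per
(relfilenode, block); once the fixed budget is exceeded the list is sorted,
neighbours merged, and one flush issued per contiguous run:

#include <stdlib.h>
#include <string.h>

#define MAX_PENDING_FLUSHES 32		/* arbitrary fixed budget */

typedef struct PendingFlush
{
	RelFileNode rnode;			/* relation of the dirty block */
	BlockNumber blocknum;		/* dirty block number */
} PendingFlush;

static PendingFlush pending[MAX_PENDING_FLUSHES];
static int	npending = 0;

static int
pending_cmp(const void *a, const void *b)
{
	const PendingFlush *pa = a, *pb = b;
	int	c = memcmp(&pa->rnode, &pb->rnode, sizeof(RelFileNode));

	if (c != 0)
		return c;
	if (pa->blocknum != pb->blocknum)
		return pa->blocknum < pb->blocknum ? -1 : 1;
	return 0;
}

static void
ScheduleBlockFlush(RelFileNode rnode, BlockNumber blocknum)
{
	pending[npending].rnode = rnode;
	pending[npending].blocknum = blocknum;

	if (++npending < MAX_PENDING_FLUSHES)
		return;

	/* budget exceeded: sort by (relfilenode, block), merge neighbours */
	qsort(pending, npending, sizeof(PendingFlush), pending_cmp);
	for (int i = 0; i < npending;)
	{
		int	j = i + 1;

		while (j < npending &&
			   memcmp(&pending[j].rnode, &pending[i].rnode,
					  sizeof(RelFileNode)) == 0 &&
			   pending[j].blocknum == pending[j - 1].blocknum + 1)
			j++;
		/* one advisory flush for blocks [i, j) -- invented helper */
		FlushRelationRange(pending[i].rnode, pending[i].blocknum, j - i);
		i = j;
	}
	npending = 0;
}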
Andres
Hello Andres,
I thought of adding a pointer to the current flush structure at the vfd
level, so that on closing a file with a flush in progress the flush can be
done and the structure properly cleaned up, hence later the checkpointer
would see a clean thing and be able to skip it instead of generating flushes
on a closed file or on a different file...
Maybe I'm missing something, but that is the plan I had in mind.
That might work, although it'd not be pretty (not fatally so
though).
Alas, any solution has to communicate somehow between the API levels, so
it cannot be "pretty", although we should avoid the worst.
But I'm inclined to go a different way: I think it's a mistake to do
flushing based on a single file. It seems better to track a fixed number
of outstanding 'block flushes', independent of the file. Whenever the
number of outstanding blocks is exceeded, sort that list, and flush all
outstanding flush requests after merging neighbouring flushes.
Hmmm. I'm not sure I understand your strategy.
I do not think that flushing without a prior sorting would be effective,
because there is no clear reason why buffers written together would then
be next to each other and thus give sequential write benefits; we would
just get flushed random IO. I tested that and it worked badly.
One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the
thread, so it makes sense to limit this cost, hence the aggregation. This
removed some performance regressions I had in some cases.
Also, the granularity of the buffer flush call is a file + offset + size,
so necessarily it should be done this way (i.e. per file).
Once buffers are sorted per file and offset within file, then written
buffers are as close as possible one after the other, the merging is very
easy to compute (it is done on the fly, no need to keep the list of
buffers for instance), it is optimally effective, and when the
checkpointed file changes then we will never go back to it before the next
checkpoint, so there is no reason not to flush right then.
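That on-the-fly merging can be pictured with a tiny accumulator (a sketch;
FlushAccumulator, note_written_block and flush_range are illustrative names,
not the patch's actual code). Because the checkpointer visits buffers sorted
by file and offset, one (file, start, count) triple is all the state needed:

typedef struct FlushAccumulator
{
	File		file;		/* file currently being accumulated */
	BlockNumber start;		/* first block of the pending range */
	int			count;		/* number of contiguous blocks so far */
} FlushAccumulator;

static void
note_written_block(FlushAccumulator *acc, File file, BlockNumber blk)
{
	if (acc->count > 0 && file == acc->file &&
		blk == acc->start + acc->count)
	{
		acc->count++;			/* extend the contiguous range */
		return;
	}
	/* discontinuity or new file: flush what we have and restart */
	if (acc->count > 0)
		flush_range(acc->file, acc->start, acc->count);	/* invented helper */
	acc->file = file;
	acc->start = blk;
	acc->count = 1;
}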
So basically I do not see a clear positive advantage to your suggestion,
especially when taking into consideration how the checkpointer schedules
its writes:
In effect the checkpointer already works with little bursts of activity
between sleep phases, so that it writes buffers a few at a time, so it may
already work more or less as you expect, but not for the same reason.
The closest strategy I experimented with, which is maybe close to your
suggestion, was to manage a minimum number of buffers to write when awoken
and to change the sleep delay in between, but I had no clear way to choose
values and the experiments I did did not show significant performance
impact when varying these parameters, so I kept that out. If you find a
magic number of buffers which results in consistently better performance,
fine with me, but this is independent of aggregating before or after.
Imo that means that we'd better track writes on a relfilenode + block
number level.
I do not think that it is a better option. Moreover, the current approach
has been proven to be very effective on hundreds of runs, so redoing it
differently for the sake of it does not look like good resource
allocation.
--
Fabien.
On 2016-01-07 16:05:32 +0100, Fabien COELHO wrote:
But I'm inclined to go a different way: I think it's a mistake to do
flushing based on a single file. It seems better to track a fixed number
outstanding 'block flushes', independent of the file. Whenever the number
of outstanding blocks is exceeded, sort that list, and flush all
outstanding flush requests after merging neighbouring flushes.
Hmmm. I'm not sure I understand your strategy.
I do not think that flushing without a prior sorting would be effective,
because there is no clear reason why buffers written together would then be
next to each other and thus give sequential write benefits; we would just get
flushed random IO. I tested that and it worked badly.
Oh, I was thinking of sorting & merging these outstanding flushes. Sorry
for not making that clear.
One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the
thread, so it makes sense to limit this cost, hence the aggregation. This
removed some performance regressions I had in some cases.
FWIW, my tests show that flushing for clean ranges is pretty cheap.
Also, the granularity of the buffer flush call is a file + offset + size, so
necessarily it should be done this way (i.e. per file).
What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.
Once buffers are sorted per file and offset within file, then written
buffers are as close as possible one after the other, the merging is very
easy to compute (it is done on the fly, no need to keep the list of buffers
for instance), it is optimally effective, and when the checkpointed file
changes then we will never go back to it before the next checkpoint, so
there is no reason not to flush right then.
Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces of about the same number of dirty buffers.
So basically I do not see a clear positive advantage to your suggestion,
especially when taking into consideration how the checkpointer schedules
its writes:
I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.
Imo that means that we'd better track writes on a relfilenode + block
number level.I do not think that it is a better option. Moreover, the current approach
has been proven to be very effective on hundreds of runs, so redoing it
differently for the sake of it does not look like good resource allocation.
For a subset of workloads, yes.
Greetings,
Andres Freund
Hello Andres,
One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the
thread, so it makes sense to limit this cost, hence the aggregation. This
removed some performance regressions I had in some cases.
FWIW, my tests show that flushing for clean ranges is pretty cheap.
Yes, I agree that it is quite cheap, but I had a few % tps regressions
in some cases without aggregating, and aggregating was enough to avoid
these small regressions.
Also, the granularity of the buffer flush call is a file + offset + size, so
necessarily it should be done this way (i.e. per file).
What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.
Sure. But the current version is simple, efficient and proven by many
runs, so there should be a very strong argument to justify a significant
benefit to change the approach, and I see no such thing in your arguments.
For me the current approach is optimal for the checkpointer, because it
takes advantage of all available information to perform a better job.
Once buffers are sorted per file and offset within file, then written
buffers are as close as possible one after the other, the merging is very
easy to compute (it is done on the fly, no need to keep the list of buffers
for instance), it is optimally effective, and when the checkpointed file
changes then we will never go back to it before the next checkpoint, so
there is no reason not to flush right then.
Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces of about the same number of dirty buffers.
ISTM that in the version of the patch I sent there was one flushing
structure per tablespace each doing its own flushing on its files, so it
should work the same, only the writing intensity is divided by the number
of tablespaces? Or am I missing something?
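In terms of the accumulator sketched earlier (again with illustrative names),
the per-tablespace organisation would simply be one accumulator per
tablespace, so ranges never mix files from different tablespaces even though
the checkpointer balances its writes across them:

typedef struct TablespaceFlushContext
{
	Oid					tblspc;	/* tablespace this context belongs to */
	FlushAccumulator	acc;	/* pending range within that tablespace */
} TablespaceFlushContext;

static void
note_written_block_ts(TablespaceFlushContext *ctxs, int nctx,
					  Oid tblspc, File file, BlockNumber blk)
{
	/* each write only touches its own tablespace's accumulator */
	for (int i = 0; i < nctx; i++)
	{
		if (ctxs[i].tblspc == tblspc)
		{
			note_written_block(&ctxs[i].acc, file, blk);
			return;
		}
	}
}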
So basically I do not see a clear positive advantage to your suggestion,
especially when taking into consideration how the checkpointer schedules
its writes:
I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.
Hmmm.
ISTM that the requirements are not exactly the same for the bgwriter and
backends vs the checkpointer. The checkpointer has the advantage of being
able to plan its IOs over the long term (volume & time are known...) and the
implementation takes full benefit of this planning by sorting,
scheduling and flushing buffers so as to generate as many sequential
writes as possible.
The bgwriter and backends have a much shorter vision (a few seconds, or
just one query being processed), so the solution will be less efficient and
probably more messy on the coding side. This is life. I do not see why not
to take the benefit of full planning in the checkpointer just because
other processes cannot do the same, especially as under plenty of loads
the checkpointer does most of the writing, so it is the limiting factor.
So I do not buy your suggestion for the checkpointer. Maybe it will be the
way to go for bgwriter and backends, then fine for them.
Imo that means that we'd better track writes on a relfilenode + block
number level.
I do not think that it is a better option. Moreover, the current approach
has been proven to be very effective on hundreds of runs, so redoing it
differently for the sake of it does not look like good resource allocation.
For a subset of workloads, yes.
Hmmm. What I understood is that the workloads that have some performance
regressions (regressions that I have *not* seen in the many tests I ran)
are not due to checkpointer IOs, but rather occur in settings where most of the
writes are done by backends or the bgwriter.
I do not see the point of rewriting the checkpointer for them, although
obviously I agree that something has to be done also for the other
processes.
Maybe if all the writes (bgwriter and checkpointer) were performed by the
same process then some dynamic mixing and sorting and aggregating would
make sense, but this is currently not the case, and would probably have
quite limited effect.
Basically I do not understand how changing the flushing organisation as
you suggest would improve the checkpointer performance significantly, for
me it should only degrade the performance compared to the current version,
as far as the checkpointer is concerned.
--
Fabien.
On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
Hmmm. What I understood is that the workloads that have some performance
regressions (regressions that I have *not* seen in the many tests I ran) are
not due to checkpointer IOs, but rather occur in settings where most of the writes
are done by backends or the bgwriter.
As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
shared_buffers). That quite drastically alters the performance
characteristics here, because you suddenly have lots of synchronous read
IO thrown into the mix.
Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.
I do not see the point of rewriting the checkpointer for them, although
obviously I agree that something has to be done also for the other
processes.
Rewriting the checkpointer and fixing the flush interface in a more
generic way aren't the same thing at all.
Greetings,
Andres Freund
Hello Andres,
Hmmm. What I understood is that the workloads that have some performance
regressions (regressions that I have *not* seen in the many tests I ran) are
not due to checkpointer IOs, but rather occur in settings where most of the writes
are done by backends or the bgwriter.
As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
shared_buffers).
Indeed, I think I ran some, but not many with such characteristics.
That quite drastically alters the performance characteristics here,
because you suddenly have lots of synchronous read IO thrown into the
mix.
If I understand this point correctly...
I would expect the overall performance to be abysmal in such a situation
because you get only intermixed *random* reads and writes: As you point
out, synchronous *random* reads (very slow), but on the write side the
IOs are mostly random as well on the checkpointer side because there is
not much to aggregate to get sequential writes.
Now why would that degrade performance significantly? For me it should
render the sorting/flushing less and less effective, and it would go back
to the previous performance levels...
Or maybe it is only the flushing itself which degrades performance, as you
point out, because then you have some synchronous (synced) writes as well
as reads, as opposed to just the reads without the patch.
If this is indeed the issue, then the solution to avoid the regression is
*not* to flush so that the OS IO scheduler is less constrained in its job,
and can be slightly more effective (well, we are talking about abysmal random
IO disk performance here, so effective would be somewhere between slightly
more or less very very very bad).
Maybe a trick could be not to aggregate and flush when buffers in the same
file are too far apart anyway, for instance, based on some threshold?
This can be implemented locally when deciding to merge buffer flushes or
not, and whether to flush or not, so it would fit the current code quite
simply.
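A sketch of that threshold trick, as a variant of the accumulator shown
earlier (MAX_FLUSH_GAP_BLOCKS is an invented knob): small holes are absorbed
into the range, while an isolated block far from its neighbours is simply not
flushed at all, leaving the OS scheduler free to handle the random IO:

#define MAX_FLUSH_GAP_BLOCKS 16		/* invented threshold */

static void
note_written_block_gap(FlushAccumulator *acc, File file, BlockNumber blk)
{
	BlockNumber	next = acc->start + acc->count;

	if (acc->count > 0 && file == acc->file &&
		blk >= next && blk - next <= MAX_FLUSH_GAP_BLOCKS)
	{
		/* close enough: absorb the small hole into the range */
		acc->count = blk - acc->start + 1;
		return;
	}
	/* only flush ranges worth the syscall; skip isolated blocks */
	if (acc->count > 1)
		flush_range(acc->file, acc->start, acc->count);	/* invented helper */
	acc->file = file;
	acc->start = blk;
	acc->count = 1;
}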
Now my understanding of the sync_file_range call is that it is advice
to flush the stuff, but it is still asynchronous in nature, so whether it
would impact performance that badly depends on the OS IO scheduler. Also,
I would like to check whether, under the "regressed performance" (in tps
terms) that you observed, pg is more or less responsive. It could be that
the average performance is better but pg is offline longer on fsync. In
which case, I would consider it better to have lower tps in such cases
*if* pg responsiveness is significantly improved.
Would you have these measures for the regression runs you observed?
Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.
Ok, that is something to check, to confirm or refute it.
Given the above discussion, I think my suggestion may be wrong: as the tps
is low because of random read/write accesses, not many buffers are
modified (so the bgwriter/backends won't need to make space), the
checkpointer does not have much to write (good), *but* all of it is random
(bad).
I do not see the point of rewriting the checkpointer for them, although
obviously I agree that something has to be done also for the other
processes.
Rewriting the checkpointer and fixing the flush interface in a more
generic way aren't the same thing at all.
Hmmm, probably I misunderstood something in the discussion. It started
with an implementation strategy, but it drifted into discussing a
performance regression. I agree that these are two different subjects.
--
Fabien.
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.
There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.
Won't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way? In particular, I am referring to the code below:
mdsync()
{
	..
	/*
	 * It is possible that the relation has been dropped or
	 * truncated since the fsync request was entered.
	 * Therefore, allow ENOENT, but only if we didn't fail
	 * already on this file. This applies both for
	 * _mdfd_getseg() and for FileSync, since fd.c might have
	 * closed the file behind our back.
	 *
	 * XXX is there any point in allowing more than one retry?
	 * Don't see one at the moment, but easy to change the
	 * test here if so.
	 */
	if (!FILE_POSSIBLY_DELETED(errno) ||
		failures > 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m",
						path)));
	else
		ereport(DEBUG1,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\" but retrying: %m",
						path)));
}
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.
There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.
Won't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way? In particular, I am referring to the code below:
I don't see how that's corresponding - the problem is that the current
proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
open in its 'pending flushes' struct. But since that isn't associated
with fd.c opening/closing files, that fd isn't very meaningful.
mdsync()
That seems to address different issues.
Greetings,
Andres Freund
On Sat, Jan 9, 2016 at 6:08 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
I read your patch and I know what I want to try to have a small and simple
fix. I must admit that I have not really understood in which condition the
checkpointer would decide to close a file, but that does not mean that the
potential issue should not be addressed.
There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.
Won't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way? In particular, I am referring to the code below:
I don't see how that's corresponding - the problem is that the current
proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
open in its 'pending flushes' struct. But since that isn't associated
with fd.c opening/closing files, that fd isn't very meaningful.
Okay, but I think that is the reason why you are worried that it is possible
to issue sync_file_range() on a closed file, is that right or am I missing
something?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
Okay, but I think that is the reason why you are worried that it is possible
to issue sync_file_range() on a closed file, is that right or am I missing
something?
That's one potential issue. You can also fsync a different file, try to
print an error message containing an unallocated filename (that's how I
noticed the issue in the first place)...
I don't think it's going to be acceptable to issue operations on more or
less random fds, even if that operation is hopefully harmless.
Greetings,
Andres Freund
On Sat, Jan 9, 2016 at 6:26 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
Okay, but I think that is the reason why you are worried that it is possible
to issue sync_file_range() on a closed file, is that right or am I missing
something?
That's one potential issue. You can also fsync a different file, try to
print an error message containing an unallocated filename (that's how I
noticed the issue in the first place)...
I don't think it's going to be acceptable to issue operations on more or
less random fds, even if that operation is hopefully harmless.
Right that won't be acceptable, however I think with your latest
proposal [1], we might not need to solve this problem, or do we still
need to address it. I think that idea will help to mitigate the problem of
backend and bgwriter writes as well. In that, can't we do it with the
help of existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*, as already the flush requests are
remembered in those structures, we can use those to apply your idea
to issue flush requests.
[1]: "It seems better to track a fixed number of outstanding 'block
flushes', independent of the file. Whenever the number of outstanding
blocks is exceeded, sort that list, and flush all outstanding flush
requests after merging neighbouring flushes."
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
Right that won't be acceptable, however I think with your latest
proposal [1]
Sure, that'd address that problem.
[...] think that idea will help to mitigate the problem of backend and
bgwriter writes as well. In that, can't we do it with the help of
existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*, as already the flush requests are
remembered in those structures, we can use those to apply your idea to
issue flush requests.
Hm, that might be possible. But that might have some bigger implications
- we currently can issue thousands of flush requests a second, without
much chance of merging. I'm not sure it's a good idea to overlay that
into the lower frequency pendingOpsTable. Backends having to issue
fsyncs because the pending fsync queue is full is darn expensive. In
contrast to that a 'flush hint' request getting lost doesn't cost that
much.
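The asymmetry can be made concrete with a sketch (sizes and names invented,
reusing the PendingFlush entry from the earlier sketch): a flush hint can
live in a fixed-size ring that silently overwrites old entries, because
losing one only costs a missed optimisation, whereas a lost fsync request
would cost durability:

#define FLUSH_HINT_RING_SIZE 256	/* invented size */

static PendingFlush hint_ring[FLUSH_HINT_RING_SIZE];
static unsigned int hint_insert = 0;

/* Lossy by design: on overflow the oldest hint is simply overwritten. */
static void
remember_flush_hint(RelFileNode rnode, BlockNumber blk)
{
	unsigned int slot = hint_insert++ % FLUSH_HINT_RING_SIZE;

	hint_ring[slot].rnode = rnode;
	hint_ring[slot].blocknum = blk;
}

/*
 * An fsync request, in contrast, must never be dropped: when the
 * checkpointer's queue is full the backend has to fall back to fsyncing
 * the file itself -- the expensive case mentioned above.
 */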
Greetings,
Andres Freund
On 2016-01-07 21:17:32 +0100, Andres Freund wrote:
On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
Hmmm. What I understood is that the workloads that have some performance
regressions (regressions that I have *not* seen in the many tests I ran) are
not due to checkpointer IOs, but rather occur in settings where most of the writes
are done by backends or the bgwriter.
As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
shared_buffers). That quite drastically alters the performance
characteristics here, because you suddenly have lots of synchronous read
IO thrown into the mix.
Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.
Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!
Especially on a system that's bottlenecked on IO that means we'll
frequently hold content locks for a noticeable amount of time, while
flushing blocks, without any need to.
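A simplified sketch of the ordering concern (not the real FlushBuffer;
flush_pending_ranges is an invented helper, and FlushContext is the
hypothetical structure from earlier): the write itself must happen under the
buffer's content lock, but the advisory flush has no such requirement and can
be issued after the lock is released:

static void
flush_one_buffer_sketch(SMgrRelation reln, ForkNumber forknum,
						BlockNumber blocknum, char *page,
						LWLock *content_lock, FlushContext *ctx)
{
	/* the write happens while the buffer's content lock is held ... */
	LWLockAcquire(content_lock, LW_SHARED);
	smgrwrite(reln, forknum, blocknum, page, false);
	LWLockRelease(content_lock);

	/*
	 * ... so any flushing done inside smgrwrite() also runs under the
	 * lock.  Issuing it here instead keeps a possibly-blocking
	 * sync_file_range() out of the locked section.
	 */
	flush_pending_ranges(ctx);		/* invented helper */
}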
Even if that's not the reason for the slowdowns I observed, I think this
fact gives further credence to the current "pending flushes" tracking
residing on the wrong level.
Andres
Hello Andres,
Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!
You are worrying that FlushBuffer is holding a lock on a buffer and the
"sync_file_range" call is issued at that moment.
Although I agree that it is not that good, I would be surprised if that were
the explanation for a performance regression, because sync_file_range
with the chosen parameters is an async call: it "advises" the OS to write out
the file, but it does not wait for it to be completed.
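For reference, the flag semantics as I understand them (the patch presumably
uses the first, purely advisory combination; fd, offset and nbytes are
placeholders):

#define _GNU_SOURCE
#include <fcntl.h>

static void
advise_writeout(int fd, off_t offset, off_t nbytes)
{
	/*
	 * Start writeback, do not wait for it to finish.  (It can still
	 * block briefly if the device request queue is full.)
	 */
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

static void
synchronous_writeout(int fd, off_t offset, off_t nbytes)
{
	/* fully synchronous variant, shown only for contrast */
	sync_file_range(fd, offset, nbytes,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
}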
Moreover, for this issue to have a significant impact, it would require
that another backend just happens to need this very buffer, but ISTM that
the performance regression you are arguing about is on random-IO-bound
performance, that is a few hundred tps in the best case, for very large
databases, so a lot of buffers; the probability of such a collision is thus
very small, so it would not explain a significant regression.
Especially on a system that's bottlenecked on IO that means we'll
frequently hold content locks for a noticeable amount of time, while
flushing blocks, without any need to.
I'm not that sure it is really noticeable, because sync_file_range does
not wait for completion.
Even if that's not the reason for the slowdowns I observed, I think this
fact gives further credence to the current "pending flushes" tracking
residing on the wrong level.
ISTM that I put the tracking at the level where the information is
available without having to recompute it several times, as the flush needs
to know the fd and offset. Doing it differently would mean more code and
translating buffers to file/offset several times, I think.
Also, maybe you could answer a question I had about the performance
regression you observed. I could not find the post where you gave the
detailed information about it, so that I could try reproducing it: what
are the exact settings and conditions (shared_buffers, pgbench scaling,
host memory, ...), what is the observed regression (tps? other?), and what
is the responsiveness of the database under the regression (eg % of
seconds with 0 tps for instance, or something like that).
--
Fabien.
On Sat, Jan 9, 2016 at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
Right that won't be acceptable, however I think with your latest
proposal [1]
Sure, that'd address that problem.
[...] think that idea will help to mitigate the problem of backend and
bgwriter writes as well. In that, can't we do it with the help of
existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*, as already the flush requests are
remembered in those structures, we can use those to apply your idea to
issue flush requests.
Hm, that might be possible. But that might have some bigger implications
- we currently can issue thousands of flush requests a second, without
much chance of merging. I'm not sure it's a good idea to overlay that
into the lower frequency pendingOpsTable.
In that case, we can have a unified structure to remember flush requests
rather than the backend and bgwriter noting that information in
CheckpointerShmem and the checkpointer in pendingOpsTable. I understand
there are some benefits of having pendingOpsTable, but having a
common structure seems to be more beneficial and in particular
because it can be used for the purpose of flush hints.
Now, I am sure we can invent a new way of tracking the flush
requests for flush hints, but I think we might want to consider why
can't we have one unified way of tracking the flush requests which
can be used both for *flush* and *flush hints*.
Backends having to issue
fsyncs because the pending fsync queue is full is darn expensive. In
contrast to that a 'flush hint' request getting lost doesn't cost that
much.
In general, I think the cases where backends have to do the fsync themselves
should be rare, as the size of the fsync queue is NBuffers and we take care of
handling duplicate fsync requests for the same buffer.
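A sketch of that duplicate handling (loosely modelled on the compaction the
checkpointer performs when its request queue fills; the struct layout and the
O(n^2) scan are simplifications): only the first occurrence of each request
is kept, so repeated writes to the same segment consume a single slot:

#include <stdbool.h>
#include <string.h>

typedef struct FsyncRequest
{
	RelFileNode rnode;			/* which relation */
	ForkNumber	forknum;		/* which fork */
	BlockNumber	segno;			/* which segment of the relation */
} FsyncRequest;

static int
compact_fsync_requests(FsyncRequest *req, int nreq)
{
	int	nkept = 0;

	for (int i = 0; i < nreq; i++)
	{
		bool	dup = false;

		for (int j = 0; j < nkept; j++)
		{
			if (memcmp(&req[i].rnode, &req[j].rnode, sizeof(RelFileNode)) == 0 &&
				req[i].forknum == req[j].forknum &&
				req[i].segno == req[j].segno)
			{
				dup = true;
				break;
			}
		}
		if (!dup)
			req[nkept++] = req[i];
	}
	return nkept;	/* the real code uses a hash table, not an O(n^2) scan */
}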
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
Hello Andres,
Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!
You are worrying that FlushBuffer is holding a lock on a buffer and the
"sync_file_range" call is issued at that moment.
Although I agree that it is not that good, I would be surprised if that were
the explanation for a performance regression, because sync_file_range
with the chosen parameters is an async call: it "advises" the OS to write out
the file, but it does not wait for it to be completed.
I frequently see sync_file_range blocking - it waits till it could
submit the writes into the io queues. On a system bottlenecked on IO
that's not always possible immediately.
Also, maybe you could answer a question I had about the performance
regression you observed. I could not find the post where you gave the
detailed information about it, so that I could try reproducing it: what are
the exact settings and conditions (shared_buffers, pgbench scaling, host
memory, ...), what is the observed regression (tps? other?), and what is the
responsiveness of the database under the regression (eg % of seconds with 0
tps for instance, or something like that).
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j16 -T 300 -P 1
I get
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)
ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)
My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)
ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)
In neither case are there periods of 0 tps, but both have times of <
1000 tps with noticeably increased latency.
The end results are similar with a sane checkpoint timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer-level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...
As you can see there's roughly a 30% performance regression on the
slower SSD and ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).
My working copy of checkpoint sorting & flushing currently results in:
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)
My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)
My version of the patch currently addresses various points, which need
to be separated and benchmarked separately:
* Different approach to background writer, trying to make backends write
less. While that proves to be beneficial in isolation, on its own that
doesn't address the performance regression.
* Different flushing API, done outside the lock
So this partially addresses the performance problems, but not yet
completely.
Greetings,
Andres Freund
On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
master:
[...]
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)
ckpt-14 (flushing by backends disabled):
[...]
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)
Hm. I think I have an entirely different theory that might explain some
of this. I instrumented lwlocks to check for additional blocking
and found some. Admittedly not exactly where I thought it might
be. Check out what you can observe when adding/enabling an elog in
FlushBuffer() (and the progress printing from BufferSync()):
(sorry, a bit long, but it's necessary to understand)
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 0 of relation base/13000/16387
to_scan: 131141, scanned: 6, %processed: 0.00, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D2FD7E0; write 1F/D296000; flush 1F/D296000; insert: 1F/D33B418
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 2 of relation base/13000/16387
to_scan: 131141, scanned: 7, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D3B2E30; write 1F/D33C000; flush 1F/D33C000; insert: 1F/D403198
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 3 of relation base/13000/16387
to_scan: 131141, scanned: 9, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D469990; write 1F/D402000; flush 1F/D402000; insert: 1F/D4FDD00
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 5 of relation base/13000/16387
to_scan: 131141, scanned: 11, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D5663E8; write 1F/D4FC000; flush 1F/D4FC000; insert: 1F/D5D1390
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 7 of relation base/13000/16387
to_scan: 131141, scanned: 14, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D673700; write 1F/D5D0000; flush 1F/D5D0000; insert: 1F/D687E58
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 10 of relation base/13000/16387
to_scan: 131141, scanned: 15, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG: xlog flush request 1F/D76BEC8; write 1F/D686000; flush 1F/D686000; insert: 1F/D7A83A0
[2016-01-11 20:15:02 CET][14957] CONTEXT: writing block 11 of relation base/13000/16387
to_scan: 131141, scanned: 16, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/D7AE5C0; write 1F/D7A83E8; flush 1F/D7A83E8; insert: 1F/D8B9A88
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 12 of relation base/13000/16387
to_scan: 131141, scanned: 17, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DA08370; write 1F/D963A38; flush 1F/D963A38; insert: 1F/DA0A7D0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 13 of relation base/13000/16387
to_scan: 131141, scanned: 18, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DAC09A0; write 1F/DA92250; flush 1F/DA92250; insert: 1F/DB9AAC8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 14 of relation base/13000/16387
to_scan: 131141, scanned: 21, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DCEFF18; write 1F/DC2AD30; flush 1F/DC2AD30; insert: 1F/DCF25B0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 17 of relation base/13000/16387
to_scan: 131141, scanned: 23, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DD0E9E0; write 1F/DCF25F8; flush 1F/DCF25F8; insert: 1F/DDD6198
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 19 of relation base/13000/16387
to_scan: 131141, scanned: 24, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DED6A20; write 1F/DEC0358; flush 1F/DEC0358; insert: 1F/DFB64C8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 20 of relation base/13000/16387
to_scan: 131141, scanned: 25, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/DFDEE90; write 1F/DFB6560; flush 1F/DFB6560; insert: 1F/E073468
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 21 of relation base/13000/16387
to_scan: 131141, scanned: 26, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E295638; write 1F/E10B9F8; flush 1F/E10B9F8; insert: 1F/E2B40E0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 22 of relation base/13000/16387
to_scan: 131141, scanned: 27, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E381688; write 1F/E354BC0; flush 1F/E354BC0; insert: 1F/E459598
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 23 of relation base/13000/16387
to_scan: 131141, scanned: 28, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E56EF70; write 1F/E4C0C98; flush 1F/E4C0C98; insert: 1F/E56F200
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 24 of relation base/13000/16387
to_scan: 131141, scanned: 29, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E67E538; write 1F/E5DC440; flush 1F/E5DC440; insert: 1F/E6F7FF8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 25 of relation base/13000/16387
to_scan: 131141, scanned: 31, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E873DD8; write 1F/E7D81F0; flush 1F/E7D81F0; insert: 1F/E8A1710
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 27 of relation base/13000/16387
to_scan: 131141, scanned: 33, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/E9E3948; write 1F/E979610; flush 1F/E979610; insert: 1F/EA27AC0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 29 of relation base/13000/16387
to_scan: 131141, scanned: 35, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/EABDDC8; write 1F/EA6DFE0; flush 1F/EA6DFE0; insert: 1F/EB10728
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 31 of relation base/13000/16387
to_scan: 131141, scanned: 37, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/EC07328; write 1F/EBAABE0; flush 1F/EBAABE0; insert: 1F/EC9B8A8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 33 of relation base/13000/16387
to_scan: 131141, scanned: 40, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/ED18FF8; write 1F/EC9B8A8; flush 1F/EC9B8A8; insert: 1F/ED8C2F8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 36 of relation base/13000/16387
to_scan: 131141, scanned: 41, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/EEED640; write 1F/EE0BAD8; flush 1F/EE0BAD8; insert: 1F/EF35EA8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 37 of relation base/13000/16387
to_scan: 131141, scanned: 42, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/EFF20B8; write 1F/EFAAE20; flush 1F/EFAAE20; insert: 1F/F06FAC0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 38 of relation base/13000/16387
to_scan: 131141, scanned: 43, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F1430B0; write 1F/F0DEAB8; flush 1F/F0DEAB8; insert: 1F/F265020
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 39 of relation base/13000/16387
to_scan: 131141, scanned: 45, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F3556C0; write 1F/F268F68; flush 1F/F268F68; insert: 1F/F3682B8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 41 of relation base/13000/16387
to_scan: 131141, scanned: 46, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F5005F8; write 1F/F4376F8; flush 1F/F4376F8; insert: 1F/F523838
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 42 of relation base/13000/16387
to_scan: 131141, scanned: 47, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F6261C0; write 1F/F5A07A0; flush 1F/F5A07A0; insert: 1F/F691288
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 43 of relation base/13000/16387
to_scan: 131141, scanned: 48, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F7CBCD0; write 1F/F719020; flush 1F/F719020; insert: 1F/F80DBB0
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 44 of relation base/13000/16387
to_scan: 131141, scanned: 49, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/F9359C8; write 1F/F874CB8; flush 1F/F874CB8; insert: 1F/F95AD58
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 45 of relation base/13000/16387
to_scan: 131141, scanned: 50, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/FA33F38; write 1F/FA03490; flush 1F/FA03490; insert: 1F/FAD4DF8
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 46 of relation base/13000/16387
to_scan: 131141, scanned: 51, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/FBDBCD8; write 1F/FB52238; flush 1F/FB52238; insert: 1F/FC54E68
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 47 of relation base/13000/16387
to_scan: 131141, scanned: 52, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/FD74B60; write 1F/FD10360; flush 1F/FD10360; insert: 1F/FDB6A88
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 48 of relation base/13000/16387
to_scan: 131141, scanned: 53, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/FE4FF60; write 1F/FDB6AD0; flush 1F/FDB6AD0; insert: 1F/FE90028
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 49 of relation base/13000/16387
to_scan: 131141, scanned: 54, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG: xlog flush request 1F/FFD6A78; write 1F/FF223F0; flush 1F/FF223F0; insert: 1F/10022F70
[2016-01-11 20:15:03 CET][14957] CONTEXT: writing block 50 of relation base/13000/16387
to_scan: 131141, scanned: 55, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10144C98; write 1F/10023000; flush 1F/10023000; insert: 1F/10157730
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 51 of relation base/13000/16387
to_scan: 131141, scanned: 58, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/102AA468; write 1F/1020C600; flush 1F/1020C600; insert: 1F/102C73F0
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 54 of relation base/13000/16387
to_scan: 131141, scanned: 60, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10313470; write 1F/102C7460; flush 1F/102C7460; insert: 1F/103D4F38
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 56 of relation base/13000/16387
to_scan: 131141, scanned: 61, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10510CE8; write 1F/104562F0; flush 1F/104562F0; insert: 1F/105171E8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 57 of relation base/13000/16387
to_scan: 131141, scanned: 62, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10596B18; write 1F/105191B0; flush 1F/105191B0; insert: 1F/106076F8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 58 of relation base/13000/16387
to_scan: 131141, scanned: 63, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1073FB28; write 1F/10693638; flush 1F/10693638; insert: 1F/10787D40
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 59 of relation base/13000/16387
to_scan: 131141, scanned: 64, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1088D058; write 1F/107F7068; flush 1F/107F7068; insert: 1F/10920EA0
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 60 of relation base/13000/16387
to_scan: 131141, scanned: 67, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/109D9158; write 1F/109A8458; flush 1F/109A8458; insert: 1F/10A8A240
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 63 of relation base/13000/16387
to_scan: 131141, scanned: 68, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10BDAA38; write 1F/10B2AD48; flush 1F/10B2AD48; insert: 1F/10C16768
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 64 of relation base/13000/16387
to_scan: 131141, scanned: 69, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10D824D0; write 1F/10C859A0; flush 1F/10C859A0; insert: 1F/10DCC860
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 65 of relation base/13000/16387
to_scan: 131141, scanned: 70, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10E24CD8; write 1F/10DCC8A8; flush 1F/10DCC8A8; insert: 1F/10EA8588
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 66 of relation base/13000/16387
to_scan: 131141, scanned: 71, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/10FD3E90; write 1F/10F57530; flush 1F/10F57530; insert: 1F/11043A58
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 67 of relation base/13000/16387
to_scan: 131141, scanned: 72, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/111CE4A0; write 1F/11043AC8; flush 1F/11043AC8; insert: 1F/111ED470
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 68 of relation base/13000/16387
to_scan: 131141, scanned: 73, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11338080; write 1F/112917C8; flush 1F/112917C8; insert: 1F/1135CF80
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 69 of relation base/13000/16387
to_scan: 131141, scanned: 76, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11369068; write 1F/1135CF80; flush 1F/1135CF80; insert: 1F/1140BE88
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 72 of relation base/13000/16387
to_scan: 131141, scanned: 77, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1146A420; write 1F/1136E000; flush 1F/1136E000; insert: 1F/11483530
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 73 of relation base/13000/16387
to_scan: 131141, scanned: 78, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1157B800; write 1F/11483530; flush 1F/11483530; insert: 1F/11583E20
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 74 of relation base/13000/16387
to_scan: 131141, scanned: 79, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/116368C0; write 1F/11583E20; flush 1F/11583E20; insert: 1F/116661A8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 75 of relation base/13000/16387
to_scan: 131141, scanned: 81, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/116FC598; write 1F/11668178; flush 1F/11668178; insert: 1F/11716758
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 0 of relation base/13000/16393
to_scan: 131141, scanned: 82, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/117DA658; write 1F/117631F0; flush 1F/117631F0; insert: 1F/118206F0
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 1 of relation base/13000/16393
to_scan: 131141, scanned: 83, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11956320; write 1F/118E96B8; flush 1F/118E96B8; insert: 1F/1196F000
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 2 of relation base/13000/16393
to_scan: 131141, scanned: 84, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11A09B00; write 1F/1196F090; flush 1F/1196F090; insert: 1F/11A23D38
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 3 of relation base/13000/16393
to_scan: 131141, scanned: 85, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11B43C80; write 1F/11AB2148; flush 1F/11AB2148; insert: 1F/11B502D8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 4 of relation base/13000/16393
to_scan: 131141, scanned: 86, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11BE2610; write 1F/11B503B8; flush 1F/11B503B8; insert: 1F/11BF9068
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 5 of relation base/13000/16393
to_scan: 131141, scanned: 87, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11CB9FD8; write 1F/11BF9168; flush 1F/11BF9168; insert: 1F/11CBE1F8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 6 of relation base/13000/16393
to_scan: 131141, scanned: 88, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11D24E10; write 1F/11CBE268; flush 1F/11CBE268; insert: 1F/11D8BC18
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 7 of relation base/13000/16393
to_scan: 131141, scanned: 89, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11E9B070; write 1F/11DEC840; flush 1F/11DEC840; insert: 1F/11EB7EC0
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 8 of relation base/13000/16393
to_scan: 131141, scanned: 90, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/11F5C3F0; write 1F/11F3FBD0; flush 1F/11F3FBD0; insert: 1F/11FE1A08
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 9 of relation base/13000/16393
to_scan: 131141, scanned: 91, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/121EDC00; write 1F/1208E838; flush 1F/1208E838; insert: 1F/121F1EF8
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 10 of relation base/13000/16393
to_scan: 131141, scanned: 92, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/122E0A70; write 1F/121F1F90; flush 1F/121F1F90; insert: 1F/122E9198
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 11 of relation base/13000/16393
to_scan: 131141, scanned: 93, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1243B698; write 1F/123A7EC8; flush 1F/123A7EC8; insert: 1F/1245E620
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 12 of relation base/13000/16393
to_scan: 131141, scanned: 94, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/1258E7B0; write 1F/124BF6B8; flush 1F/124BF6B8; insert: 1F/1259F198
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 13 of relation base/13000/16393
to_scan: 131141, scanned: 95, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/126C8E38; write 1F/12662BA0; flush 1F/12662BA0; insert: 1F/126FE690
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 14 of relation base/13000/16393
to_scan: 131141, scanned: 96, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/127DE810; write 1F/126FE6D8; flush 1F/126FE6D8; insert: 1F/128081B0
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 15 of relation base/13000/16393
to_scan: 131141, scanned: 97, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG: xlog flush request 1F/12980108; write 1F/128A6000; flush 1F/128A6000; insert: 1F/129A8E00
[2016-01-11 20:15:04 CET][14957] CONTEXT: writing block 16 of relation base/13000/16393
to_scan: 131141, scanned: 98, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG: xlog flush request 1F/12A55978; write 1F/129ACDB8; flush 1F/129ACDB8; insert: 1F/12A6A408
[2016-01-11 20:15:05 CET][14957] CONTEXT: writing block 17 of relation base/13000/16393
to_scan: 131141, scanned: 99, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG: xlog flush request 1F/12BC1148; write 1F/12B12F40; flush 1F/12B12F40; insert: 1F/12BC15F8
[2016-01-11 20:15:05 CET][14957] CONTEXT: writing block 18 of relation base/13000/16393
to_scan: 131141, scanned: 100, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG: xlog flush request 1F/12D36E20; write 1F/12C70120; flush 1F/12C70120; insert: 1F/12D4DC08
[2016-01-11 20:15:05 CET][14957] CONTEXT: writing block 19 of relation base/13000/16393
to_scan: 131141, scanned: 9892, %processed: 7.54, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG: xlog flush request 1F/13128AF8; write 1F/12DEE670; flush 1F/12DEE670; insert: 1F/1313B7D0
[2016-01-11 20:15:05 CET][14957] CONTEXT: writing block 101960 of relation base/13000/16396
to_scan: 131141, scanned: 18221, %processed: 13.89, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG: xlog flush request 1F/13276328; write 1F/1313A000; flush 1F/1313A000; insert: 1F/134E93A8
[2016-01-11 20:15:05 CET][14957] CONTEXT: writing block 188242 of relation base/13000/16396
to_scan: 131141, scanned: 25857, %processed: 19.72, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG: xlog flush request 1F/13497370; write 1F/1346E000; flush 1F/1346E000; insert: 1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT: writing block 267003 of relation base/13000/16396
to_scan: 131141, scanned: 26859, %processed: 20.48, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG: xlog flush request 1F/136B5BB0; write 1F/135D6000; flush 1F/135D6000; insert: 1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT: writing block 277621 of relation base/13000/16396
to_scan: 131141, scanned: 27582, %processed: 21.03, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG: xlog flush request 1F/138C6C38; write 1F/1375E900; flush 1F/1375E900; insert: 1F/138D5518
[2016-01-11 20:15:06 CET][14957] CONTEXT: writing block 285176 of relation base/13000/16396
to_scan: 131141, scanned: 28943, %processed: 22.07, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG: xlog flush request 1F/13A5B768; write 1F/138C8000; flush 1F/138C8000; insert: 1F/13AB61D0
[2016-01-11 20:15:06 CET][14957] CONTEXT: writing block 300007 of relation base/13000/16396
to_scan: 131141, scanned: 36181, %processed: 27.59, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG: xlog flush request 1F/13C320C8; write 1F/13A8A000; flush 1F/13A8A000; insert: 1F/13DAAB40
[2016-01-11 20:15:06 CET][14957] CONTEXT: writing block 375983 of relation base/13000/16396
to_scan: 131141, scanned: 40044, %processed: 30.54, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG: xlog flush request 1F/13E196C8; write 1F/13CBA000; flush 1F/13CBA000; insert: 1F/13F9E6D8
[2016-01-11 20:15:07 CET][14957] CONTEXT: writing block 416439 of relation base/13000/16396
to_scan: 131141, scanned: 48250, %processed: 36.79, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG: xlog flush request 1F/143F6160; write 1F/13EE8000; flush 1F/13EE8000; insert: 1F/1461BB08
You can see that initially every buffer triggers a WAL flush. That
causes a slowdown because a) we're doing significantly more WAL flushes
in that time period, slowing down both concurrent IO and concurrent WAL
insertions, and b) due to the many slow flushes we get behind on the
checkpoint schedule, triggering a rapid-fire period of writes
afterwards.
My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used
if there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a
row". Indeed, the behaviour is not visible in a significant manner when
using pgbench -N, where there are far fewer updated pages.
I'm not entirely sure how we can deal with that.
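To make the mechanism concrete, here is a stand-alone C sketch (a hypothetical
illustration, not code from the patch or the backend): under the WAL-before-data
rule a dirty page may only be written once WAL is flushed past that page's LSN,
so a sorted pass over freshly re-dirtied early pages finds the flushed position
behind it at every step:

#include <stdio.h>

typedef unsigned long long XLogRecPtr;

int main(void)
{
    /* Early pages of a hot relation: an update-heavy run keeps
       re-dirtying them, so once sorted by block number their LSNs
       increase monotonically. */
    XLogRecPtr page_lsn[] = {100, 105, 110, 115, 120, 125, 130, 135};
    int npages = sizeof(page_lsn) / sizeof(page_lsn[0]);
    XLogRecPtr flushed_lsn = 90;    /* WAL flushed so far */
    int flushes = 0;

    for (int i = 0; i < npages; i++)
    {
        /* WAL-before-data: the page may only be written once WAL is
           flushed up to its LSN (the "xlog flush request" in the log). */
        if (page_lsn[i] > flushed_lsn)
        {
            flushed_lsn = page_lsn[i];
            flushes++;
        }
        /* ... hand the data page to the OS here ... */
    }
    printf("%d page writes forced %d WAL flushes\n", npages, flushes);
    return 0;
}

With unsorted writes the same pages would be interleaved with older ones, and a
single flush would often cover many subsequent pages.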
Greetings,
Andres Freund
On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:
My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used
if there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a
row".
Not sure what is the best way to tackle this problem, but I think one way
could be to perform the sorting at the flush-request level rather than
before writing to the OS buffers.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:
My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used
if there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a
row".
Not sure what is the best way to tackle this problem, but I think one way
could be to perform the sorting at the flush-request level rather than
before writing to the OS buffers.
I'm not following. If you just sort a couple hundred more or less random
buffers - which is what you get if you look in buf_id order through
shared_buffers - the likelihood of actually finding neighbouring writes
is pretty low.
Hello Andres,
Thanks for the details. Many comments and some questions below.
Also, maybe you could answer a question I had about the performance
regression you observed. I could not find the post where you gave the
detailed information about it, so that I could try reproducing it: what are
the exact settings and conditions (shared_buffers, pgbench scaling, host
memory, ...), what is the observed regression (tps? other?), and what is the
responsiveness of the database under the regression (e.g. % of seconds with 0
tps, or something like that)?
I measured it in a number of different cases, both on SSDs
and spinning rust.
Argh! This is a key point: the sort/flush is designed to help HDDs, and
would have limited effect on SSDs, and it seems that you are showing that
the effect is in fact negative on SSDs, too bad:-(
The bad news is that I do not have a host with an SSD available for
reproducing such results.
On SSDs, the Linux IO scheduler works quite well, so this is a place where
I would consider simply deactivating flushing and/or sorting.
ISTM that I would rather update the documentation to "do not activate on
SSD" than try to find a miraculous solution which may or may not exist.
Basically I would use your results to give better advice in the
documentation, not as a motivation to rewrite the patch from scratch.
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
I'm not sure I like this one. I guess the intention is to focus on
checkpointer writes and reduce the impact of WAL writes. Why not.
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
That is a very short one, but the point is to exercise the checkpoint, so
why not.
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
The DB is probably about 12GB, so it fits in memory in the end, meaning
that there should be only write activity after some time? So this is not
really the case where it does not fit in memory, but it is large enough to
get mostly random IOs both in read & write, so why not.
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
Assuming one buffer accessed per transaction on average, and considering a
uniform random distribution, this means about 50% of pages actually loaded
in memory at the end of the run: 1 - e^(-1155733/(800*2048)), with 2048
pages per scale unit.
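Spelling that estimate out (assuming each of the T = 1155733 transactions
touches one page chosen uniformly among the N = 800 * 2048 = 1638400 pages):

P(page loaded) = 1 - e^(-T/N) = 1 - e^(-1155733/1638400) ~= 1 - e^(-0.705) ~= 0.51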
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)
ckpt-14 (flushing by backends disabled):
Is this comment referring to "synchronous_commit = off"?
I guess this is the same on master above, even if not written?
[...] In neither case are there periods of 0 tps, but both have times of
1000 tps with noticeably increased latency.
Ok, but we are talking SSDs, things are not too bad, even if there are ups
and downs.
The end results are similar with a sane checkpoint timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...
Indeed. It wears out and costs, too bad:-(
As you can see there's roughly a 30% performance regression on the
slower SSD and a ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).
Ok, that is what I would have expected: the larger the database, the
smaller the impact of sorting & flushing on SSDs. Now I would have hoped
that flushing would help get a more constant load even in this case, at
least this is what I measured in my tests. The test closest to your setting
that I ran is scale=660, and the sort/flush got 400 tps vs 100 tps
without, with 30 minutes checkpoints, but HDDs do not compare to SSDs...
My overall comment about this SSD regression is that the patch is really
designed to make a difference for HDDs, so I would advise not activating it
on SSDs if there is a regression in such a case.
Now this is a little disappointing as on paper sorted writes should also
be slightly better on SSDs, but if the bench says the contrary, I have to
believe the bench:-)
--
Fabien.
On Tue, Jan 12, 2016 at 5:52 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de>
wrote:
My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used
if there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a
row".
Not sure what is the best way to tackle this problem, but I think one way
could be to perform the sorting at the flush-request level rather than
before writing to the OS buffers.
I'm not following. If you just sort a couple hundred more or less random
buffers - which is what you get if you look in buf_id order through
shared_buffers - the likelihood of actually finding neighbouring writes
is pretty low.
Why can't we do it at larger intervals (relative to total amount of writes)?
To explain what I have in mind: let us assume that the checkpoint interval
is longer (10 mins) and in the meantime all the writes are being done
by bgwriter, which registers them in shared memory so that a later checkpoint
can perform the corresponding fsync's. Now, when the request queue
reaches a threshold size (let us say 1/3rd full), we can perform
sorting and merging and issue flush hints. The checkpointer task can
also follow a somewhat similar technique, which means that once it
has written 1/3rd or so of the buffers (which we need to track), it can
issue flush hints after sort+merge. Now, I think we could also
do it in the checkpointer alone rather than in both bgwriter and checkpointer.
Basically, I think this can lead to less merging of neighbouring
writes, but it might not hurt if the sync_file_range() API is cheap.
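For illustration, that sort+merge idea might look roughly like this Linux-only
sketch (hypothetical names and data structures, not code from the patch): sort
the pending block writes, coalesce contiguous runs, and issue one
sync_file_range() writeback hint per run:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

#define BLCKSZ 8192

typedef struct
{
    int fd;                 /* relation segment file */
    unsigned int blkno;     /* block number within that file */
} FlushRequest;

static int cmp_req(const void *a, const void *b)
{
    const FlushRequest *x = a, *y = b;
    if (x->fd != y->fd)
        return x->fd < y->fd ? -1 : 1;
    if (x->blkno != y->blkno)
        return x->blkno < y->blkno ? -1 : 1;
    return 0;
}

/* Sort pending requests, then issue one writeback hint per contiguous
   run instead of one hint per block. */
static void issue_flush_hints(FlushRequest *req, int n)
{
    qsort(req, n, sizeof(FlushRequest), cmp_req);
    for (int i = 0; i < n;)
    {
        int j = i;
        while (j + 1 < n && req[j + 1].fd == req[i].fd &&
               req[j + 1].blkno == req[j].blkno + 1)
            j++;            /* extend the run over adjacent blocks */
        sync_file_range(req[i].fd,
                        (off_t) req[i].blkno * BLCKSZ,
                        (off_t) (j - i + 1) * BLCKSZ,
                        SYNC_FILE_RANGE_WRITE);  /* start writeback, don't wait */
        i = j + 1;
    }
}

Whether batching only 1/3rd of the queue at a time loses too many merge
opportunities is exactly the question debated below.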
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-12 13:54:21 +0100, Fabien COELHO wrote:
I measured it in a number of different cases, both on SSDs
and spinning rust.
Argh! This is a key point: the sort/flush is designed to help HDDs, and
would have limited effect on SSDs, and it seems that you are showing that
the effect is in fact negative on SSDs, too bad:-(
As you quoted, I could reproduce the slowdown both with SSDs *and* with
rotating disks.
On SSDs, the Linux IO scheduler works quite well, so this is a place where I
would consider simply deactivating flushing and/or sorting.
Not my experience. In different scenarios, primarily with a large
shared_buffers fitting the whole hot working set, the patch
significantly improves performance.
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
I'm not sure I like this one. I guess the intention is to focus on
checkpointer writes and reduce the impact of WAL writes. Why not.
Not sure what you mean? s_c = off is *very* frequent in the field.
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
The DB is probably about 12GB, so it fits in memory in the end, meaning that
there should be only write activity after some time? So this is not really
the case where it does not fit in memory, but it is large enough to get
mostly random IOs both in read & write, so why not.
Doesn't really fit into RAM - shared_buffers uses some space (which will
be double buffered) and the xlog will use some more.
ckpt-14 (flushing by backends disabled):
Is this comment refering to "synchronous_commit = off"?
I guess this is the same on master above, even if not written?
No, what I mean by that is that I didn't activate flushing writes in
backends - something I found hugely effective in reducing jitter in a
number of workloads, but doesn't help throughput.
As you can see there's roughly a 30% performance regression on the
slower SSD and a ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).
Ok, that is what I would have expected: the larger the database, the smaller
the impact of sorting & flushing on SSDs.
Again: "HDD results are similar". I primarily tested on a raid10
of 4 disks, and a raid0 of 20 disks.
Greetings,
Andres Freund
On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
Why can't we do it at larger intervals (relative to total amount of writes)?
To explain, what I have in mind, let us assume that checkpoint interval
is longer (10 mins) and in the mean time all the writes are being done
by bgwriter
But that's not the scenario with the regression here, so I'm not sure
why you're bringing it up?
And if we're flushing a significant portion of the writes, how does that
avoid the performance problem pointed out two messages upthread, where
sorting leads to flushing highly contended buffers together, leading to
excessive WAL flushing?
But more importantly, unless you also want to delay the writes
themselves, leaving that many dirty buffers in the kernel page cache
will bring back exactly the type of stalls (where the kernel flushes all
the pending dirty data in a short amount of time) we're trying to avoid
with the forced flushing. So doing flushes in large batches is
something we really fundamentally do *not* want!
which it registers in shared memory so that a later checkpoint
can perform the corresponding fsync's, now when the request queue
reaches a threshold size (let us say 1/3rd full), then we can perform
sorting and merging and issue flush hints.
Which means that a significant portion of the writes won't be able to be
collapsed, since only a random 1/3 of the buffers is sorted together.
Basically, I think this can lead to less merging of neighbouring
writes, but it might not hurt if the sync_file_range() API is cheap.
The cost of writing out data does correspond heavily with the number of
random writes - which is what you get if you reduce the number of
neighbouring writes.
Greetings,
Andres Freund
On Tue, Jan 12, 2016 at 7:24 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
Why can't we do it at larger intervals (relative to total amount of
writes)?
To explain what I have in mind: let us assume that the checkpoint interval
is longer (10 mins) and in the meantime all the writes are being done
by bgwriter.
But that's not the scenario with the regression here, so I'm not sure
why you're bringing it up?
And if we're flushing a significant portion of the writes, how does that
avoid the performance problem pointed out two messages upthread, where
sorting leads to flushing highly contended buffers together, leading to
excessive WAL flushing?
I think it will avoid that problem, because what I am suggesting is not to
sort the buffers before writing, but rather to sort the flush requests. If I
remember correctly, the initial patch from Fabien doesn't have sorting at the
buffer level, but still he is able to see the benefits in many cases.
But more importantly, unless you also want to delay the writes
themselves, leaving that many dirty buffers in the kernel page cache
will bring back exactly the type of stalls (where the kernel flushes all
the pending dirty data in a short amount of time) we're trying to avoid
with the forced flushing. So doing flushes in large batches is
something we really fundamentally do *not* want!
Could it be because of random I/O?
which it registers in shared memory so that a later checkpoint
can perform the corresponding fsync's, now when the request queue
reaches a threshold size (let us say 1/3rd full), then we can perform
sorting and merging and issue flush hints.
Which means that a significant portion of the writes won't be able to be
collapsed, since only a random 1/3 of the buffers is sorted together.
Basically, I think this can lead to less merging of neighbouring
writes, but it might not hurt if the sync_file_range() API is cheap.
The cost of writing out data does correspond heavily with the number of
random writes - which is what you get if you reduce the number of
neighbouring writes.
Yeah, that's right, but I am not sure how much difference it would
make if we sort everything in one shot versus doing it in
batches. In any case, I am just trying to think out loud to see if we
can find some solution to the regression you have seen above
without disabling sorting altogether for certain cases.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Andres,
Argh! This is a key point: the sort/flush is designed to help HDDs, and
would have limited effect on SSDs, and it seems that you are showing that
the effect is in fact negative on SSDs, too bad:-(
As you quoted, I could reproduce the slowdown both with SSDs *and* with
rotating disks.
Ok, once again I misunderstood. So you have a regression on HDD with the
settings you pointed out, I can try that.
On SSDs, the Linux IO scheduler works quite well, so this is a place where I
would consider simply deactivating flushing and/or sorting.
Not my experience. In different scenarios, primarily with a large
shared_buffers fitting the whole hot working set, the patch
significantly improves performance.
Good! That would be what I expected, but I have no way to test that.
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
I'm not sure I like this one. I guess the intention is to focus on
checkpointer writes and reduce the impact of WAL writes. Why not.
Not sure what you mean? s_c = off is *very* frequent in the field.
Too bad, because for me it is really deactivating the D of ACID...
I think that this setting would not issue the "sync" calls on the WAL
file, which means that the impact of WAL writing is somehow reduced and
random writes (more or less one for each transaction) are switched to
sequential writes by the IO scheduler.
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
The DB is probably about 12GB, so it fits in memory in the end, meaning that
there should be only write activity after some time? So this is not really
the case where it does not fit in memory, but it is large enough to get
mostly random IOs both in read & write, so why not.
Doesn't really fit into RAM - shared_buffers uses some space (which will
be double buffered) and the xlog will use some more.
Hmmm. My understanding is that you are really using about 6GB of shared
buffer data in a run, plus some write-only stuff...
xlog is flushed/synced constantly and never read again; I would be surprised
if it had a significant memory impact.
ckpt-14 (flushing by backends disabled):
Is this comment referring to "synchronous_commit = off"?
I guess this is the same on master above, even if not written?
No, what I mean by that is that I didn't activate flushing writes in
backends -
I'm not sure that I understand. What is the actual corresponding directive
in the configuration file?
As you can see there's roughly a 30% performance regression on the
slower SSD and a ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).
Ok, that is what I would have expected: the larger the database, the smaller
the impact of sorting & flushing on SSDs.
Again: "HDD results are similar". I primarily tested on a raid10
of 4 disks, and a raid0 of 20 disks.
I guess similar but with a much lower tps. Anyway I can try that.
--
Fabien.
Hi Fabien,
On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
What kernel, filesystem and filesystem option did you measure with?
I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.
Reading kernel sources trying to understand some more of the performance
impact.
Greetings,
Andres Freund
Hi Fabien,
Hello Tomas.
On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
What kernel, filesystem and filesystem option did you measure with?
Andres did these measures, not me, so I do not know.
I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.
These are very interesting tests; I'm looking forward to having a look at
the results.
The fact that these options change performance is expected. Personally, the
test I submitted on the thread used ext4 with default mount options plus
"relatime".
If I had a choice, I would tend to take the safest options, because the
point of a database is to keep data safe. That's why I'm not found of the
"synchronous_commit=off" chosen above.
Reading kernel sources trying to understand some more of the performance
impact.
Wow!
--
Fabien.
Hello Andres,
Hello Tomas.
Ooops, sorry Andres, I mixed up the thread in my head so was not clear who
was asking the questions to whom.
I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.
These are very interesting tests; I'm looking forward to having a look at the
results.
The fact that these options change performance is expected. Personally, the
test I submitted on the thread used ext4 with default mount options plus
"relatime".
I confirm that: nothing special but "relatime" on ext4 on my test host.
If I had a choice, I would tend to take the safest options, because the point
of a database is to keep data safe. That's why I'm not found of the
"synchronous_commit=off" chosen above.
"found" -> "fond". I confirm this opinion. If you have BBU on you
disk/raid system probably playing with some of these options is safe,
though. Not the case with my basic hardware.
--
Fabien.
Hello Andres,
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1
I'm running some tests similar to those above...
Do you do some warmup when testing? I guess the answer is "no".
I understand that you have 8 cores/16 threads on your host?
Loading scale 800 data for 300-second tests takes much more than 300
seconds (init takes ~360 seconds, vacuum & index are slow). With 30-second
checkpoint cycles and without any warmup, I feel that these tests
are really on the very short (too short) side, so I'm not sure how much I
can trust such results as significant. The data I reported were with more
real-life-like parameters.
Anyway, I'll have some results to show with a setting more or less similar
to yours.
--
Fabien.
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
Hello Andres,
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1
So, I've analyzed the problem further, and I think I found something
rather interesting. I'd profiled the kernel looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.
So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed: 2995 tps
After I kill the wal writer with -STOP: 10887 tps
Stracing the wal writer shows:
17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3) = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3) = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0 "..., 98304) = 98304
17:29:02.008822 fdatasync(3) = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3) = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0 "..., 73728) = 73728
17:29:02.019355 fdatasync(3) = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0
I.e. we're fdatasync()ing small amounts of pages, roughly 500 times a
second. As soon as the wal writer is stopped, it's much bigger chunks,
on the order of 50-130 pages. And, not that surprisingly, that improves
performance, because there are far fewer cache flushes submitted to the
hardware.
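A stand-alone micro-benchmark sketch of that effect (hypothetical file name and
sizes, Linux/POSIX only): syncing the same amount of data as many small
write()+fdatasync() pairs versus fewer large ones:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write 'total' bytes as 'chunk'-sized write()+fdatasync() pairs and
   return the elapsed wall-clock time in seconds. */
static double sync_in_chunks(int fd, const char *buf, size_t total, size_t chunk)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t done = 0; done < total; done += chunk)
    {
        if (write(fd, buf, chunk) != (ssize_t) chunk)
            return -1;
        fdatasync(fd);      /* one cache flush per chunk */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    static char buf[1 << 20];
    int fd;

    memset(buf, 'x', sizeof(buf));
    fd = open("waltest.bin", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 1;
    /* 16 MB synced 8 kB at a time, then 16 MB synced 1 MB at a time */
    printf("8kB chunks: %.2f s\n", sync_in_chunks(fd, buf, 16 << 20, 8192));
    printf("1MB chunks: %.2f s\n", sync_in_chunks(fd, buf, 16 << 20, 1 << 20));
    close(fd);
    unlink("waltest.bin");
    return 0;
}

On rotating disks or cheap SSDs the small-chunk variant is typically far
slower, for the reason given above: each fdatasync() is a full cache flush
submitted to the hardware.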
I'm running some tests similar to those above...
Do you do some warmup when testing? I guess the answer is "no".
Doesn't make a difference here, I tried both. As long as before/after
benchmarks start from the same state...
I understand that you have 8 cores/16 threads on your host?
On one of them, 4 cores/8 threads on the laptop.
Loading scale 800 data for 300 seconds tests takes much more than 300
seconds (init takes ~360 seconds, vacuum & index are slow). With 30 seconds
checkpoint cycles and without any warmup, I feel that these tests are really
on the very short (too short) side, so I'm not sure how much I can trust
such results as significant. The data I reported were with more real life
like parameters.
I see exactly the same with 300s or 1000s checkpoint cycles, it just
takes a lot longer to repeat. They're also similar (although obviously
both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.
Andres
<Oops, wrong "From" again, resent>
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1
I must say that I have not succeeded in reproducing any significant
regression up to now on an HDD. I'm running some more tests again because
I had left out some options above that I thought were non-essential.
I have deep problems with the 30-second checkpoint tests: basically the
checkpoints take much more than 30 seconds to complete, the system is not
stable, and the 300-second runs last more than 900 seconds because the
clients are stuck for a long time. The overall behavior is appalling, as
most of the time is spent in IO panic at 0 tps.
Also, the performance level is around 160 tps on HDDs, which makes sense to
me for a 7200 rpm HDD capable of about x00 random writes per second. It
seems to me that you reported much better performance on HDD, but I cannot
really see how this would be possible if data are indeed written to disk.
Any idea?
Also, what is the very precise postgres version & patch used in your
tests on HDDs?
both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.
Maybe this is an explanation....
--
Fabien.
On 2016-01-19 10:27:31 +0100, Fabien COELHO wrote:
Also, the performance level is around 160 tps on HDDs, which makes sense to
me for a 7200 rpm HDD capable of about x00 random writes per second. It
seems to me that you reported much better performance on HDD, but I cannot
really see how this would be possible if data are indeed written to disk. Any
idea?
synchronous_commit = off does make a significant difference.
synchronous_commit = off does make a significant difference.
Sure, but I had thought about that and kept this one...
I think I found one possible culprit: I automatically wrote 300 seconds
for checkpoint_timeout, instead of the 30 seconds in your settings. I'll
have to rerun the tests with this (unreasonable) figure to check whether I
really get a regression.
Other tests I ran with "reasonable" settings on a large (scale=800) db
did not show any significant performance regression, up to now.
--
Fabien.
On 2016-01-19 13:34:14 +0100, Fabien COELHO wrote:
synchronous_commit = off does make a significant difference.
Sure, but I had thought about that and kept this one...
But why are you then saying this is fundamentally limited to 160
xacts/sec?
I think I found one possible culprit: I automatically wrote 300 seconds for
checkpoint_timeout, instead of the 30 seconds in your settings. I'll have to
rerun the tests with this (unreasonable) figure to check whether I really
get a regression.
I've not seen meaningful changes in the size of the regression between 30/300s.
Other tests I ran with "reasonable" settings on a large (scale=800) db did
not show any significant performance regression, up to now.
Try running it so that the data set nearly, but not entirely, fits into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it is for yours.
That seems to be the point where the effect is the worst.
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
Hello Andres,
I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:
postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s
Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1
So, I've analyzed the problem further, and I think I found something
rather interesting. I'd profiled the kernel looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.
So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed: 2995 tps
After I kill the wal writer with -STOP: 10887 tps
Stracing the wal writer shows:
17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3) = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3) = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0 "..., 98304) = 98304
17:29:02.008822 fdatasync(3) = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3) = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0 "..., 73728) = 73728
17:29:02.019355 fdatasync(3) = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0
I.e. we're fdatasync()ing small amounts of pages, roughly 500 times a
second. As soon as the wal writer is stopped, it's much bigger chunks,
on the order of 50-130 pages. And, not that surprisingly, that improves
performance, because there are far fewer cache flushes submitted to the
hardware.
This seems like a problem with the WAL writer quite independent of
anything else. It seems likely to be inadvertent fallout from this
patch:
Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings). But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay. This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next. That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent. I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
synchronous_commit = off does make a significant difference.
Sure, but I had thought about that and kept this one...
But why are you then saying this is fundamentally limited to 160
xacts/sec?
I'm just saying that the tested load generates mostly random IOs (probably
on average over 1 page per transaction); random IOs are very slow on an
HDD, so I do not expect great tps.
I think I found one possible culprit: I automatically wrote 300 seconds for
checkpoint_timeout, instead of the 30 seconds in your settings. I'll have to
rerun the tests with this (unreasonable) figure to check whether I really
get a regression.
I've not seen meaningful changes in the size of the regression between 30/300s.
At 300 seconds (5 minutes) the checkpoint of the accumulated writes takes 15-25
minutes, during which the database is mostly offline, and there is no
clear difference with/without sort+flush.
Other tests I ran with "reasonnable" settings on a large (scale=800) db did
not show any significant performance regression, up to now.Try running it so that the data set nearly, but not entirely fit into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it's for yours.
That seems to be the point where the effect is the worst.
I have 16GB memory on the tested host, same as your hardware I think, so I
use scale 800 => 12GB at the beginning of the run. Not sure it fits the
bill as I think it fits in memory, so the load is mostly write and no/very
few reads. I'll also try with scale 1000.
--
Fabien.
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
This seems like a problem with the WAL writer quite independent of
anything else. It seems likely to be inadvertent fallout from this
patch:
Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
In addition to that the "powersaving" effort also plays a role - without
the latch we'd not wake up at any meaningful rate at all atm.
If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings). But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay. This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next. That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent. I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.
I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in a too granular fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.
Now the question of what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.
By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.
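As a sketch of that policy (hypothetical names, not a patch): flush when the
wal_writer_delay tick fires, or once at least 10% of wal_buffers' worth of
pages has been written since the last flush:

#include <stdbool.h>

typedef struct
{
    int pages_written_since_flush;  /* write()n but not yet fdatasync()ed */
    int xlog_buffers;               /* wal_buffers, in pages */
    bool delay_elapsed;             /* wal_writer_delay timer fired */
} WalWriterState;

static bool should_flush(const WalWriterState *s)
{
    if (s->delay_elapsed)
        return true;        /* minimum flush rate under activity */
    /* otherwise require a minimum batch before paying for a flush */
    return s->pages_written_since_flush >= s->xlog_buffers / 10;
}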
Andres
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
This seems like a problem with the WAL writer quite independent of
anything else. It seems likely to be inadvertent fallout from this
patch:
Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
In addition to that the "powersaving" effort also plays a role - without
the latch we'd not wake up at any meaningful rate at all atm.
The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)
Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of the wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because
the super-frequent fdatasync()s by the wal writer slow down concurrent
writes (bgwriter, checkpointer, individual backend writes)
dramatically. To the point that SIGSTOPing the wal writer gets a pgbench
workload from 2995 to 10887 tps. The reason fdatasyncs cause a slowdown
is that they prevent real use of queuing to the storage devices.
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings). But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay. This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next. That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent. I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.
I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in a too granular fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.
Now the question of what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.
By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.
This unfortunately leaves out part of the reasoning for the above
commit: We want WAL to be flushed fast, so we immediately can set hint
bits.
One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but use rules like suggested above
guiding the actual flushing. Additionally using operations like
sync_file_range() (and equivalents on other OSs). Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written, but not flushed.
That has the potential to be bad in a number of other cases though :(
Andres
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in a too granular fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.
Now the question of what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.
By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.
This unfortunately leaves out part of the reasoning for the above
commit: We want WAL to be flushed fast, so we immediately can set hint
bits.
One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but use rules like suggested above
guiding the actual flushing. Additionally using operations like
sync_file_range() (and equivalents on other OSs). Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written, but not flushed.
That has the potential to be bad in a number of other cases though :(
Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But it's actually fairly common
that the page's LSN is already newer than the commitLSN - in which case
we, afaics, can just go ahead and set the hint bit, no?
So, instead of
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
    return;    /* not flushed yet, so don't set hint */
we do
if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
    && BufferGetLSNAtomic(buffer) < commitLSN)
    return;    /* not flushed yet, so don't set hint */
In my tests with pgbench -s 100 and 2GB of shared buffers, that recovers
a large portion of the hint writes that we currently skip.
Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal writer:                        12827 tps, 95% IO util, 93% CPU
no flushing in wal writer *:               13185 tps, 46% IO util, 93% CPU
no flushing in wal writer & above change:  16366 tps, 41% IO util, 95% CPU
flushing in wal writer & above change:     14812 tps, 94% IO util, 95% CPU
* sometimes the results initially were much lower, with lots of lock
contention. Can't figure out why that's only sometimes the case. In
those cases the results were more like 8967 tps.
these aren't meant as thorough benchmarks, just to provide some
orientation.
Now that solution won't improve every situation, e.g. for a workload
that inserts a lot of rows in one transaction, and only does inserts, it
probably won't do all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above we should be ok.
Andres
Andres Freund wrote:
The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)
Interesting. If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Interesting. If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.
I don't think it's strongly related - the contention here is on read
access to the clog, not on write access. While Amit's patch will reduce
the impact of that a bit, I don't see it making a fundamental
difference.
Andres
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
The relevant thread is at
what I didn't remember is that I voiced concern back then about
exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)
Interesting. If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.

I don't think it's strongly related - the contention here is on read
access to the clog, not on write access.
Aren't reads on clog contended with parallel writes to clog?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
I don't think it's strongly related - the contention here is on read
access to the clog, not on write access.

Aren't reads on clog contended with parallel writes to clog?
Sure. But you're not going to beat "no access to the clog" due to hint
bits, by making parallel writes a bit better citizens.
On Wed, Jan 20, 2016 at 9:02 AM, Andres Freund <andres@anarazel.de> wrote:
Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But, it's actually fairly common
that the pages LSN is already newer than the commitLSN - in which case
we, afaics, just can go ahead and set the hint bit, no?

So, instead of
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
we do
if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
&& BufferGetLSNAtomic(buffer) < commitLSN)
return; /* not flushed yet, so don't set hint */

In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
a large portion of the hint writes that we currently skip.
Dang. That's a really good idea. Although I think you'd probably
better revise the comment, since it will otherwise be false.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
This patch got its fair share of reviewer attention this commitfest.
Moving to the next one. Andres, if you want to commit ahead of time
you're of course encouraged to do so.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.
I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
The main changes are that:
1) the significant performance regressions I saw are addressed by
changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
by individual GUCs. Without that I still saw the spikes in a lot of
circumstances.
There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.
Regards,
Andres
Hi Fabien,
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
by individual GUCs. Without that I still saw the spikes in a lot of
circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.
Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...
Andres
Hello Andres,
Any comments before I spend more time polishing this?
I'm running tests on various settings, I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.
I'm currently updating docs and comments to actually describe the
current state...
I did notice the mismatched documentation.
I think I would appreciate comments to understand why/how the ringbuffer
is used, and more comments in general, so it is fine if you improve this
part.
Minor details:
"typedefs.list" should be updated to WritebackContext.
"WritebackContext" is a typedef, "struct" is not needed.
I'll look at the code more deeply probably over next weekend.
--
Fabien.
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
I think I would appreciate comments to understand why/how the ringbuffer is
used, and more comments in general, so it is fine if you improve this part.
I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.
Thanks,
Andres
I think I would appreciate comments to understand why/how the
ringbuffer is used, and more comments in general, so it is fine if you
improve this part.

I'd suggest to leave out the ringbuffer/new bgwriter parts.
Ok, so the patch would only include the checkpointer stuff.
I'll look at this part in detail.
--
Fabien.
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I think I would appreciate comments to understand why/how the
ringbuffer is used, and more comments in general, so it is fine if you
improve this part.
I'd suggest to leave out the ringbuffer/new bgwriter parts.
Ok, so the patch would only include the checkpointer stuff.
I'll look at this part in detail.
Yes, that's the more pressing part. I've seen pretty good results with the new bgwriter, but it's not really worthwhile until sorting and flushing is in...
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently. The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.
0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes (example settings below).
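For reference, with 0002 applied the corresponding settings would look like
this in postgresql.conf (these are the patch's defaults):

    wal_writer_delay = 200ms        # upper bound on async-commit flush latency
    wal_writer_flush_after = 1MB    # volume threshold; 0 flushes on every write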
Greetings,
Andres Freund
Attachments:
0001-Allow-SetHintBits-to-succeed-if-the-buffer-s-LSN-is-.patch (text/x-patch)
From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is new
enough.
Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not require the commit record to be flushed, it just
requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.
In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits that are set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a followup commit to reduce the
flush rate.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
---
src/backend/utils/time/tqual.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 465933d..503bd1d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
* Set commit/abort hint bits on a tuple, if appropriate at this time.
*
* It is only safe to set a transaction-committed hint bit if we know the
- * transaction's commit record has been flushed to disk, or if the table is
- * temporary or unlogged and will be obliterated by a crash anyway. We
- * cannot change the LSN of the page here because we may hold only a share
- * lock on the buffer, so we can't use the LSN to interlock this; we have to
- * just refrain from setting the hint bit until some future re-examination
- * of the tuple.
+ * transaction's commit record is guaranteed to be flushed to disk before the
+ * buffer, or if the table is temporary or unlogged and will be obliterated by
+ * a crash anyway. We cannot change the LSN of the page here because we may
+ * hold only a share lock on the buffer, so we can only use the LSN to
+ * interlock this if the buffer's LSN already is newer than the commit LSN;
+ * otherwise we have to just refrain from setting the hint bit until some
+ * future re-examination of the tuple.
*
* We can always set hint bits when marking a transaction aborted. (Some
* code in heapam.c relies on that!)
@@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
/* NB: xid must be known committed here! */
XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid);
- if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
- return; /* not flushed yet, so don't set hint */
+ if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
+ BufferGetLSNAtomic(buffer) < commitLSN)
+ {
+ /* not flushed and no LSN interlock, so don't set hint */
+ return;
+ }
}
tuple->t_infomask |= infomask;
--
2.7.0.229.g701fa7f
0002-Allow-the-WAL-writer-to-flush-WAL-at-a-reduced-rate.patch (text/x-patch)
From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4. The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force barriers
to be emitted to the storage.
This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
---
doc/src/sgml/config.sgml | 41 +++++++---
src/backend/access/transam/README | 32 ++++----
src/backend/access/transam/xlog.c | 104 +++++++++++++++++++-------
src/backend/postmaster/walwriter.c | 1 +
src/backend/utils/misc/guc.c | 13 +++-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/postmaster/walwriter.h | 1 +
7 files changed, 141 insertions(+), 52 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index de84b77..ee8d63d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2344,15 +2344,38 @@ include_dir 'conf.d'
</indexterm>
</term>
<listitem>
- <para>
- Specifies the delay between activity rounds for the WAL writer.
- In each round the writer will flush WAL to disk. It then sleeps for
- <varname>wal_writer_delay</> milliseconds, and repeats. The default
- value is 200 milliseconds (<literal>200ms</>). Note that on many
- systems, the effective resolution of sleep delays is 10 milliseconds;
- setting <varname>wal_writer_delay</> to a value that is not a multiple
- of 10 might have the same results as setting it to the next higher
- multiple of 10. This parameter can only be set in the
+ <para>
+ Specifies how often the WAL writer flushes WAL. After flushing WAL it
+ sleeps for <varname>wal_writer_delay</> milliseconds, unless woken up
+ by an asynchronously committing transaction. In case the last flush
+ happened less than <varname>wal_writer_delay</> milliseconds ago and
+ less than <varname>wal_writer_flush_after</> bytes of WAL have been
+ produced since, WAL is only written to the OS, not flushed to disk.
+ The default value is 200 milliseconds (<literal>200ms</>). Note that
+ on many systems, the effective resolution of sleep delays is 10
+ milliseconds; setting <varname>wal_writer_delay</> to a value that is
+ not a multiple of 10 might have the same results as setting it to the
+ next higher multiple of 10. This parameter can only be set in the
+ <filename>postgresql.conf</> file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-writer-flush-after" xreflabel="wal_writer_flush_after">
+ <term><varname>wal_writer_flush_after</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_writer_flush_after</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies how often the WAL writer flushes WAL. In case the last flush
+ happened less than <varname>wal_writer_delay</> milliseconds ago and
+ less than <varname>wal_writer_flush_after</> bytes of WAL have been
+ produced since, WAL is only written to the OS, not flushed to disk.
+ If <varname>wal_writer_flush_after</> is set to <literal>0</> WAL is
+ flushed every time the WAL writer has written WAL. The default is
+ <literal>1MB</literal>. This parameter can only be set in the
<filename>postgresql.conf</> file or on the server command line.
</para>
</listitem>
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index f6db580..2de0489 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -736,20 +736,24 @@ non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.
-Every wal_writer_delay milliseconds, the walwriter process performs an
-XLogBackgroundFlush(). This checks the location of the last completely
-filled WAL page. If that has moved forwards, then we write all the changed
-buffers up to that point, so that under full load we write only whole
-buffers. If there has been a break in activity and the current WAL page is
-the same as before, then we find out the LSN of the most recent
-asynchronous commit, and flush up to that point, if required (i.e.,
-if it's in the current WAL page). This arrangement in itself would
-guarantee that an async commit record reaches disk during at worst the
-second walwriter cycle after the transaction completes. However, we also
-allow XLogFlush to flush full buffers "flexibly" (ie, not wrapping around
-at the end of the circular WAL buffer area), so as to minimize the number
-of writes issued under high load when multiple WAL pages are filled per
-walwriter cycle. This makes the worst-case delay three walwriter cycles.
+The walwriter regularly wakes up (via wal_writer_delay) or is woken up
+(via its latch, which is set by backends committing asynchronously) and
+performs an XLogBackgroundFlush(). This checks the location of the last
+completely filled WAL page. If that has moved forwards, then we write all
+the changed buffers up to that point, so that under full load we write
+only whole buffers. If there has been a break in activity and the current
+WAL page is the same as before, then we find out the LSN of the most
+recent asynchronous commit, and write up to that point, if required (i.e.
+if it's in the current WAL page). If more than wal_writer_delay has
+passed, or more than wal_writer_flush_after blocks have been written, since
+the last flush, WAL is also flushed up to the current location. This
+arrangement in itself would guarantee that an async commit record reaches
+disk after at most two times wal_writer_delay after the transaction
+completes. However, we also allow XLogFlush to write/flush full buffers
+"flexibly" (ie, not wrapping around at the end of the circular WAL buffer
+area), so as to minimize the number of writes issued under high load when
+multiple WAL pages are filled per walwriter cycle. This makes the worst-case
+delay three wal_writer_delay cycles.
There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..32e7ef2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
#include "replication/logical.h"
@@ -2729,28 +2730,37 @@ XLogFlush(XLogRecPtr record)
}
/*
- * Flush xlog, but without specifying exactly where to flush to.
+ * Write & flush xlog, but without specifying exactly where to.
*
- * We normally flush only completed blocks; but if there is nothing to do on
- * that basis, we check for unflushed async commits in the current incomplete
- * block, and flush through the latest one of those. Thus, if async commits
- * are not being used, we will flush complete blocks only. We can guarantee
- * that async commits reach disk after at most three cycles; normally only
- * one or two. (When flushing complete blocks, we allow XLogWrite to write
- * "flexibly", meaning it can stop at the end of the buffer ring; this makes a
- * difference only with very high load or long wal_writer_delay, but imposes
- * one extra cycle for the worst case for async commits.)
+ * We normally write only completed blocks; but if there is nothing to do on
+ * that basis, we check for unwritten async commits in the current incomplete
+ * block, and write through the latest one of those. Thus, if async commits
+ * are not being used, we will write complete blocks only.
+ *
+ * If, based on the above, there's anything to write we do so immediately. But
+ * to avoid calling fsync, fdatasync et. al. at a rate that'd impact
+ * concurrent IO, we only flush WAL every wal_writer_delay ms, or if there's
+ * more than wal_writer_flush_after unflushed blocks.
+ *
+ * We can guarantee that async commits reach disk after at most three
+ * wal_writer_delay cycles. (When flushing complete blocks, we allow XLogWrite
+ * to write "flexibly", meaning it can stop at the end of the buffer ring;
+ * this makes a difference only with very high load or long wal_writer_delay,
+ * but imposes one extra cycle for the worst case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
*
- * Returns TRUE if we flushed anything.
+ * Returns TRUE if there was any work to do, even if we skipped flushing due
+ * to wal_writer_delay/wal_writer_flush_after.
*/
bool
XLogBackgroundFlush(void)
{
- XLogRecPtr WriteRqstPtr;
+ XLogwrtRqst WriteRqst;
bool flexible = true;
- bool wrote_something = false;
+ static TimestampTz lastflush;
+ TimestampTz now;
+ int flushbytes;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
@@ -2759,17 +2769,17 @@ XLogBackgroundFlush(void)
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
- WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
+ WriteRqst = XLogCtl->LogwrtRqst;
SpinLockRelease(&XLogCtl->info_lck);
/* back off to last completed page boundary */
- WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
+ WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
/* if we have already flushed that far, consider async commit records */
- if (WriteRqstPtr <= LogwrtResult.Flush)
+ if (WriteRqst.Write <= LogwrtResult.Flush)
{
SpinLockAcquire(&XLogCtl->info_lck);
- WriteRqstPtr = XLogCtl->asyncXactLSN;
+ WriteRqst.Write = XLogCtl->asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
flexible = false; /* ensure it all gets written */
}
@@ -2779,7 +2789,7 @@ XLogBackgroundFlush(void)
* holding an open file handle to a logfile that's no longer in use,
* preventing the file from being deleted.
*/
- if (WriteRqstPtr <= LogwrtResult.Flush)
+ if (WriteRqst.Write <= LogwrtResult.Flush)
{
if (openLogFile >= 0)
{
@@ -2791,10 +2801,47 @@ XLogBackgroundFlush(void)
return false;
}
+ /*
+ * Determine how far to flush WAL, based on the wal_writer_delay and
+ * wal_writer_flush_after GUCs.
+ */
+ now = GetCurrentTimestamp();
+ flushbytes =
+ WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
+
+ if (WalWriterFlushAfter == 0 || lastflush == 0)
+ {
+ /* first call, or block based limits disabled */
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (TimestampDifferenceExceeds(lastflush, now, WalWriterDelay))
+ {
+ /*
+ * Flush the writes at least every WalWriterDelay ms. This is important
+ * to bound the amount of time it takes for an asynchronous commit to
+ * hit disk.
+ */
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (flushbytes >= WalWriterFlushAfter)
+ {
+ /* exceeded wal_writer_flush_after blocks, flush */
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else
+ {
+ /* no flushing, this time round */
+ WriteRqst.Flush = 0;
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
- elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
- (uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
+ elog(LOG, "xlog bg flush request write %X/%X; flush: %X/%X, current is write %X/%X; flush %X/%X",
+ (uint32) (WriteRqst.Write >> 32), (uint32) WriteRqst.Write,
+ (uint32) (WriteRqst.Flush >> 32), (uint32) WriteRqst.Flush,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
@@ -2802,17 +2849,13 @@ XLogBackgroundFlush(void)
START_CRIT_SECTION();
/* now wait for any in-progress insertions to finish and get write lock */
- WaitXLogInsertionsToFinish(WriteRqstPtr);
+ WaitXLogInsertionsToFinish(WriteRqst.Write);
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
LogwrtResult = XLogCtl->LogwrtResult;
- if (WriteRqstPtr > LogwrtResult.Flush)
+ if (WriteRqst.Write > LogwrtResult.Write ||
+ WriteRqst.Flush > LogwrtResult.Flush)
{
- XLogwrtRqst WriteRqst;
-
- WriteRqst.Write = WriteRqstPtr;
- WriteRqst.Flush = WriteRqstPtr;
XLogWrite(WriteRqst, flexible);
- wrote_something = true;
}
LWLockRelease(WALWriteLock);
@@ -2827,7 +2870,12 @@ XLogBackgroundFlush(void)
*/
AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
- return wrote_something;
+ /*
+ * If we determined that we need to write data, but somebody else
+ * wrote/flushed already, it should be considered as being active, to
+ * avoid hibernating too early.
+ */
+ return true;
}
/*
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 243adb6..9852fed 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -64,6 +64,7 @@
* GUC parameters
*/
int WalWriterDelay = 200;
+int WalWriterFlushAfter = 128;
/*
* Number of do-nothing loops before lengthening the delay time, and the
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a69ca..ea5a09a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2235,7 +2235,7 @@ static struct config_int ConfigureNamesInt[] =
{
{"wal_writer_delay", PGC_SIGHUP, WAL_SETTINGS,
- gettext_noop("WAL writer sleep time between WAL flushes."),
+ gettext_noop("Time between WAL flushes performed in the WAL writer."),
NULL,
GUC_UNIT_MS
},
@@ -2245,6 +2245,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_writer_flush_after", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Amount of WAL written out by WAL writer triggering a flush."),
+ NULL,
+ GUC_UNIT_XBLOCKS
+ },
+ &WalWriterFlushAfter,
+ 128, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
/* see max_connections */
{"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 09b2003..ee3d378 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -192,6 +192,7 @@
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
+#wal_writer_flush_after = 1MB # 0 disables
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
diff --git a/src/include/postmaster/walwriter.h b/src/include/postmaster/walwriter.h
index d94cb97..49c5c1d 100644
--- a/src/include/postmaster/walwriter.h
+++ b/src/include/postmaster/walwriter.h
@@ -14,6 +14,7 @@
/* GUC options */
extern int WalWriterDelay;
+extern int WalWriterFlushAfter;
extern void WalWriterMain(void) pg_attribute_noreturn();
--
2.7.0.229.g701fa7f
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently. The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
I previously reviewed 0001 and I think it's fine. I haven't reviewed
0002 in detail, but I like the concept.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Andres,
0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
I've looked at these patches, especially the whole bunch of explanations
and comments, which is a good source for understanding what is going on in
the WAL writer, a part of pg I'm not familiar with.
When reading the patch 0002 explanations, I had the following comments:
AFAICS, there are several levels of actions when writing things in pg (see
the sketch after this list):
0: the thing is written in some internal buffer
1: the buffer is advised to be passed to the OS (hint bits?)
2: the buffer is actually passed to the OS (write, flush)
3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)
4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
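To make levels 2-4 concrete, a minimal Linux-only illustration of the bare
system calls (illustrative only, not pg code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void
    write_levels(int fd, const char *buf, size_t len)
    {
        write(fd, buf, len);                    /* level 2: hand data to the OS */
        sync_file_range(fd, 0, (off_t) len,
                        SYNC_FILE_RANGE_WRITE); /* level 3: start writeback */
        fdatasync(fd);                          /* level 4: require data on disk */
    }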
It is not clear when reading the text which level is discussed. In
particular, I'm not sure that "flush" refers to level 2, which is
misleading. When reading the description, I'm rather under the impression
that it is about level 4, but then if actual fsyncs were performed every 200
ms then the tps would be very low...
After more considerations, my final understanding is that this behavior
only occurs with "asynchronous commit", aka a situation when COMMIT does
not wait for data to be really fsynced, but the fsync is to occur within
some delay so it will not be too far away, some kind of compromise for
performance where commits can be lost.
Now all this is somehow alien to me because the whole point of committing
is getting the data to disk, and I would not consider a database to be safe
if commit does not imply fsync, but I understand that people may have to
compromise for performance.
Is my understanding right?
--
Fabien.
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently. The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
I've pushed these after some more polishing, now working on the next
two.
Greetings,
Andres Freund
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
I've looked at these patches, especially the whole bunch of explanations and
comments, which is a good source for understanding what is going on in the
WAL writer, a part of pg I'm not familiar with.

When reading the patch 0002 explanations, I had the following comments:
AFAICS, there are several levels of actions when writing things in pg:
0: the thing is written in some internal buffer
1: the buffer is advised to be passed to the OS (hint bits?)
Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.
2: the buffer is actually passed to the OS (write, flush)
3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)
4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.
0002 is about something different: it's about the WAL writer, which
writes WAL to disk so individual backends don't have to. It does so in
the background every wal_writer_delay or whenever a transaction
asynchronously commits. The reason this interacts with checkpoint
flushing is that, when we flush writes at a regular pace, the writes by
the checkpointer happen in between the very frequent writes/fdatasync()
by the WAL writer. That means the disk's caches are flushed on every
fdatasync() - which causes considerable slowdowns. On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
the regular sync_file_range calls slowed things down too much.
That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.
After more considerations, my final understanding is that this behavior only
occurs with "asynchronous commit", aka a situation when COMMIT does not wait
for data to be really fsynced, but the fsync is to occur within some delay
so it will not be too far away, some kind of compromise for performance
where commits can be lost.
Right.
Now all this is somehow alien to me because the whole point of committing is
having the data to disk, and I would not consider a database to be safe if
commit does not imply fsync, but I understand that people may have to
compromise for performance.
It's obviously not applicable for every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.
Andres
Hello Andres,
I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
Below the results of a lot of tests with pgbench to exercise checkpoints
on the above version when fetched.
Overall comments:
- sorting & flushing is basically always a winner
- benchmarking with short runs on large databases is a bad idea
the results are very different if a longer run is used
(see andres00b vs andres00c)
# HOST/SOFT
16 GB 2 cpu 8 cores
200 GB RAID1 HDD, ext4 FS
Ubuntu 12.04 LTS (precise)
# ABOUT THE REPORTED STATISTICS
tps: the "excluding connection time" tps, the higher the better
1-sec tps: average of measured per-second tps
note - it should be the same as the previous one, but due to various
hazards in the trace, especially when things go badly and pg get
stuck, it may be different. Such hazard also explain why there
may be some non-integer tps reported for some seconds.
stddev: standard deviation, the lower the better
the five figures in brackets give a feel of the distribution:
- min: minimal per-second tps seen in the trace
- q1: first quarter per-second tps seen in the trace
- med: median per-second tps seen in the trace
- q3: third quarter per-second tps seen in the trace
- max: maximal per-second tps seen in the trace
the last percentage, dubbed "<=10.0", is the percentage of seconds where
performance was below 10 tps: it measures how unresponsive pg was during the run
###### TINY2
pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 10 (~ 200 MB)
postgresql.conf:
shared_buffers = 1GB
max_wal_size = 1GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
1 | 2575.0 / 2575.1 ± 359.3 [ 1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%
0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%
32 0 | 2625.5 / 2625.5 ± 250.5 [ 1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%
64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
2 | 2047.8 / 2046.4 ± 925.8 [ 0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?
Pretty small setup, all data fit in buffers. Good tps performance all around
(best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction
in tps stddev.
###### SMALL
pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 120 (~ 2 GB)
postgresql.conf:
shared_buffers = 2GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 209.2 / 204.2 ± 516.5 [0.0, 0.0, 4.0, 5.0, 2251.0] 82.3%
1 | 207.4 / 204.2 ± 518.7 [0.0, 0.0, 4.0, 5.0, 2245.1] 82.3%
2 | 217.5 / 211.0 ± 530.3 [0.0, 0.0, 3.0, 5.0, 2255.0] 82.0%
3 | 217.8 / 213.2 ± 531.7 [0.0, 0.0, 4.0, 6.0, 2261.9] 81.7%
4 | 230.7 / 223.9 ± 542.7 [0.0, 0.0, 4.0, 7.0, 2282.0] 80.7%
0 0 | 734.8 / 735.5 ± 879.9 [0.0, 1.0, 16.5, 1748.3, 2281.1] 47.0%
1 | 694.9 / 693.0 ± 849.0 [0.0, 1.0, 29.5, 1545.7, 2428.0] 46.4%
2 | 735.3 / 735.5 ± 888.4 [0.0, 0.0, 12.0, 1781.2, 2312.1] 47.9%
3 | 736.0 / 737.5 ± 887.1 [0.0, 1.0, 16.0, 1794.3, 2317.0] 47.5%
4 | 734.9 / 735.1 ± 885.1 [0.0, 1.0, 15.5, 1781.0, 2297.1] 47.2%
32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0, 753.0, 2312.1] 0.2%
1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0, 744.0, 2319.0] 0.1%
2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0, 756.0, 2331.0] 0.2%
3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2, 758.0, 2285.1] 0.1%
4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0, 759.0, 2308.1] 0.1%
64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0, 759.0, 2353.0] 0.1%
1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0, 758.2, 2296.0] 0.1%
2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1, 757.0, 2310.0] 0.1%
3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0, 701.0, 2380.9] 1.5% ??
4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0, 754.0, 2298.0] 0.1%
Sorting brings * 3.3 tps, flushing significantly reduces tps stddev.
Pg goes from 80% unresponsive to nearly always responsive.
###### MEDIUM
pgbench: -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 250 (~ 3.8 GB)
postgresql.conf:
shared_buffers = 4GB
max_wal_size = 4GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 214.8 / 211.8 ± 513.7 [0.0, 1.0, 4.0, 5.0, 2344.0] 82.4%
1 | 219.2 / 215.0 ± 524.1 [0.0, 0.0, 4.0, 5.0, 2316.0] 82.2%
2 | 240.9 / 234.6 ± 550.8 [0.0, 0.0, 4.0, 6.0, 2320.2] 81.0%
0 0 | 1064.7 / 1065.3 ± 888.2 [0.0, 11.0, 1089.0, 2017.7, 2461.9] 24.7%
1 | 1060.2 / 1061.2 ± 889.9 [0.0, 10.0, 1056.7, 2022.0, 2444.9] 25.1%
2 | 1060.2 / 1061.4 ± 889.1 [0.0, 9.0, 1085.8, 2002.8, 2473.0] 25.6%
32 0 | 1059.4 / 1059.4 ± 476.3 [3.0, 804.9, 980.0, 1123.0, 2448.1] 0.1%
1 | 1062.5 / 1062.6 ± 475.6 [0.0, 807.0, 988.0, 1132.0, 2441.0] 0.1%
2 | 1063.7 / 1063.7 ± 475.4 [0.0, 814.0, 987.0, 1131.2, 2432.1] 0.1%
64 0 | 1052.6 / 1052.6 ± 475.3 [0.0, 793.0, 974.0, 1118.1, 2445.1] 0.1%
1 | 1059.8 / 1059.8 ± 475.1 [0.0, 799.0, 987.5, 1131.0, 2457.1] 0.1%
2 | 1058.5 / 1058.5 ± 472.8 [0.0, 807.0, 985.0, 1127.7, 2442.0] 0.1%
Sorting brings * 4.8 tps, flushing significantly reduces tps stddev.
Pg goes from +80% unresponsive to nearly always responsive.
Performance is significantly better than "small" above, probably thanks to
the longer checkpoint timeout.
###### LARGE
pgbench -M prepared -N -P 1 -T 7500 -j 2 -c 4
with scale = 1000 (~ 15 GB)
postgresql.conf:
shared_buffers = 4GB
max_wal_size = 2GB
checkpoint_timeout = 40min
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64}
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 68.7 / 65.3 ± 78.6 [0.0, 3.0, 6.0, 136.0, 291.0] 53.1%
1 | 70.6 / 70.3 ± 80.1 [0.0, 4.0, 10.0, 151.0, 282.0] 50.1%
2 | 74.3 / 75.8 ± 84.9 [0.0, 4.0, 9.0, 162.0, 311.2] 50.3%
0 0 | 117.2 / 116.9 ± 83.8 [0.0, 14.0, 139.0, 193.0, 372.4] 24.0%
1 | 117.3 / 117.8 ± 83.8 [0.0, 16.0, 140.0, 193.0, 279.0] 23.9%
2 | 117.6 / 118.2 ± 84.1 [0.0, 16.0, 141.0, 194.0, 297.8] 23.7%
32 0 | 114.2 / 114.2 ± 45.7 [0.0, 84.0, 100.0, 131.0, 613.6] 0.4%
1 | 112.5 / 112.6 ± 44.0 [0.0, 83.0, 98.0, 130.0, 293.0] 0.2%
2 | 108.0 / 108.0 ± 44.7 [0.0, 79.0, 94.0, 124.0, 303.6] 0.3%
64 0 | 113.0 / 113.0 ± 45.5 [0.0, 83.0, 99.0, 131.0, 289.0] 0.4%
1 | 80.0 / 80.3 ± 39.1 [0.0, 56.0, 72.0, 95.0, 281.0] 0.8% ??
2 | 112.2 / 112.3 ± 44.5 [0.0, 82.0, 99.0, 129.0, 282.0] 0.3%
Data do not fit in the available memory, so plenty of read accesses.
Sorting still has some impact on tps performance (* 1.6), flushing
greatly improves responsiveness.
###### ANDRES00
pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
with scale = 800 (~ 13 GB)
postgresql.conf:
shared_buffers = 2GB
max_wal_size = 100GB
wal_level = hot_standby
maintenance_work_mem = 2GB
checkpoint_timeout = 30s
checkpoint_completion_target = 0.8
synchronous_commit = off
checkpoint_flush_after = { none, 0, 32, 64 }
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 328.7 / 329.9 ± 716.9 [0.0, 0.0, 0.0, 0.0, 3221.2] 77.7%
1 | 338.2 / 338.7 ± 728.6 [0.0, 0.0, 0.0, 17.0, 3296.3] 75.0%
2 | 304.5 / 304.3 ± 705.5 [0.0, 0.0, 0.0, 0.0, 3463.4] 79.3%
0 0 | 425.6 / 464.0 ± 724.0 [0.0, 0.0, 0.0, 1000.6, 3363.7] 61.0%
1 | 461.5 / 463.1 ± 735.8 [0.0, 0.0, 0.0, 1011.2, 3490.9] 58.7%
2 | 452.4 / 452.6 ± 744.3 [0.0, 0.0, 0.0, 1078.9, 3631.9] 63.3%
32 0 | 514.4 / 515.8 ± 651.8 [0.0, 0.0, 337.4, 808.3, 2876.0] 40.7%
1 | 512.0 / 514.6 ± 661.6 [0.0, 0.0, 317.6, 690.8, 3315.8] 35.0%
2 | 529.5 / 530.3 ± 673.0 [0.0, 0.0, 321.1, 906.4, 3360.8] 40.3%
64 0 | 529.6 / 530.9 ± 668.2 [0.0, 0.0, 322.1, 786.1, 3538.0] 33.3%
1 | 496.4 / 498.0 ± 606.6 [0.0, 0.0, 321.4, 746.0, 2629.6] 36.3%
2 | 521.0 / 521.7 ± 657.0 [0.0, 0.0, 328.4, 737.9, 3262.9] 34.3%
Data just about fit in memory, maybe. The run is very short and the settings
are low; this is not representative of a sane installation, it is for testing
a lot of checkpoints in a difficult situation. Sorting and flushing do bring
significant benefits.
###### ANDRES00b (same as ANDRES00 but scale 800->1000)
pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
with scale = 1000 (~ 15 GB)
postgresql.conf:
shared_buffers = 2GB
max_wal_size = 100GB
wal_level = hot_standby
maintenance_work_mem = 2GB
checkpoint_timeout = 30s
checkpoint_completion_target = 0.8
synchronous_commit = off
checkpoint_flush_after = { none, 0, 32, 64 }
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 150.2 / 150.3 ± 401.6 [0.0, 0.0, 0.0, 0.0, 2199.4] 75.1%
1 | 139.2 / 139.2 ± 372.2 [0.0, 0.0, 0.0, 0.0, 2111.4] 78.3% ***
2 | 127.3 / 127.1 ± 341.2 [0.0, 0.0, 0.0, 53.0, 2144.3] 74.7% ***
0 0 | 199.0 / 209.2 ± 400.4 [0.0, 0.0, 0.0, 243.6, 1846.0] 65.7%
1 | 220.4 / 226.7 ± 423.2 [0.0, 0.0, 0.0, 264.0, 1777.0] 63.5% *
2 | 195.5 / 205.3 ± 337.9 [0.0, 0.0, 123.0, 212.0, 1721.9] 43.2%
32 0 | 362.3 / 359.0 ± 308.4 [0.0, 200.0, 265.0, 416.4, 1816.6] 5.0%
1 | 323.6 / 321.2 ± 327.1 [0.0, 142.9, 210.0, 353.4, 1907.0] 4.0%
2 | 309.0 / 310.7 ± 381.3 [0.0, 122.0, 175.5, 298.0, 2090.4] 5.0%
64 0 | 342.7 / 343.6 ± 331.1 [0.0, 143.0, 239.5, 409.9, 1623.6] 5.3%
1 | 333.8 / 328.2 ± 356.3 [0.0, 132.9, 211.5, 358.1, 1629.1] 10.7% ??
2 | 352.0 / 352.0 ± 332.3 [0.0, 163.5, 239.9, 400.1, 1643.4] 5.3%
A little bit larger than the previous one, so that it does not really fit in
memory. The performance impact is significant compared to the previous run.
Sorting and flushing bring * 2 tps; unresponsiveness drops from 75% to a much
better 5%.
###### ANDRES00c (same as ANDRES00b but time 300 -> 4000)
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 115.2 / 114.3 ± 256.4 [0.0, 0.0, 75.0, 131.1, 3389.0] 46.5%
1 | 118.4 / 117.9 ± 248.3 [0.0, 0.0, 87.0, 151.0, 3603.6] 46.7%
2 | 120.1 / 119.2 ± 254.4 [0.0, 0.0, 91.0, 143.0, 3307.8] 43.8%
0 0 | 217.4 / 211.0 ± 237.1 [0.0, 139.0, 193.0, 239.0, 3115.4] 16.8%
1 | 216.2 / 209.6 ± 244.9 [0.0, 138.9, 188.0, 231.0, 3331.3] 16.3%
2 | 218.6 / 213.8 ± 246.7 [0.0, 137.0, 187.0, 232.0, 3229.6] 16.2%
32 0 | 146.6 / 142.5 ± 234.5 [0.0, 59.0, 93.0, 151.1, 3294.7] 17.5%
1 | 148.0 / 142.6 ± 239.2 [0.0, 64.0, 95.9, 144.0, 3361.8] 16.0%
2 | 147.6 / 140.4 ± 233.2 [0.0, 59.4, 94.0, 148.0, 3108.4] 18.0%
64 0 | 145.3 / 140.5 ± 233.6 [0.0, 61.0, 93.0, 147.7, 3212.6] 16.5%
1 | 145.6 / 140.3 ± 233.3 [0.0, 58.0, 93.0, 146.0, 3351.8] 17.3%
2 | 147.7 / 142.2 ± 233.2 [0.0, 61.0, 97.0, 148.4, 3616.3] 17.0%
The only difference between ANDRES00B and ANDRES00C is the duration, from
5 minutes to 66 minutes. This shows that short runs can be widely misleading:
in particular the longer run shows less than half the tps for some settings,
and the relative comparison of head vs sort vs sort+flush is different.
###### ANDRES00d (same as ANDRES00b but wal_level hot_standby->minimal)
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 191.6 / 195.1 ± 439.3 [0.0, 0.0, 0.0, 0.0, 2540.2] 76.3%
1 | 211.3 / 213.6 ± 461.9 [0.0, 0.0, 0.0, 13.0, 3203.7] 75.0%
2 | 152.4 / 154.9 ± 217.6 [0.0, 0.0, 58.0, 235.6, 995.9] 39.3% ???
0 0 | 247.2 / 251.7 ± 454.0 [0.0, 0.0, 0.0, 375.3, 2592.4] 67.7%
1 | 215.4 / 232.7 ± 446.5 [0.0, 0.0, 0.0, 103.0, 3046.7] 72.3%
2 | 160.6 / 160.8 ± 222.1 [0.0, 0.0, 80.0, 209.6, 885.3] 42.0% ???
32 0 | 399.9 / 397.0 ± 356.6 [0.0, 67.0, 348.0, 572.8, 2604.2] 21.0%
1 | 391.8 / 392.5 ± 371.7 [0.0, 85.5, 314.4, 549.3, 2590.3] 20.7%
2 | 406.1 / 404.8 ± 380.6 [0.0, 95.0, 348.5, 569.0, 3383.7] 21.3%
64 0 | 395.9 / 396.1 ± 352.4 [0.0, 89.5, 342.5, 556.0, 2366.9] 17.7%
1 | 355.1 / 351.9 ± 296.7 [0.0, 172.5, 306.1, 468.1, 1663.5] 16.0%
2 | 403.6 / 401.8 ± 390.5 [0.0, 0.0, 337.0, 636.1, 2591.3] 26.7% ???
###### ANDRES00e (same as ANDRES00b but maintenance_work_mem=2GB->64MB)
opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0
head 0 | 153.5 / 161.3 ± 401.3 [0.0, 0.0, 0.0, 0.0, 2546.0] 82.0%
1 | 170.7 / 175.9 ± 399.9 [0.0, 0.0, 0.0, 14.0, 2537.4] 74.7%
2 | 184.7 / 190.4 ± 389.2 [0.0, 0.0, 0.0, 158.5, 2544.6] 69.3%
0 0 | 211.2 / 227.8 ± 418.8 [0.0, 0.0, 0.0, 334.6, 2589.3] 65.7%
1 | 221.7 / 226.0 ± 415.7 [0.0, 0.0, 0.0, 276.8, 2588.2] 68.4%
2 | 232.5 / 233.2 ± 403.5 [0.0, 0.0, 0.0, 377.0, 2260.2] 62.0%
32 0 | 373.2 / 374.4 ± 309.2 [0.0, 180.6, 321.8, 475.2, 2596.5] 11.3%
1 | 348.7 / 348.1 ± 328.4 [0.0, 127.0, 284.1, 451.9, 2595.1] 17.3%
2 | 376.3 / 375.3 ± 315.5 [0.0, 186.5, 329.6, 487.1, 2365.4] 15.3%
64 0 | 388.9 / 387.8 ± 348.7 [0.0, 164.0, 305.9, 546.5, 2587.2] 15.0%
1 | 380.3 / 378.7 ± 338.8 [0.0, 171.1, 317.4, 524.8, 2592.4] 16.7%
2 | 369.8 / 367.4 ± 340.5 [0.0, 77.4, 320.6, 525.5, 2484.7] 20.7%
Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not too consistent between settings, probably
because as the memory is used to its limit the performance is quite
sensitive to the available memory.
--
Fabien.
Hi,
On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
Below the results of a lot of tests with pgbench to exercise checkpoints on
the above version when fetched.
Wow, that's a great test series.
Overall comments:
- sorting & flushing is basically always a winner
- benchmarking with short runs on large databases is a bad idea
the results are very different if a longer run is used
(see andres00b vs andres00c)
Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.
I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.
16 GB 2 cpu 8 cores
200 GB RAID1 HDD, ext4 FS
Ubuntu 12.04 LTS (precise)
That's with 12.04's standard kernel?
postgresql.conf:
shared_buffers = 1GB
max_wal_size = 1GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }
Did you re-initdb between the runs?
I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.
Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not too consistent between settings, probably
because as the memory is used to its limit the performance is quite
sensitive to the available memory.
That's probably because of differing behaviour of autovacuum/vacuum,
which sometime will have to do several scans of the tables if there are
too many dead tuples.
Regards,
Andres
Hello.
Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.
Yes, all these runs show that 32 is basically as good as or better than 64.
I'll do some runs with 16/48 to have some more data.
I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.
Indeed, non-reported configuration options have their default values.
There were also minor changes in the default options for logging (prefix,
checkpoint, ...), but nothing significant, and always the same for all
runs.
[...] Ubuntu 12.04 LTS (precise)
That's with 12.04's standard kernel?
Yes.
checkpoint_flush_after = { none, 0, 32, 64 }
Did you re-initdb between the runs?
Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).
I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.
Yes, I agree that probably the performance changes on long vs short runs
(andres00c vs andres00b) is due to autovacuum.
--
Fabien.
Hi Fabien,
Fabien COELHO wrote on 19.02.2016 at 16:04:
[...] Ubuntu 12.04 LTS (precise)
That's with 12.04's standard kernel?
Yes.
Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO somehow. The difference
to 3.13 (the latest LTS kernel for 12.04) is huge.
You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:
https://wiki.ubuntu.com/Kernel/LTSEnablementStack
/Patric
Hello Patric,
Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify
IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is
huge.
Interesting! To summarize it, 25% performance degradation from best kernel
(2.6.32) to worst (3.2.0), that is indeed significant.
You might consider upgrading your kernel to 3.13 LTS. It's quite easy
[...]
There is other stuff running on the hardware that I do not wish to touch,
so upgrading the particular host is currently not an option, otherwise I
would have switched to trusty.
Thanks for the pointer.
--
Fabien.
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
Hi,
Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
I've updated the git tree.
Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.
For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore
I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
understand. Some language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes (a usage
sketch follows this list).
* Currently *_flush_after can be set to a nonzero value, even if there's
no support for flushing on that platform. Imo that's ok, but perhaps
other people's opinion differ.
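For reference, a minimal sketch of how the 0001 API is driven (names taken
from the patch; the surrounding loop is schematic, not actual backend code):

    WritebackContext wb_context;

    /* the cap is a pointer to the GUC, so SIGHUP changes are picked up */
    WritebackContextInit(&wb_context, &checkpoint_flush_after);

    while (have_dirty_buffers())    /* schematic driver loop */
    {
        BufferTag   tag;

        /* ... write out a dirty buffer, remember its tag ... */

        /*
         * Queue the tag; once *max_pending tags have accumulated the
         * context sorts and coalesces them and calls smgrwriteback().
         */
        ScheduleBufferTagForWriteback(&wb_context, &tag);
    }

    /* push out whatever is still queued */
    IssuePendingWritebacks(&wb_context);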
For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
it doesn't take a lot of time.
* Minor comment polishing.
I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
inclined to leave it off till then.
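For anyone who wants to test that: a self-contained sketch of the method in
plain POSIX, roughly what the msync branch of pg_flush_data() intends (note
that mmap() requires a page-aligned offset):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdio.h>

    /* hint the kernel to start writeback of fd's range [offset, offset + nbytes) */
    static void
    flush_range_msync(int fd, off_t offset, size_t nbytes)
    {
        void   *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, offset);

        if (p == MAP_FAILED)
        {
            perror("mmap");
            return;
        }

        /* MS_ASYNC schedules the writes without waiting for completion */
        if (msync(p, nbytes, MS_ASYNC) != 0)
            perror("msync");

        if (munmap(p, nbytes) != 0)
            perror("munmap");
    }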
Regards,
Andres
Attachments:
0001-Allow-to-trigger-kernel-writeback-after-a-configurab.patch (text/x-patch; charset=us-ascii)
From 58aee659417372f3dda4420d8f2a4f4d41c56d31 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 19 Feb 2016 12:13:05 -0800
Subject: [PATCH 1/4] Allow to trigger kernel writeback after a configurable
number of writes.
Currently writes to the main data files of postgres all go through the
OS page cache. This means that several operating systems can
end up collecting a large number of dirty buffers in their respective
page caches. When these dirty buffers are flushed to storage rapidly,
be it because of fsync(), timeouts, or dirty ratios, latency for other
writes can increase massively. This is the primary reason for regular
massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.
On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't generally want this
behavior, e.g. for temporary files it's undesirable.
Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache. Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.
With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes is controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of writes after which
flushing is beneficial differs between them, and because the performance
considerations of controlled flushing differ for each of them.
A later patch will add checkpoint sorting - after that flushes from the
checkpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which is slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences. This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.
TODO:
* verify msync codepath
* properly detect mmap() && msync(MS_ASYNC) support, use it by default
if available and sync_file_range is *not* available
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
---
doc/src/sgml/config.sgml | 81 +++++++++++++++
doc/src/sgml/wal.sgml | 13 +++
src/backend/postmaster/bgwriter.c | 8 +-
src/backend/storage/buffer/buf_init.c | 5 +
src/backend/storage/buffer/bufmgr.c | 185 +++++++++++++++++++++++++++++++++-
src/backend/storage/file/copydir.c | 4 +-
src/backend/storage/file/fd.c | 153 +++++++++++++++++++++++++---
src/backend/storage/smgr/md.c | 49 +++++++++
src/backend/storage/smgr/smgr.c | 19 +++-
src/backend/utils/misc/guc.c | 36 +++++++
src/include/storage/buf_internals.h | 31 +++++-
src/include/storage/bufmgr.h | 22 +++-
src/include/storage/fd.h | 3 +-
src/include/storage/smgr.h | 4 +
src/tools/pgindent/typedefs.list | 2 +
15 files changed, 586 insertions(+), 29 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..3dc6719 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1843,6 +1843,32 @@ include_dir 'conf.d'
</para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">
+ <term><varname>bgwriter_flush_after</varname> (<type>int</type>)
+ <indexterm>
+ <primary><varname>bgwriter_flush_after</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whenever more than <varname>bgwriter_flush_after</varname> bytes have
+ been written by the bgwriter, hint to OS to flush these writes to the
+ underlying storage. Doing so will limit the amount of dirty data in
+ the kernel's page cache, reducing the likelihood of stalls when fsync
+ is issued at the end of a checkpoint, or when the OS writes out data
+ in larger batches in the background. Often that will result in
+ greatly reduced transaction latency, but there also are some cases,
+ especially with workloads that are bigger than <xref
+ linkend="guc-shared-buffers">, but smaller than the OS's page cache,
+ where performance might degrade. This setting may have no effect on
+ some platforms. <literal>0</literal> disables controlled flushing.
+ The default is <literal>256Kb</> on Linux, <literal>0</> otherwise.
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
<para>
@@ -1944,6 +1970,35 @@ include_dir 'conf.d'
</para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-backend-flush-after" xreflabel="backend_flush_after">
+ <term><varname>backend_flush_after</varname> (<type>int</type>)
+ <indexterm>
+ <primary><varname>backend_flush_after</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whenever more than <varname>backend_flush_after</varname> bytes have
+ been written by a single backend, hint to OS to flush these writes to
+ the underlying storage. Doing so will limit the amount of dirty data
+ in the kernel's page cache, reducing the likelihood of stalls when
+ fsync is issued at the end of a checkpoint, or when the OS writes out
+ data in larger batches in the background. Often that will result in
+ greatly reduced transaction latency, but there also are some cases,
+ especially with workloads that are bigger than <xref
+ linkend="guc-shared-buffers">, but smaller than the OS's page cache,
+ where performance might degrade. Note that because
+ <varname>backend_flush_after</varname> is per-backend, the total
+ amount of dirty data in the kernel's page cache can be considerably
+ bigger than this setting. This setting may have no effect on some
+ platforms. <literal>0</literal> disables controlled flushing. The
+ default is <literal>256Kb</> on Linux, <literal>0</> otherwise. This
+ parameter can only be set in the <filename>postgresql.conf</> file or
+ on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
</sect1>
@@ -2475,6 +2530,32 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-flush-after" xreflabel="checkpoint_flush_after">
+ <term><varname>checkpoint_flush_after</varname> (<type>int</type>)
+ <indexterm>
+ <primary><varname>checkpoint_flush_after</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whenever more than <varname>checkpoint_flush_after</varname> bytes
+ have been written while performing a checkpoint, hint to OS to flush
+ these writes to the underlying storage. Doing so will limit the
+ amount of dirty data in the kernel's page cache, reducing the
+ likelihood of stalls when fsync is issued at the end of a checkpoint,
+ or when the OS writes out data in larger batches in the background.
+ Often that will result in greatly reduced transaction latency, but
+ there also are some cases, especially with workloads that are bigger
+ than <xref linkend="guc-shared-buffers">, but smaller than the OS's
+ page cache, where performance might degrade. This setting may have no
+ effect on some platforms. <literal>0</literal> disables controlled
+ flushing. The default is <literal>256Kb</> on Linux, <literal>0</>
+ otherwise. This parameter can only be set in the
+ <filename>postgresql.conf</> file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..96496b0 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,19 @@
</para>
<para>
+ On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after">
+ allows hinting to the OS that pages written by the checkpoint should be
+ flushed to disk. Otherwise, these pages may be kept in the OS's page
+ cache, inducing a stall when <literal>fsync</> is called later. This
+ setting helps to reduce transaction latency, but it can also have an adverse
+ effect on performance; particularly for workloads that are bigger than
+ <xref linkend="guc-shared-buffers">, but smaller than the OS's page cache.
+ It should be beneficial for high write loads on HDD. This feature probably
+ brings no benefit on SSD, as the I/O write latency is small on such
+ hardware, thus it may be disabled.
+ </para>
+
+ <para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4ff4caf..7d0371d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -111,6 +111,7 @@ BackgroundWriterMain(void)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ WritebackContext wb_context;
/*
* Properly accept or ignore signals the postmaster might send us.
@@ -164,6 +165,8 @@ BackgroundWriterMain(void)
ALLOCSET_DEFAULT_MAXSIZE);
MemoryContextSwitchTo(bgwriter_context);
+ WritebackContextInit(&wb_context, &bgwriter_flush_after);
+
/*
* If an exception is encountered, processing resumes here.
*
@@ -208,6 +211,9 @@ BackgroundWriterMain(void)
/* Flush any leaked data in the top-level context */
MemoryContextResetAndDeleteChildren(bgwriter_context);
+ /* re-initialize to avoid repeated errors causing problems */
+ WritebackContextInit(&wb_context, &bgwriter_flush_after);
+
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -269,7 +275,7 @@ BackgroundWriterMain(void)
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync();
+ can_hibernate = BgBufferSync(&wb_context);
/*
* Send off activity statistics to the stats collector
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f013a4d..e10071d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -23,6 +23,7 @@ char *BufferBlocks;
LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
LWLockTranche BufferIOLWLockTranche;
LWLockTranche BufferContentLWLockTranche;
+WritebackContext BackendWritebackContext;
/*
@@ -149,6 +150,10 @@ InitBufferPool(void)
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
+
+ /* Initialize per-backend file flush context */
+ WritebackContextInit(&BackendWritebackContext,
+ &backend_flush_after);
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7141eb8..cdbda0f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -83,6 +83,14 @@ bool track_io_timing = false;
int effective_io_concurrency = 0;
/*
+ * GUC variables about triggering kernel writeback for buffers written; OS
+ * dependent defaults are set via the GUC mechanism.
+ */
+int checkpoint_flush_after = 0;
+int bgwriter_flush_after = 0;
+int backend_flush_after = 0;
+
+/*
* How many buffers PrefetchBuffer callers should try to stay ahead of their
* ReadBuffer calls by. This is maintained by the assign hook for
* effective_io_concurrency. Zero means "never prefetch". This value is
@@ -399,7 +407,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
static void BufferSync(int flags);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *flush_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -416,6 +424,7 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
+static int buffertag_comparator(const void *p1, const void *p2);
/*
@@ -818,6 +827,12 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+
+ /*
+ * XXX: Note that we're *not* doing a ScheduleBufferTagForWriteback
+ * here. At least on linux doing so defeats 'delayed allocation',
+ * leading to more fragmented files.
+ */
}
else
{
@@ -1084,6 +1099,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
FlushBuffer(buf, NULL);
LWLockRelease(BufferDescriptorGetContentLock(buf));
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &buf->tag);
+
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
@@ -1642,6 +1660,7 @@ BufferSync(int flags)
int num_to_write;
int num_written;
int mask = BM_DIRTY;
+ WritebackContext wb_context;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1694,6 +1713,9 @@ BufferSync(int flags)
if (num_to_write == 0)
return; /* nothing to do */
+
+ WritebackContextInit(&wb_context, &checkpoint_flush_after);
+
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
/*
@@ -1725,7 +1747,7 @@ BufferSync(int flags)
*/
if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
- if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+ if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
@@ -1777,7 +1799,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(void)
+BgBufferSync(WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -2002,7 +2024,8 @@ BgBufferSync(void)
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int buffer_state = SyncOneBuffer(next_to_clean, true);
+ int buffer_state = SyncOneBuffer(next_to_clean, true,
+ wb_context);
if (++next_to_clean >= NBuffers)
{
@@ -2079,10 +2102,11 @@ BgBufferSync(void)
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
int result = 0;
+ BufferTag tag;
ReservePrivateRefCountEntry();
@@ -2123,8 +2147,13 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
FlushBuffer(bufHdr, NULL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+
+ tag = bufHdr->tag;
+
UnpinBuffer(bufHdr, true);
+ ScheduleBufferTagForWriteback(wb_context, &tag);
+
return result | BUF_WRITTEN;
}
@@ -3724,3 +3753,149 @@ rnode_comparator(const void *p1, const void *p2)
else
return 0;
}
+
+
+/*
+ * BufferTag comparator.
+ */
+static int
+buffertag_comparator(const void *a, const void *b)
+{
+ const BufferTag *ba = (const BufferTag *) a;
+ const BufferTag *bb = (const BufferTag *) b;
+ int ret;
+
+ ret = rnode_comparator(&ba->rnode, &bb->rnode);
+
+ if (ret != 0)
+ return ret;
+
+ if (ba->forkNum < bb->forkNum)
+ return -1;
+ if (ba->forkNum > bb->forkNum)
+ return 1;
+
+ if (ba->blockNum < bb->blockNum)
+ return -1;
+ if (ba->blockNum > bb->blockNum)
+ return 1;
+
+ return 0;
+}
+
+
+/*
+ * Initialize a writeback context, discarding potential previous state.
+ *
+ * *max_pending is a pointer to a variable containing the current maximum
+ * number of writeback requests that will be coalesced into a bigger one. A
+ * value <= 0 means that no writeback control will be performed. max_pending
+ * is a pointer instead of an immediate value, so the coalesce limits can
+ * easily be changed by the GUC mechanism, and so calling code does not have to
+ * check the current config variables.
+ */
+void
+WritebackContextInit(WritebackContext *context, int *max_pending)
+{
+ Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
+
+ context->max_pending = max_pending;
+ context->nr_pending = 0;
+}
+
+
+/*
+ * Add buffer to list of pending writeback requests.
+ */
+void
+ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
+{
+ PendingWriteback *pending;
+
+ /* nothing to do if flushing is disabled */
+ if (*context->max_pending <= 0 && context->nr_pending <= 0)
+ return;
+
+ Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
+
+ pending = &context->pending_writebacks[context->nr_pending++];
+
+ pending->tag = *tag;
+
+ if (context->nr_pending >= *context->max_pending)
+ IssuePendingWritebacks(context);
+}
+
+/*
+ * Issue all pending writeback requests, previously scheduled with
+ * ScheduleBufferTagForWriteback, to the OS.
+ *
+ * Because this is only used to improve the OS's IO scheduling we try to never
+ * error out - it's just a hint.
+ */
+void
+IssuePendingWritebacks(WritebackContext *context)
+{
+ int i;
+
+ if (context->nr_pending == 0)
+ return;
+
+ /*
+ * Executing the writes in-order can make them a lot faster, and allows us to
+ * merge writeback requests to consecutive blocks into larger writebacks.
+ */
+ qsort(&context->pending_writebacks, context->nr_pending,
+ sizeof(PendingWriteback), buffertag_comparator);
+
+ /*
+ * Coalesce neighbouring writes, but nothing else. For that we iterate
+ * through the, now sorted, array of pending flushes, and look forward to
+ * find all neighbouring (or identical) writes.
+ */
+ for (i = 0; i < context->nr_pending; i++)
+ {
+ PendingWriteback *cur;
+ PendingWriteback *next;
+ SMgrRelation reln;
+ int ahead;
+ BufferTag tag;
+ Size nblocks = 1;
+
+ cur = &context->pending_writebacks[i];
+ tag = cur->tag;
+
+ /*
+ * Peek ahead, into following writeback requests, to see if they can
+ * be combined with the current one.
+ */
+ for (ahead = 0; i + ahead + 1 < context->nr_pending; ahead++)
+ {
+ next = &context->pending_writebacks[i + ahead + 1];
+
+ /* different file, skip */
+ if (!RelFileNodeEquals(cur->tag.rnode, next->tag.rnode) ||
+ cur->tag.forkNum != next->tag.forkNum)
+ break;
+
+ /* ok, block flushed twice, skip */
+ if (cur->tag.blockNum == next->tag.blockNum)
+ continue;
+
+ /* only merge consecutive writes */
+ if (cur->tag.blockNum + 1 != next->tag.blockNum)
+ break;
+
+ nblocks++;
+ cur = next;
+ }
+
+ i += ahead;
+
+ /* and finally tell the kernel to write the data to storage */
+ reln = smgropen(tag.rnode, InvalidBackendId);
+ smgrwriteback(reln, tag.forkNum, tag.blockNum, nblocks);
+ }
+
+ context->nr_pending = 0;
+}
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 522f420..a51ee81 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -190,9 +190,9 @@ copy_file(char *fromfile, char *tofile)
/*
* We fsync the files later but first flush them to avoid spamming the
* cache and hopefully get the kernel to start writing them out before
- * the fsync comes. Ignore any error, since it's only a hint.
+ * the fsync comes.
*/
- (void) pg_flush_data(dstfd, offset, nbytes);
+ pg_flush_data(dstfd, offset, nbytes);
}
if (CloseTransientFile(dstfd))
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..5b8a765 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -61,6 +61,9 @@
#include <sys/file.h>
#include <sys/param.h>
#include <sys/stat.h>
+#ifndef WIN32
+#include <sys/mman.h>
+#endif
#include <unistd.h>
#include <fcntl.h>
#ifdef HAVE_SYS_RESOURCE_H
@@ -82,6 +85,8 @@
/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
#define PG_FLUSH_DATA_WORKS 1
+#elif !defined(WIN32) && defined(MS_ASYNC)
+#define PG_FLUSH_DATA_WORKS 1
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define PG_FLUSH_DATA_WORKS 1
#endif
@@ -380,29 +385,126 @@ pg_fdatasync(int fd)
}
/*
- * pg_flush_data --- advise OS that the data described won't be needed soon
+ * pg_flush_data --- advise OS that the described dirty data should be flushed
*
- * Not all platforms have sync_file_range or posix_fadvise; treat as no-op
- * if not available. Also, treat as no-op if enableFsync is off; this is
- * because the call isn't free, and some platforms such as Linux will actually
- * block the requestor until the write is scheduled.
+ * An offset of 0 with an amount of 0 means that the entire file should be
+ * flushed.
*/
-int
-pg_flush_data(int fd, off_t offset, off_t amount)
+void
+pg_flush_data(int fd, off_t offset, off_t nbytes)
{
#ifdef PG_FLUSH_DATA_WORKS
- if (enableFsync)
- {
+
+ /*
+ * Right now file flushing is primarily used so that later
+ * fsync()/fdatasync() calls have less impact. Thus don't trigger
+ * flushes if fsyncs are disabled - that's a decision we might want to
+ * make configurable at some point.
+ */
+ if (!enableFsync)
+ return;
+
#if defined(HAVE_SYNC_FILE_RANGE)
- return sync_file_range(fd, offset, amount, SYNC_FILE_RANGE_WRITE);
+ {
+ int rc = 0;
+
+ /*
+ * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
+ * tells the OS that writeback for the passed in blocks should be
+ * started, but that we don't want to wait for completion. Note that
+ * this call might block if too much dirty data exists in the range.
+ * This is the preferable method on OSs supporting it, as it works
+ * reliably when available (contrast to msync()) and doesn't flush out
+ * clean data (like FADV_DONTNEED).
+ */
+ rc = sync_file_range(fd, offset, nbytes,
+ SYNC_FILE_RANGE_WRITE);
+
+ /* don't error out, this is just a performance optimization */
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data: %m")));
+ }
+ }
+#elif !defined(WIN32) && defined(MS_ASYNC)
+ {
+ int rc = 0;
+ void *p;
+
+ /*
+ * On many OSs msync() on a mmap'ed file triggers writeback. On linux
+ * it only does so when MS_SYNC is specified, but then it does the
+ * writeback synchronously. Luckily all common linux systems have
+ * sync_file_range(). This is preferable over FADV_DONTNEED because
+ * it doesn't flush out clean data.
+ *
+ * We map the file (mmap()), tell the kernel to sync back the contents
+ * (msync()), and then remove the mapping again (munmap()).
+ */
+ p = mmap(NULL, context->nbytes,
+ PROT_READ | PROT_WRITE, MAP_SHARED,
+ context->fd, context->offset);
+ if (p == MAP_FAILED)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not mmap while flushing dirty data in file \"%s\": %m",
+ context->filename ? context->filename : "")));
+ goto out;
+ }
+
+ rc = msync(p, context->nbytes, MS_ASYNC);
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data in file \"%s\": %m",
+ context->filename ? context->filename : "")));
+ /* NB: need to fall through to munmap()! */
+ }
+
+ rc = munmap(p, context->nbytes);
+ if (rc != 0)
+ {
+ /* FATAL error because mapping would remain */
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not munmap while flushing blocks in file \"%s\": %m",
+ context->filename ? context->filename : "")));
+ }
+ }
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
- return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
+ {
+ int rc = 0;
+
+ /*
+ * Signal the kernel that the passed in range should not be cached
+ * anymore. This has the desired side effect of writing out dirty
+ * data, and the undesired side effect of likely discarding useful
+ * clean cached blocks. For the latter reason this is the least
+ * preferable method.
+ */
+
+ rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+ POSIX_FADV_DONTNEED);
+
+ /* don't error out, this is just a performance optimization */
+ if (rc != 0)
+ {
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data in file \"%s\": %m",
+ context->filename ? context->filename : "")));
+ goto out;
+ }
+ }
#else
#error PG_FLUSH_DATA_WORKS should not have been defined
#endif
- }
-#endif
- return 0;
+
+#endif /* PG_FLUSH_DATA_WORKS */
}
@@ -1289,6 +1391,24 @@ FilePrefetch(File file, off_t offset, int amount)
#endif
}
+void
+FileWriteback(File file, off_t offset, int amount)
+{
+ int returnCode;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileWriteback: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset, amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return;
+
+ pg_flush_data(VfdCache[file].fd, offset, amount);
+}
+
int
FileRead(File file, char *buffer, int amount)
{
@@ -2655,9 +2775,10 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
}
/*
- * We ignore errors from pg_flush_data() because this is only a hint.
+ * pg_flush_data() ignores errors, which is ok because this is only a
+ * hint.
*/
- (void) pg_flush_data(fd, 0, 0);
+ pg_flush_data(fd, 0, 0);
(void) CloseTransientFile(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f6b79a9..bb2b465 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -662,6 +662,55 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
#endif /* USE_PREFETCH */
}
+/*
+ * mdwriteback() -- Tell the kernel to write pages back to storage.
+ *
+ * This accepts a range of blocks because flushing several pages at once is
+ * considerably more efficient than doing so individually.
+ */
+void
+mdwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+
+ /*
+ * Issue flush requests in as few requests as possible; have to split at
+ * segment boundaries though, since those are actually separate files.
+ */
+ while (nblocks != 0)
+ {
+ int nflush = nblocks;
+ int segnum_start, segnum_end;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_RETURN_NULL);
+
+ /*
+ * We might be flushing buffers of already removed relations, that's
+ * ok, just ignore that case.
+ */
+ if (!v)
+ return;
+
+ /* compute offset inside the current segment */
+ segnum_start = blocknum / RELSEG_SIZE;
+
+ /* compute number of desired writes within the current segment */
+ segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
+ if (segnum_start != segnum_end)
+ nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE) );
+
+ Assert(nflush >= 1);
+ Assert(nflush <= nblocks);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ FileWriteback(v->mdfd_vfd, seekpos, BLCKSZ * nflush);
+
+ nblocks -= nflush;
+ blocknum += nflush;
+ }
+}
/*
* mdread() -- Read the specified block from a relation.
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 87ff358..2cae5aa 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,8 @@ typedef struct f_smgr
BlockNumber blocknum, char *buffer);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+ void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -66,8 +68,8 @@ typedef struct f_smgr
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
- mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
- mdpreckpt, mdsync, mdpostckpt
+ mdprefetch, mdread, mdwrite, mdwriteback, mdnblocks, mdtruncate,
+ mdimmedsync, mdpreckpt, mdsync, mdpostckpt
}
};
@@ -649,6 +651,19 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffer, skipFsync);
}
+
+/*
+ * smgrwriteback() -- Trigger kernel writeback for the supplied range of
+ * blocks.
+ */
+void
+smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks)
+{
+ (*(smgrsw[reln->smgr_which].smgr_writeback)) (reln, forknum, blocknum,
+ nblocks);
+}
+
/*
* smgrnblocks() -- Calculate the number of blocks in the
* supplied relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea5a09a..789efbc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2385,6 +2385,42 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpoint_flush_after", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &checkpoint_flush_after,
+ /* see bufmgr.h: OS dependent default */
+ DEFAULT_CHECKPOINT_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"backend_flush_after", PGC_USERSET, WAL_CHECKPOINTS,
+ gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &backend_flush_after,
+ /* see bufmgr.h: OS dependent default */
+ DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"bgwriter_flush_after", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &bgwriter_flush_after,
+ /* see bufmgr.h: OS dependent default */
+ DEFAULT_BGWRITER_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
+ NULL, NULL, NULL
+ },
+
+ {
{"max_worker_processes",
PGC_POSTMASTER,
RESOURCES_ASYNCHRONOUS,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index cbc4843..fe8b423 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -16,6 +16,7 @@
#define BUFMGR_INTERNALS_H
#include "storage/buf.h"
+#include "storage/bufmgr.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
@@ -208,16 +209,44 @@ extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
#define UnlockBufHdr(bufHdr) SpinLockRelease(&(bufHdr)->buf_hdr_lock)
+/*
+ * The PendingWriteback & WritebackContext structure are used to keep
+ * information about pending flush requests to be issued to the OS.
+ */
+typedef struct PendingWriteback
+{
+ /* could store different types of pending flushes here */
+ BufferTag tag;
+} PendingWriteback;
+
+/* typedef forward declared in bufmgr.h */
+typedef struct WritebackContext
+{
+ /* max number of writeback requests to coalesce */
+ int *max_pending;
+
+ /* current number of pending writeback requests */
+ int nr_pending;
+
+ /* pending requests */
+ PendingWriteback pending_writebacks[WRITEBACK_MAX_PENDING_FLUSHES];
+} WritebackContext;
+
/* in buf_init.c */
extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
+extern PGDLLIMPORT WritebackContext BackendWritebackContext;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
/*
- * Internal routines: only called by bufmgr
+ * Internal buffer management routines
*/
+/* bufmgr.c */
+extern void WritebackContextInit(WritebackContext *context, int *max_coalesce);
+extern void IssuePendingWritebacks(WritebackContext *context);
+extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 92c4bc5..a4b1b37 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -45,16 +45,36 @@ typedef enum
* replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;
+/* forward declared, to avoid having to expose buf_internals.h here */
+struct WritebackContext;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
/* in bufmgr.c */
+#define WRITEBACK_MAX_PENDING_FLUSHES 128
+
+/* FIXME: Also default to on for mmap && msync(MS_ASYNC)? */
+#ifdef HAVE_SYNC_FILE_RANGE
+#define DEFAULT_CHECKPOINT_FLUSH_AFTER 32
+#define DEFAULT_BACKEND_FLUSH_AFTER 16
+#define DEFAULT_BGWRITER_FLUSH_AFTER 64
+#else
+#define DEFAULT_CHECKPOINT_FLUSH_AFTER 0
+#define DEFAULT_BACKEND_FLUSH_AFTER 0
+#define DEFAULT_BGWRITER_FLUSH_AFTER 0
+#endif /* HAVE_SYNC_FILE_RANGE */
+
extern bool zero_damaged_pages;
extern int bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int target_prefetch_pages;
+extern int checkpoint_flush_after;
+extern int backend_flush_after;
+extern int bgwriter_flush_after;
+
/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;
@@ -209,7 +229,7 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
-extern bool BgBufferSync(void);
+extern bool BgBufferSync(struct WritebackContext *wb_context);
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..0f67760 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -75,6 +75,7 @@ extern int FileSync(File file);
extern off_t FileSeek(File file, off_t offset, int whence);
extern int FileTruncate(File file, off_t offset);
extern char *FilePathName(File file);
+extern void FileWriteback(File file, off_t offset, int amount);
/* Operations that allow use of regular stdio --- USE WITH CAUTION */
extern FILE *AllocateFile(const char *name, const char *mode);
@@ -112,7 +113,7 @@ extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
-extern int pg_flush_data(int fd, off_t offset, off_t amount);
+extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern void fsync_fname(char *fname, bool isdir);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a7267ea..0483fa3 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,6 +96,8 @@ extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
@@ -122,6 +124,8 @@ extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d96896b..f501f55 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1411,6 +1411,7 @@ Pattern_Type
PendingOperationEntry
PendingRelDelete
PendingUnlinkEntry
+PendingWriteback
PerlInterpreter
Perl_ppaddr_t
Permutation
@@ -2142,6 +2143,7 @@ WriteBytePtr
WriteDataPtr
WriteExtraTocPtr
WriteFunc
+WritebackContext
X509
X509_NAME
X509_NAME_ENTRY
--
2.7.0.229.g701fa7f
0002-Checkpoint-sorting-and-balancing.patch (text/x-patch; charset=us-ascii)
From 73e9eb9fa487aef370c0ffac710e71d0ee431b8d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 19 Feb 2016 12:17:51 -0800
Subject: [PATCH 2/4] Checkpoint sorting and balancing.
Up to now checkpoints were written in the order they're in the
BufferDescriptors. That's nearly random in a lot of cases, which
performs badly on rotating media, but even on SSDs it causes slowdowns.
To avoid that, sort checkpoints before writing them out. We currently
sort by tablespace, relfilenode, fork and block number.
Previously that wasn't done out of fear of imbalance between
tablespaces, so additionally balance writes between tablespaces.
Another concern was that the relatively large allocation to sort the
buffers in might fail, preventing checkpoints from happening. Thus
pre-allocate the required memory in shared memory, at server startup.
This particularly makes it more efficient to have checkpoint flushing
enabled, because that'll often result in a lot of writes that can be
coalesced into one flush.
TODO:
* remove debugging output
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
---
src/backend/storage/buffer/README | 5 -
src/backend/storage/buffer/buf_init.c | 22 ++-
src/backend/storage/buffer/bufmgr.c | 289 +++++++++++++++++++++++++++++-----
src/include/storage/buf_internals.h | 18 +++
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 291 insertions(+), 45 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index c4a7668..dc12c8c 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -267,11 +267,6 @@ only needs to take the lock long enough to read the variable value, not
while scanning the buffers. (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)
-During a checkpoint, the writer's strategy must be to write every dirty
-buffer (pinned or not!). We may as well make it start this scan from
-nextVictimBuffer, however, so that the first-to-be-written pages are the
-ones that backends might otherwise have to write for themselves soon.
-
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e10071d..bfa37f1 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -24,6 +24,7 @@ LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
LWLockTranche BufferIOLWLockTranche;
LWLockTranche BufferContentLWLockTranche;
WritebackContext BackendWritebackContext;
+CkptSortItem *CkptBufferIds;
/*
@@ -70,7 +71,8 @@ InitBufferPool(void)
{
bool foundBufs,
foundDescs,
- foundIOLocks;
+ foundIOLocks,
+ foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *)
@@ -104,10 +106,21 @@ InitBufferPool(void)
LWLockRegisterTranche(LWTRANCHE_BUFFER_CONTENT,
&BufferContentLWLockTranche);
- if (foundDescs || foundBufs || foundIOLocks)
+ /*
+ * The array used to sort to-be-checkpointed buffer ids is located in
+ * shared memory, to avoid having to allocate significant amounts of
+ * memory at runtime. As that'd be in the middle of a checkpoint, or when
+ * the checkpointer is restarted, memory allocation failures would be
+ * painful.
+ */
+ CkptBufferIds = (CkptSortItem *)
+ ShmemInitStruct("Checkpoint BufferIds",
+ NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+
+ if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
{
/* should find all of these, or none of them */
- Assert(foundDescs && foundBufs && foundIOLocks);
+ Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -190,5 +203,8 @@ BufferShmemSize(void)
/* to allow aligning the above */
size = add_size(size, PG_CACHE_LINE_SIZE);
+ /* size of checkpoint sort array in bufmgr.c */
+ size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+
return size;
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cdbda0f..7a13997 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "executor/instrument.h"
+#include "lib/binaryheap.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -75,6 +76,34 @@ typedef struct PrivateRefCountEntry
/* 64 bytes, about the size of a cache line on common systems */
#define REFCOUNT_ARRAY_ENTRIES 8
+/*
+ * Status of buffers to checkpoint for a particular tablespace, used
+ * internally in BufferSync.
+ */
+typedef struct CkptTsStatus
+{
+ /* oid of the tablespace */
+ Oid tsId;
+
+ /*
+ * Checkpoint progress for this tablespace. To make progress comparable
+ * between tablespaces the progress is, for each tablespace, measured as a
+ * number between 0 and the total number of to-be-checkpointed pages. Each
+ * page checkpointed in this tablespace increments this space's progress
+ * by progress_slice.
+ */
+ float8 progress;
+ float8 progress_slice;
+
+ /* number of to-be checkpointed pages in this tablespace */
+ int num_to_scan;
+ /* already processed pages in this tablespace */
+ int num_scanned;
+
+ /* current offset in CkptBufferIds for this tablespace */
+ int index;
+} CkptTsStatus;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -425,6 +454,8 @@ static void AtProcExit_Buffers(int code, Datum arg);
static void CheckForBufferLeaks(void);
static int rnode_comparator(const void *p1, const void *p2);
static int buffertag_comparator(const void *p1, const void *p2);
+static int ckpt_buforder_comparator(const void *pa, const void *pb);
+static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
@@ -1657,8 +1688,13 @@ BufferSync(int flags)
{
int buf_id;
int num_to_scan;
- int num_to_write;
+ int num_spaces;
+ int num_processed;
int num_written;
+ CkptTsStatus *per_ts_stat = NULL;
+ Oid last_tsid;
+ binaryheap *ts_heap;
+ int i;
int mask = BM_DIRTY;
WritebackContext wb_context;
@@ -1676,7 +1712,7 @@ BufferSync(int flags)
/*
* Loop over all buffers, and mark the ones that need to be written with
- * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
+ * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_scan), so that we
* can estimate how much work needs to be done.
*
* This allows us to write only those pages that were dirty when the
@@ -1690,7 +1726,7 @@ BufferSync(int flags)
* BM_CHECKPOINT_NEEDED still set. This is OK since any such buffer would
* certainly need to be written for the next checkpoint attempt, too.
*/
- num_to_write = 0;
+ num_to_scan = 0;
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1703,35 +1739,140 @@ BufferSync(int flags)
if ((bufHdr->flags & mask) == mask)
{
+ CkptSortItem *item;
+
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
- num_to_write++;
+
+ item = &CkptBufferIds[num_to_scan++];
+ item->buf_id = buf_id;
+ item->tsId = bufHdr->tag.rnode.spcNode;
+ item->relNode = bufHdr->tag.rnode.relNode;
+ item->forkNum = bufHdr->tag.forkNum;
+ item->blockNum = bufHdr->tag.blockNum;
}
UnlockBufHdr(bufHdr);
}
- if (num_to_write == 0)
+ if (num_to_scan == 0)
return; /* nothing to do */
-
WritebackContextInit(&wb_context, &checkpoint_flush_after);
- TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
+ TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
/*
- * Loop over all buffers again, and write the ones (still) marked with
- * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
- * since we might as well dump soon-to-be-recycled buffers first.
- *
- * Note that we don't read the buffer alloc count here --- that should be
- * left untouched till the next BgBufferSync() call.
+ * Sort buffers that need to be written to reduce the likelihood of random
+ * IO. The sorting is also important for the implementation of balancing
+ * writes between tablespaces. Without balancing writes we'd potentially
+ * end up writing to the tablespaces one-by-one; possibly overloading the
+ * underlying system.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ qsort(CkptBufferIds, num_to_scan, sizeof(CkptSortItem),
+ ckpt_buforder_comparator);
+
+ num_spaces = 0;
+
+ /*
+ * Allocate progress status for each tablespace with buffers that need to
+ * be flushed. This requires the to-be-flushed array to be sorted.
+ */
+ last_tsid = InvalidOid;
+ for (i = 0; i < num_to_scan; i++)
+ {
+ CkptTsStatus *s;
+ Oid cur_tsid;
+
+ cur_tsid = CkptBufferIds[i].tsId;
+
+ /*
+ * Grow array of per-tablespace status structs, every time a new
+ * tablespace is found.
+ */
+ if (last_tsid == InvalidOid || last_tsid != cur_tsid)
+ {
+ Size sz;
+
+ num_spaces++;
+
+ /*
+ * Not worth adding grow-by-power-of-2 logic here - even with a
+ * few hundred tablespaces this will be fine.
+ */
+ sz = sizeof(CkptTsStatus) * num_spaces;
+
+ if (per_ts_stat == NULL)
+ per_ts_stat = (CkptTsStatus *) palloc(sz);
+ else
+ per_ts_stat = (CkptTsStatus *) repalloc(per_ts_stat, sz);
+
+ s = &per_ts_stat[num_spaces - 1];
+ memset(s, 0, sizeof(*s));
+ s->tsId = cur_tsid;
+
+ /*
+ * The first buffer in this tablespace. As CkptBufferIds is sorted
+ * by tablespace all (s->num_to_scan) buffers in this tablespace
+ * will follow afterwards.
+ */
+ s->index = i;
+
+ /*
+ * progress_slice will be determined once we know how many buffers
+ * are in each tablespace, i.e. after this loop.
+ */
+
+ last_tsid = cur_tsid;
+ }
+ else
+ {
+ s = &per_ts_stat[num_spaces - 1];
+ }
+
+ s->num_to_scan++;
+ }
+
+ Assert(num_spaces > 0);
+
+ /*
+ * Build a min-heap over the write-progress in the individual tablespaces,
+ * and compute how large a portion of the total progress a single
+ * processed buffer is.
+ */
+ ts_heap = binaryheap_allocate(num_spaces,
+ ts_ckpt_progress_comparator,
+ NULL);
+
+ for (i = 0; i < num_spaces; i++)
+ {
+ CkptTsStatus *ts_stat = &per_ts_stat[i];
+
+ ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
+
+ binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
+ }
+
+ binaryheap_build(ts_heap);
+
+ /*
+ * Iterate through to-be-checkpointed buffers and write the ones (still)
+ * marked with BM_CHECKPOINT_NEEDED. The writes are balanced between
+ * tablespaces.
+ */
+ num_processed = 0;
num_written = 0;
- while (num_to_scan-- > 0)
+ while (!binaryheap_empty(ts_heap))
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ BufferDesc *bufHdr = NULL;
+ CkptTsStatus *ts_stat = (CkptTsStatus *)
+ DatumGetPointer(binaryheap_first(ts_heap));
+
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
/*
* We don't need to acquire the lock here, because we're only looking
@@ -1752,31 +1893,52 @@ BufferSync(int flags)
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
num_written++;
+ }
+ }
- /*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
+ /*
+ * Measure progress independent of actually having to flush the buffer
+ * - otherwise writes become unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
- /*
- * Sleep to throttle our I/O rate.
- */
- CheckpointWriteDelay(flags, (double) num_written / num_to_write);
- }
+ /* Have all the buffers from the tablespace been processed? */
+ if (ts_stat->num_scanned == ts_stat->num_to_scan)
+ {
+ binaryheap_remove_first(ts_heap);
+ }
+ else
+ {
+ /* update heap with the new progress */
+ binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
}
- if (++buf_id >= NBuffers)
- buf_id = 0;
+ /*
+ * Sleep to throttle our I/O rate.
+ */
+ CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+
+/* #define CHECKPOINT_PROGRESS */
+#ifdef CHECKPOINT_PROGRESS
+ /* FIXME: remove before commit */
+ /* delete current content of the line, print progress */
+ fprintf(stderr, "\33[2K\rto_scan: %d, scanned: %d, %%processed: %.2f, %%writeouts: %.2f",
+ num_to_scan, num_processed,
+ (((double) num_processed) / num_to_scan) * 100,
+ ((double) num_written / num_processed) * 100);
+#endif
}
+#ifdef CHECKPOINT_PROGRESS
+ fprintf(stderr, "\n");
+#endif
+
+ /* issue all pending flushes */
+ IssuePendingWritebacks(&wb_context);
+
+ pfree(per_ts_stat);
+ per_ts_stat = NULL;
/*
* Update checkpoint statistics. As noted above, this doesn't include
@@ -3754,6 +3916,59 @@ rnode_comparator(const void *p1, const void *p2)
return 0;
}
+/*
+ * Comparator determining the writeout order in a checkpoint.
+ *
+ * It is important that tablespaces are compared first; the logic balancing
+ * writes between tablespaces relies on it.
+ */
+static int
+ckpt_buforder_comparator(const void *pa, const void *pb)
+{
+ const CkptSortItem *a = (CkptSortItem *) pa;
+ const CkptSortItem *b = (CkptSortItem *) pb;
+
+ /* compare tablespace */
+ if (a->tsId < b->tsId)
+ return -1;
+ else if (a->tsId > b->tsId)
+ return 1;
+ /* compare relation */
+ if (a->relNode < b->relNode)
+ return -1;
+ else if (a->relNode > b->relNode)
+ return 1;
+ /* compare fork */
+ else if (a->forkNum < b->forkNum)
+ return -1;
+ else if (a->forkNum > b->forkNum)
+ return 1;
+ /* compare block number */
+ else if (a->blockNum < b->blockNum)
+ return -1;
+ else /* should not be the same block anyway... */
+ return 1;
+}
+
+/*
+ * Comparator for a Min-Heap over the per-tablespace checkpoint completion
+ * progress.
+ */
+static int
+ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
+{
+ CkptTsStatus *sa = (CkptTsStatus *) a;
+ CkptTsStatus *sb = (CkptTsStatus *) b;
+
+ /* we want a min-heap, so return 1 if a < b */
+ if (sa->progress < sb->progress)
+ return 1;
+ else if (sa->progress == sb->progress)
+ return 0;
+ else
+ return -1;
+}
+
/*
* BufferTag comparator.
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fe8b423..de84bc4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -239,6 +239,24 @@ extern PGDLLIMPORT WritebackContext BackendWritebackContext;
/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;
+/* in bufmgr.c */
+
+/*
+ * Structure to sort buffers per file on checkpoints.
+ *
+ * This structure is allocated per buffer in shared memory, so it should be
+ * kept as small as possible.
+ */
+typedef struct CkptSortItem
+{
+ Oid tsId;
+ Oid relNode;
+ ForkNumber forkNum;
+ BlockNumber blockNum;
+ int buf_id;
+} CkptSortItem;
+
+extern CkptSortItem *CkptBufferIds;
/*
* Internal buffer management routines
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f501f55..b850db0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -283,6 +283,8 @@ CheckpointerRequest
CheckpointerShmemStruct
Chromosome
City
+CkptSortItem
+CkptTsStatus
ClientAuthentication_hook_type
ClientData
ClonePtr
--
2.7.0.229.g701fa7f
Hello Andres,
Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.
I will look into these two in depth.
Note that I would have ordered them in reverse because sorting is nearly
always very beneficial, and "writeback" (formely called flushing) is then
nearly always very beneficial on sorted buffers.
--
Fabien.
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
Hello Andres,
Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.
I will look into these two in depth.
Note that I would have ordered them in reverse because sorting is nearly
always very beneficial, and "writeback" (formely called flushing) is then
nearly always very beneficial on sorted buffers.
I had it that way earlier. I actually saw pretty large regressions from
sorting alone in some cases as well, apparently because the kernel
submits much larger IOs to disk; although that probably only shows on
SSDs. This way the modifications imo look a trifle better ;). I'm
intending to commit both at the same time, keeping them separate only
because they're easier to understand separately.
Andres
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Kernel 3.2 is extremely bad for PostgreSQL, as the VM seems to amplify IO
somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
Interesting! To summarize: a 25% performance degradation from the best kernel
(2.6.32) to the worst (3.2.0), which is indeed significant.
As far as I recall, the OS cache eviction is very aggressive in 3.2,
so it would be possible that data from the FS cache that was just read
could be evicted even if it was not used yet. This represents a large
difference when the database does not fit in RAM.
--
Michael
Hello Andres,
For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats Linux's
delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore
I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
understand. Some language polishing might also be needed.
Yep, see below.
* review of the writeback API, combined with the smgr/md.c changes.
See various comments below.
* Currently *_flush_after can be set to a nonzero value, even if there's
no support for flushing on that platform. Imo that's ok, but perhaps
other people's opinion differ.
In some previous version I think a warning was shown if the feature was
requested but not available.
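Something like the sketch below could bring that back; the hook name is
hypothetical and it assumes PG_FLUSH_DATA_WORKS were exported to a header,
which is currently not the case:

    /* sketch: GUC check hook warning when the platform cannot flush */
    static bool
    check_flush_after(int *newval, void **extra, GucSource source)
    {
    #ifndef PG_FLUSH_DATA_WORKS
        if (*newval > 0)
            ereport(WARNING,
                    (errmsg("flush_after is set but flushing is not supported on this platform")));
    #endif
        return true;        /* keep the value, it simply has no effect */
    }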
Here are some quick comments on the patch:
Patch applies cleanly on head. Compiled and checked on Linux. Compilation
issues on other systems, see below.
When pages are written by a process (checkpointer, bgwriter, backend worker),
a list of recently written pages is kept, and every so often an advisory
fsync (sync_file_range on Linux, other options on other systems) is issued so
that the data is sent to the IO system without relying on more or less
(un)controllable OS policy.
The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything looks kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.
The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".
A good point of the whole approach is that it is available to all kinds
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine to have the flushing logic there. I'm wondering why you
chose 16 & 64 as defaults for backends & bgwriter, though.
IssuePendingWritebacks: you merge only strictly neighboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?
mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.
The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back one day if it is needed, which I rather doubt.
struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.
ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?
IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost when several tablespaces are used. Any thoughts?
Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think, as it is already checked at initialization and when setting
the GUCs.
SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.
pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.
For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?
--
Fabien.
Hi,
On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
* Currently *_flush_after can be set to a nonzero value, even if there's
no support for flushing on that platform. Imo that's ok, but perhaps
other people's opinion differ.
In some previous version I think a warning was shown if the feature was
requested but not available.
I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.
Here are some quick comments on the patch:
Patch applies cleanly on head. Compiled and checked on Linux. Compilation
issues on other systems, see below.
For those I've already pushed a small fixup commit to git... Stupid
mistake.
The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything look kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.
Hm.
The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers.
The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
it depends on compilation settings how much data it actually is and
such.
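For instance (my own arithmetic, assuming the default 8kB block size): a
setting of checkpoint_flush_after = 256kB works out to 256kB / 8kB = 32
buffers, whereas the same figure expressed as a buffer count would silently
change meaning under a non-default --with-blocksize build.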
Also, the maximum value (128 ?) should appear in the text.
Right.
In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, but I think that you have made some tests
so maybe you have an answer and the corresponding section could be
written with some more definitive text than "probably brings no
benefit on SSD".
Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.
A good point of the whole approach is that it is available to all kinds
of pg processes.
Exactly.
However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic...
The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either... I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.
well, all this is best kept for later, and I'm fine to have the
flushing logic there. I'm wondering why you chose 16 & 64 as defaults
for backends & bgwriter, though.
I chose a small value for backends because there often are a large
number of backends, and thus the amount of dirty data of each adds up. I
used a larger value for bgwriter because I saw it ending up issuing
bigger IOs.
IssuePendingWritebacks: you merge only strictly neighboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?
I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.
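To make the strict-neighbor rule concrete, here is a hypothetical sketch (not
the patch's IssuePendingWritebacks; the function name and BLCKSZ constant are
assumptions for the example): one sync_file_range call is issued per run of
contiguous block numbers in a sorted array.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

#define BLCKSZ 8192                     /* assumed block size */

static void
writeback_contiguous_runs(int fd, const uint32_t *blocks, int n)
{
    int i = 0;

    while (i < n)
    {
        int start = i;

        /* extend the run while the next block is strictly adjacent */
        while (i + 1 < n && blocks[i + 1] == blocks[i] + 1)
            i++;
        (void) sync_file_range(fd,
                               (off_t) blocks[start] * BLCKSZ,
                               (off_t) (i - start + 1) * BLCKSZ,
                               SYNC_FILE_RANGE_WRITE);
        i++;
    }
}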
mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out.
Right.
ISTM that putting writeback management at the relation level does not
help a lot, because you have to translate again from relation to
files.
Sure, but what's the problem with that? That's how normal read/write IO
works as well?
struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.
It has, it's just in WritebackContextInit(). Can duplicate it.
ScheduleBufferTagForWriteback: the "pending" variable is not very
useful.
Shortens line length a good bit, at no cost.
IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost when several tablespaces are used. Any thoughts?
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.
Don't do things while holding a lock that don't require said lock. A pin
prevents a buffer changing its identity.
For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?
Several reasons: Most importantly there's absolutely no guarantee that
you'll ever end up sleeping; it's quite common for that to happen only
seldom. If you're bottlenecked on IO, you can end up being behind all
the time. But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice. It'd make the writeback infrastructure less generic. I also
don't really believe it helps that much, although that's a complex
argument to make.
Thanks for the review!
Andres
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund <andres@anarazel.de> wrote:
The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything look kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.
Hm.
I think there might be a semantic distinction between these two terms.
Doesn't writeback mean writing pages to disk, and flushing mean making
sure that they are durably on disk? So for example when the Linux
kernel thinks there is too much dirty data, it initiates writeback,
not a flush; on the other hand, at transaction commit, we initiate a
flush, not writeback.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hallo Andres,
In some previous version I think a warning was shown if the feature was
requested but not available.
I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.
I like "ignoring with a warning" in the log file, because when things do
not behave as expected that is where I'll be looking. I do not think that
it should error out.
The sgml documentation about "*_flush_after" configuration parameter
talks about bytes, but the actual unit should be buffers.
The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
it depends on compilation settings how much data it actually is and
such.
So I understand that it works with kB as well. Now I do not think that it
would need a lot of explanation if you say that it is a number of pages,
and I think that a number of pages is significant because it is a number
of IO requests to be coalesced, eventually.
In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, [...]
Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.
Ok, fine with me. Does that mean that flushing has a significant positive
impact on SSD in your tests?
However it does not address the point that bgwriter and backends
basically issue random writes, [...]
The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either... I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.
So I understand that there is a positive impact under some load. Good!
Maybe the merging strategy could be more aggressive than just strict
neighbors?
I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls.
Ok. Maybe the neighbor definition could be relaxed just a little bit so
that small holes are stepped over, but not large holes? If there are only
a few pages in between, even if written by another process, then writing
them together should be better? Well, this can wait for a clear case,
because hopefully the OS will recoalesce them behind the scenes anyway.
struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.
It has, it's just in WritebackContextInit(). Can duplicate it.
I missed it, I expected something in the struct definition. Do not
duplicate, but cross reference it?
IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.
Yep, but not much benefit to expect from a few dozen random pages either.
[...] I do think that this whole writeback logic really does make sense
*per table space*,
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
I do not understand/buy this argument.
The underlying IO queue is per device, and table spaces should be per
device as well (otherwise what's the point?), so you should want to coalesce
and "writeback" pages per device as well. sync_file_range calls on
distinct devices can probably be issued more or less in any order, and
should not interfere with one another.
If you use just one context, the more table spaces the less performance
gains, because there is less and less aggregation thus sequential writes
per device.
So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them,
which would not be that bad, and I think that it is what is needed.
For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?
Several reasons: Most importantly there's absolutely no guarantee that
you'll ever end up sleeping; it's quite common for that to happen only seldom.
Well, that would be under a situation when pg is completely unresponsive.
More so, this behavior *makes* pg unresponsive.
If you're bottlenecked on IO, you can end up being behind all the time.
Hopefully sorting & flushing should improve this situation a lot.
But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice. It'd make the writeback infrastructure less generic.
Sure. It would be sufficient to have a call to ask for writebacks
independently of the number of writebacks accumulated in the queue, it
does not need to change the infrastructure.
Also, I think that such a call would make sense at the end of the
checkpoint.
I also don't really believe it helps that much, although that's a
complex argument to make.
Yep. My thinking is that doing things in the sleeping interval does not
interfere with the checkpointer scheduling, so it is less likely to go
wrong and fall behind.
--
Fabien.
Hallo Andres,
[...] I do think that this whole writeback logic really does make sense
*per table space*,
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
I do not understand/buy this argument.
The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. sync_file_range calls on distinct
devices can probably be issued more or less in any order, and should not
interfere with one another.
If you use just one context, the more table spaces the less performance
gains, because there is less and less aggregation thus sequential writes per
device.
So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them, which
would not be that bad, and I think that it is what is needed.
Note: I think that an easy way to do that in the "checkpoint sort" patch
is simply to keep a WritebackContext in the CkptTsStatus structure, which is
per tablespace in the checkpointer.
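Roughly like this (a sketch only; the real CkptTsStatus carries more
bookkeeping, and the wb_context field name is my invention):

typedef struct CkptTsStatus
{
    Oid         tsId;               /* oid of the tablespace */
    int         num_to_scan;        /* buffers to write for this tablespace */
    int         num_scanned;        /* buffers written so far */
    /* ... progress fields as in the patch ... */
    WritebackContext wb_context;    /* per-tablespace pending writebacks */
} CkptTsStatus;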
For bgwriter & backends it can wait; there is little "writeback" coalescing
because the IO should be pretty random, so it does not matter much.
--
Fabien.
Hallo Andres,
Here is a review for the second patch.
For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
it doesn't take a lot of time.
I put it there initially to demonstrate that there was no cache performance
issue when sorting on just buffer indexes. As it is always small, I agree
that it is not needed. Well, it could be still be in seconds on a very
large shared buffers setting with a very large checkpoint, but then the
checkpoint would be tremendously huge...
* Minor comment polishing.
Patch applies and checks on Linux.
* CkptSortItem:
I think that allocating 20 bytes per buffer in shared memory is a little
on the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4
bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also,
there are very few tablespaces, they could be given a small number and
this number could be used instead of the Oid, so the space requirement
could be reduced to say 16 bytes per buffer by combining space & fork in 2
shorts and keeping 4 byte alignment and also getting 8 byte
alignment... If this is too much, I have shown that it can work with only
4 bytes per buffer, as the sorting is really just a performance
optimisation and is not broken if some stuff changes between sorting &
writeback, but you did not like the idea. If the amount of shared memory
required is a significant concern, it could be resurrected, though.
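For illustration, a 16-byte layout along those lines could look like this
(entirely hypothetical packing, not the patch's CkptSortItem):

#include <stdint.h>

typedef struct CkptSortItemPacked
{
    uint32_t relNode;       /* relation file node */
    uint32_t blockNum;      /* block number within the relation */
    uint16_t tsIndex;       /* small per-checkpoint tablespace number,
                             * assigned instead of storing the full Oid */
    uint16_t forkNum;       /* fork number, really needs only a few bits */
    uint32_t buf_id;        /* buffer to write out */
} CkptSortItemPacked;       /* 4+4+2+2+4 = 16 bytes, 4-byte aligned */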
* CkptTsStatus:
As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.
ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.
If these fields are kept, I think that a comment should justify why float8
precision is okay for the purpose. I think it is quite certainly fine in
the worst case with 32 bits buffer_ids, but it would not be if this size
is changed someday.
* BufferSync
After a first sweep to collect buffers to write, they are sorted, and then
those buffers are swept again to compute some per-tablespace data and
organise a heap.
ISTM that nearly all of the collected data on the second sweep could be
collected on the first sweep, so that this second sweep could be avoided
altogether. The only missing data is the index of the first buffer in the
array, which can be computed by considering tablespaces only, sweeping
over buffers is not needed. That would suggest creating the heap or using
a hash in the initial buffer sweep to keep this information. This would
also provide a point at which to number tablespaces for compressing the
CkptSortItem struct.
I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense. This remark is independent of
this patch. Probably it works fine because after a sleep the checkpointer
is behind enough so that it will write a bunch of buffers before sleeping
again.
I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?
There is some debug stuff to remove in #ifdefs.
I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.
I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
inclined to leave it off till then.
I do not have that. As "msync" seems available on Linux, it is possible to
force using it with an "#if 0" to skip sync_file_range and check whether
it does some good there. Idem for the "posix_fadvise" stuff. I can try to
do that, but it takes time; if someone can test on other OSes it
would be much better. I think that if it works it should be kept in, so it
is just a matter of testing it.
--
Fabien.
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, [...]
Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.
Ok, fine with me. Does that mean that flushing has a significant positive
impact on SSD in your tests?
Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with an
internal block size bigger than 8k, you'd even see larger differences.
Maybe the merging strategy could be more aggressive than just strict
neighbors?
I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls.
Ok. Maybe the neighbor definition could be relaxed just a little bit so
that small holes are stepped over, but not large holes? If there are only
a few pages in between, even if written by another process, then writing
them together should be better? Well, this can wait for a clear case,
because hopefully the OS will recoalesce them behind the scenes anyway.
I'm against doing so without clear measurements of a benefit.
Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.
Yep, but not much benefit to expect from a few dozen random pages either.
Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?
[...] I do think that this whole writeback logic really does make
sense *per table space*,
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
I do not understand/buy this argument.
The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. sync_file_range calls on distinct
devices can probably be issued more or less in any order, and should not
interfere with one another.
The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.
If you use just one context, the more table spaces the less performance
gains, because there is less and less aggregation thus sequential writes per
device.
So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them, which
would not be that bad, and I think that it is what is needed.
That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.
For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?
Several reasons: Most importantly there's absolutely no guarantee that
you'll ever end up sleeping; it's quite common for that to happen only seldom.
Well, that would be under a situation when pg is completely unresponsive.
More so, this behavior *makes* pg unresponsive.
No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly. It'll just sometimes block in
sync_file_range(), but the IO queues will have enough space to
frequently give way to other backends, particularly to synchronous reads
(most pg reads) and synchronous writes (fdatasync()). So a single
checkpoint will take a bit longer, but otherwise the system will mostly
keep up the work in a regular manner. Without the sync_file_range()
calls the kernel will amass dirty buffers until global dirty limits are
reached, which then will bring the whole system to a standstill.
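(On Linux those global limits are the vm.dirty_background_ratio and
vm.dirty_ratio thresholds; once the latter is exceeded, every process doing
writes is throttled system-wide.)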
It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out, in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.
I also don't really believe it helps that much, although that's a complex
argument to make.
Yep. My thinking is that doing things in the sleeping interval does not
interfere with the checkpointer scheduling, so it is less likely to go wrong
and fall behind.
I don't really see why that's the case. Triggering writeback every N
writes doesn't really influence the scheduling in a bad way - the
flushing is done *before* computing the sleep time. Triggering the
writeback *after* computing the sleep time, and then sleep for that
long, in addition of the time for sync_file_range, skews things more.
Greetings,
Andres Freund
Hi,
On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
* CkptSortItem:
I think that allocating 20 bytes per buffer in shared memory is a little on
the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4 bytes
to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
are very few tablespaces, they could be given a small number and this number
could be used instead of the Oid, so the space requirement could be reduced
to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
4 byte alignment and also getting 8 byte alignment... If this is too
much, I have shown that it can work with only 4 bytes per buffer, as the
sorting is really just a performance optimisation and is not broken if some
stuff changes between sorting & writeback, but you did not like the idea. If
the amount of shared memory required is a significant concern, it could be
resurrected, though.
This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far. And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.
ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.
They don't cause much space usage, and we access the values
frequently. So why not store them?
If these fields are kept, I think that a comment should justify why float8
precision is okay for the purpose. I think it is quite certainly fine in the
worst case with 32 bits buffer_ids, but it would not be if this size is
changed someday.
That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.
ISTM that nearly all of the collected data on the second sweep could be
collected on the first sweep, so that this second sweep could be avoided
altogether. The only missing data is the index of the first buffer in the
array, which can be computed by considering tablespaces only, sweeping over
buffers is not needed. That would suggest creating the heap or using a hash
in the initial buffer sweep to keep this information. This would also
provide a point where to number tablespaces for compressing the CkptSortItem
struct.
Doesn't seem worth the complexity to me.
I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense.
Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.
I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?
Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...
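The pattern being suggested would look roughly like this (a sketch against
the memory-context API of that era; that BufferSync would wrap exactly these
allocations is my assumption):

/* requires "postgres.h" and "utils/memutils.h" */
MemoryContext sortcxt;
MemoryContext oldcxt;

sortcxt = AllocSetContextCreate(CurrentMemoryContext,
                                "checkpoint buffer sort",
                                ALLOCSET_DEFAULT_MINSIZE,
                                ALLOCSET_DEFAULT_INITSIZE,
                                ALLOCSET_DEFAULT_MAXSIZE);
oldcxt = MemoryContextSwitchTo(sortcxt);
/* ... binary_heap_allocate() and friends allocate in sortcxt ... */
MemoryContextSwitchTo(oldcxt);
MemoryContextDelete(sortcxt);   /* frees every allocation in it at once */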
I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
inclined to leave it off till then.
I do not have that. As "msync" seems available on Linux, it is possible to
force using it with a "ifdef 0" to skip sync_file_range and check whether it
does some good there.
Unfortunately it doesn't work well on linux:
* On many OSs msync() on a mmap'ed file triggers writeback. On linux
* it only does so when MS_SYNC is specified, but then it does the
* writeback synchronously. Luckily all common linux systems have
* sync_file_range(). This is preferable over FADV_DONTNEED because
* it doesn't flush out clean data.
I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable on freebsd...
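Such a demo can be as small as this (my reconstruction of the idea, not
Andres's actual program); whether writeback really starts has to be observed
externally, e.g. with iostat:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    int   fd = open("demo.dat", O_RDWR | O_CREAT, 0600);
    char *p;

    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    memset(p, 'x', 4096);               /* dirty the mapped page */
    if (msync(p, 4096, MS_ASYNC) != 0)  /* request asynchronous writeback */
        perror("msync");
    munmap(p, 4096);
    close(fd);
    return 0;
}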
Idem for the "posix_fadvise" stuff. I can try to do
that, but it takes time; if someone can test on other OSes it would
be much better. I think that if it works it should be kept in, so it is just
a matter of testing it.
I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadvise
available...
Greetings,
Andres Freund
[...] I do think that this whole writeback logic really does make
sense *per table space*,
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
I do not understand/buy this argument.
The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. sync_file_range calls on distinct
devices can probably be issued more or less in any order, and should not
interfere with one another.
The kernel's dirty buffer accounting is global, not per block device.
Sure, but this is not my point. My point is that "sync_file_range" moves
buffers to the device io queues, which are per device. If there is one
queue in pg and many queues on many devices, the whole point of coalescing
to get sequential writes is somehow lost.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.
Ok, some people would not benefit if they use many tablespaces on one
device; too bad, but that does not look like a very useful setting anyway,
and I do not think it would harm much in this case.
If you use just one context, the more table spaces the less performance
gains, because there is less and less aggregation thus sequential writes per
device.
So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them, which
would not be that bad, and I think that it is what is needed.
That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.
ISTM that you would need a significant number of tablespaces to see the
benefit. If you do not do that, the more table spaces the more random the
IOs, which is disappointing. Also, "the cost is marginal", so I do not see
any good argument not to do it.
--
Fabien.
ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.
They don't cause much space usage, and we access the values frequently.
So why not store them?
The same question would work the other way around: these values are one
division away, why not compute them when needed? No big deal.
[...] Given realistic amounts of memory the max potential "skew" seems
fairly small with float8. If we ever flush one buffer "too much" for a
tablespace it's pretty much harmless.
I do agree. I'm suggesting that a comment should be added to justify why
float8 accuracy is okay.
I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?
Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...
I'm not sure that a memory context is justified here, there are only two
mallocs and the checkpointer works for very long times. I think that it is
simpler to just get the malloc/free right.
[...] I'm not arguing for ripping it out, what I mean is that we don't
set a nondefault value for the GUCs on platforms with just
posix_fadvise available...
Ok with that.
--
Fabien.
Hallo Andres,
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost when several tablespaces are used. Any thoughts?
Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other. For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.
I did a quick & small test with random updates on 16 tables with
checkpoint_flush_after=16 checkpoint_timeout=30
(1) with 16 tablespaces (1 per table, but same disk) :
tps = 1100, 27% time under 100 tps
(2) with 1 tablespace :
tps = 1200, 3% time under 100 tps
This result is logical: with one writeback context shared between
tablespaces the sync_file_range is issued on a few buffers per file at a
time on the 16 files, no coalescing occurs there, so this results in random
IOs, while with one tablespace all writes are aggregated per file.
ISTM that this quick test shows that writeback contexts are relevant per
tablespace, as I expected.
--
Fabien.
I did a quick & small test with random updates on 16 tables with
checkpoint_flush_after=16 checkpoint_timeout=30
Another run with more "normal" settings and over 1000 seconds, so less
"quick & small" that the previous one.
checkpoint_flush_after = 16
checkpoint_timeout = 5min # default
shared_buffers = 2GB # 1/8 of available memory
Random updates on 16 tables which total to 1.1GB of data, so this is in
buffer, no significant "read" traffic.
(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
--
Fabien.
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
I did a quick & small test with random updates on 16 tables with
checkpoint_flush_after=16 checkpoint_timeout=30
Another run with more "normal" settings and over 1000 seconds, so less
"quick & small" than the previous one.
checkpoint_flush_after = 16
checkpoint_timeout = 5min # default
shared_buffers = 2GB # 1/8 of available memory
Random updates on 16 tables which total to 1.1GB of data, so this is in
buffer, no significant "read" traffic.
(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.
Could you share the exact details of that workload?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.
Hmm ... that kernel commit is less than 4 months old. Would it be
reflected in *any* production kernels yet?
regards, tom lane
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.
Hmm ... that kernel commit is less than 4 months old. Would it be
reflected in *any* production kernels yet?
Probably not - so far I thought it mainly has some performance benefits
on relatively extreme workloads; where without the patch, flushing still
is better performancewise than not flushing. But in the scenario Fabien
has brought up it seems quite possible that sync_file_range emitting
"storage cache flush" instructions could explain the rather large
performance difference between his and my experiments.
Regards,
Andres
Random updates on 16 tables which total to 1.1GB of data, so this is in
buffer, no significant "read" traffic.
(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median q3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.
Could you share the exact details of that workload?
See attached scripts (sh to create the 16 tables in the default or 16
table spaces, small sql bench script, stat computation script).
The per-second stats were computed with:
grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300
Host is 8 cpu 16 GB, 2 HDD in RAID 1.
--
Fabien.
Hi,
On 02/18/2016 11:31 AM, Andres Freund wrote:
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently. The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.
0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
I've pushed these after some more polishing, now working on the next
two.
I've finally had time to do some benchmarks on those two (already
committed) pieces. I've promised to do more testing while discussing the
patches with Andres some time ago, so here we go.
I do have two machines I use for this kind of benchmark:
1) HP DL380 G5 (old rack server)
- 2x Xeon E5450, 16GB RAM (8 cores)
- 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
- RedHat 6
- shared_buffers = 4GB
- min_wal_size = 2GB
- max_wal_size = 6GB
2) workstation with i5 CPU
- 1x i5-2500k, 8GB RAM
- 6x Intel S3700 100GB (in RAID0 for this benchmark)
- Gentoo
- shared_buffers = 2GB
- min_wal_size = 1GB
- max_wal_size = 8GB
Both machines were using the same kernel version 4.4.2 and the default io
scheduler (cfq).
The test procedure was quite simple - pgbench with three different
scales, for each scale three runs, 1h per run (and 30 minutes of warmup
before each run).
Due to the difference in amount of RAM, each machine used different
scales - the goal is to have small, ~50% RAM, >200% RAM sizes:
1) Xeon: 100, 400, 6000
2) i5: 50, 200, 3000
The commits actually tested are
cfafd8be (right before the first patch)
7975c5e0 Allow the WAL writer to flush WAL at a reduced rate.
db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ...
For the Xeon, the total tps for each run looks like this:
scale commit 1 2 3
----------------------------------------------------
100 cfafd8be 5136 5132 5144
7975c5e0 5172 5148 5164
db76b1ef 5131 5139 5131
400 cfafd8be 3049 3042 2880
7975c5e0 3038 3026 3027
db76b1ef 2946 2940 2933
6000 cfafd8be 394 389 391
7975c5e0 391 479 467
db76b1ef 443 416 481
So I'd say not much difference, except for the largest data set where
the improvement is visible (although it's a bit too noisy and additional
runs would be useful).
On the i5 workstation with SSDs, the results look like this:
scale commit 1 2 3
------------------------------------------------
50 cfafd8be 5478 5486 5485
7975c5e0 5473 5468 5436
db76b1ef 5484 5453 5452
200 cfafd8be 5169 5176 5167
7975c5e0 5144 5151 5148
db76b1ef 5162 5131 5131
3000 cfafd8be 2392 2367 2359
7975c5e0 2301 2340 2347
db76b1ef 2277 2348 2342
So pretty much no difference, or perhaps a slight slowdown.
One of the goals of this thread (as I understand it) was to make the
overall behavior smoother - eliminate sudden drops in transaction rate
due to bursts of random I/O etc.
One way to look at this is in terms of how much the tps fluctuates, so
let's see some charts. I've collected per-second tps measurements (using
the aggregation built into pgbench) but looking at that directly is
pretty pointless because it's very difficult to compare two noisy lines
jumping up and down.
So instead let's see CDF of the per-second tps measurements. I.e. we
have 3600 tps measurements, and given a tps value the question is what
percentage of the measurements is below this value.
y = Probability(tps <= x)
We prefer higher values, and the ideal behavior would be that we get
exactly the same tps every second. Thus an ideal CDF line would be a
step line. Of course, that's rarely the case in practice. But comparing
two CDF curves is easy - the line more to the right is better, at least
for tps measurements, where we prefer higher values.
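For concreteness, the empirical CDF can be computed from the per-second
samples along these lines (my sketch in C; Tomas's own tooling may differ):

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Print (x, P(tps <= x)) pairs: after sorting the n samples, the
 * empirical CDF at the i-th smallest sample is (i + 1) / n. */
static void
print_cdf(double *tps, int n)
{
    qsort(tps, n, sizeof(double), cmp_double);
    for (int i = 0; i < n; i++)
        printf("%g\t%g\n", tps[i], (double) (i + 1) / n);
}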
1) tps-xeon.png
The original behavior (red lines) is quite consistent. The two patches
generally seem to improve the performance, although sadly it seems that
the variability of the performance actually increased quite a bit, as
the CDFs are much wider (but generally to the right of the old ones).
I'm not sure what exactly causes the volatility.
2) maxlat-xeon.png
Another view at the per-second data, this time using "max latency" from
the pgbench aggregated log. Of course, this time "lower is better" so
we'd like to move the CDF to the left (to get lower max latencies).
Sadly, it mostly changes in the other direction, i.e. the max latency
slightly increases (but the differences are not as significant as for
the tps rate, discussed in the previous paragraph). But apparently the
average latency actually improves (which gives us better tps).
Note: In this chart, x-axis is logarithmic.
3) tps-i5.png
Same chart with CDF of tps, but for the i5 workstation. This actually
shows the consistent slowdown due to the two patches: the tps
consistently shifts to the lower end (~2000 tps).
I do have some more data, but those are the most interesting charts. The
rest usually shows about the same thing (or nothing).
Overall, I'm not quite sure the patches actually achieve the intended
goals. On the 10k SAS drives I got better performance, but apparently
much more variable behavior. On SSDs, I get a bit worse results.
Also, I really wonder what will happen with non-default io schedulers. I
believe all the testing so far was done with cfq, so what happens on
machines that use e.g. "deadline" (as many DB machines actually do)?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
maxlat-xeon.png (image/png; binary chart data not reproduced)