checkpoint writeback via sync_file_range

Started by Robert Haas about 14 years ago. 14 messages
#1 Robert Haas
robertmhaas@gmail.com
1 attachment

Greg Smith muttered a while ago about wanting to do something with
sync_file_range to improve checkpoint behavior on Linux. I thought he
was talking about trying to sync only the range of blocks known to be
dirty, which didn't seem like a very exciting idea, but after looking
at the man page for sync_file_range, I think I understand what he was
really going for: sync_file_range allows you to hint the Linux kernel
that you'd like it to clean a certain set of pages. I further recall
from Greg's previous comments that in the scenarios he's seen,
checkpoint I/O spikes are caused not so much by the data written out
by the checkpoint itself but from the other dirty data in the kernel
buffer cache. Based on that, I whipped up the attached patch, which,
if sync_file_range is available, simply iterates through everything
that will eventually be fsync'd before beginning the write phase and
tells the Linux kernel to put them all under write-out.
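
For readers who haven't used the call, here is a minimal standalone sketch
(not taken from the attached patch) of the kernel interface involved,
assuming Linux with glibc: with offset and nbytes both zero,
SYNC_FILE_RANGE_WRITE asks the kernel to start write-out of every dirty
page in the file without waiting for completion.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Start asynchronous write-out of all dirty pages of an already-open file. */
static int
start_writeback(int fd)
{
	/* offset = 0, nbytes = 0 covers the whole file from start to EOF */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
	{
		perror("sync_file_range");
		return -1;
	}
	return 0;
}

Unlike fsync, the call returns once write-out has been initiated; it does
not wait for the data to reach disk and does not flush file metadata.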

I don't know that I have a suitable place to test this, and I'm not
quite sure what a good test setup would look like either, so while
I've tested that this appears to issue the right kernel calls, I am
not sure whether it actually fixes the problem case. But here's the
patch, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

writeback-v1.patch (application/octet-stream)
diff --git a/configure b/configure
index af4f9a3..3da0771 100755
--- a/configure
+++ b/configure
@@ -19263,7 +19263,8 @@ fi
 
 
 
-for ac_func in cbrt dlopen fcvt fdatasync getifaddrs getpeerucred getrlimit memmove poll pstat readlink setproctitle setsid sigprocmask symlink sysconf towlower utime utimes waitpid wcstombs wcstombs_l
+
+for ac_func in cbrt dlopen fcvt fdatasync sync_file_range getifaddrs getpeerucred getrlimit memmove poll pstat readlink setproctitle setsid sigprocmask symlink sysconf towlower utime utimes waitpid wcstombs wcstombs_l
 do
 as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
diff --git a/configure.in b/configure.in
index 9cad436..2d1608d 100644
--- a/configure.in
+++ b/configure.in
@@ -1216,7 +1216,7 @@ PGAC_VAR_INT_TIMEZONE
 AC_FUNC_ACCEPT_ARGTYPES
 PGAC_FUNC_GETTIMEOFDAY_1ARG
 
-AC_CHECK_FUNCS([cbrt dlopen fcvt fdatasync getifaddrs getpeerucred getrlimit memmove poll pstat readlink setproctitle setsid sigprocmask symlink sysconf towlower utime utimes waitpid wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fcvt fdatasync sync_file_range getifaddrs getpeerucred getrlimit memmove poll pstat readlink setproctitle setsid sigprocmask symlink sysconf towlower utime utimes waitpid wcstombs wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..f6e20b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -70,6 +70,11 @@
 
 /* User-settable parameters */
 int			CheckPointSegments = 3;
+#ifdef USE_WRITEBACK
+bool		checkpoint_writeback = true;
+#else
+bool		checkpoint_writeback = false;
+#endif
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8f68bcc..0402f2c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1766,6 +1766,7 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+	smgrwriteback();
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 43bc43a..e0344ce 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -347,6 +347,21 @@ pg_flush_data(int fd, off_t offset, off_t amount)
 #endif
 }
 
+/*
+ * pg_writeback --- advise OS to begin writing out the file's dirty data
+ *
+ * Treat as noop if no OS support is available.
+ */
+int
+pg_writeback(int fd)
+{
+#if defined(HAVE_SYNC_FILE_RANGE)
+	return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#else
+	return 0;
+#endif
+}
+
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -1336,6 +1351,23 @@ retry:
 }
 
 int
+FileWriteback(File file)
+{
+	int			returnCode;
+
+	Assert(FileIsValid(file));
+
+	DO_DB(elog(LOG, "FileWriteback: %d (%s)",
+			   file, VfdCache[file].fileName));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	return pg_writeback(VfdCache[file].fd);
+}
+
+int
 FileSync(File file)
 {
 	int			returnCode;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bfc9f06..d85e9bf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -931,6 +931,95 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
+ *	mdwriteback() -- Initiate writeback of data to stable storage.
+ */
+void
+mdwriteback(void)
+{
+#ifdef USE_WRITEBACK
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+
+	/*
+	 * This is only called during checkpoints, and checkpoints should only
+	 * occur in processes that have created a pendingOpsTable.
+	 */
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	/* Scan the hashtable for fsync requests. */
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		SMgrRelation reln;
+		MdfdVec    *seg;
+
+		/*
+		 * If writeback is off then we don't have to bother opening the file at
+		 * all.  (We delay checking until this point so that changing this on
+		 * the fly behaves sensibly.)
+		 */
+		if (!checkpoint_writeback)
+			break;
+
+		/* Skip canceled entries. */
+		if (entry->canceled)
+			continue;
+
+		/* Absorb fsync requests so that the queue doesn't overflow. */
+		if (--absorb_counter <= 0)
+		{
+			AbsorbFsyncRequests();
+			absorb_counter = FSYNCS_PER_ABSORB;
+		}
+
+		/*
+		 * Find or create an smgr hash entry for this relation.  See
+		 * mdsync() a full explanation of why we go back through the smgr
+		 * mdsync() for a full explanation of why we go back through the smgr
+		 */
+		reln = smgropen(entry->tag.rnode.node, entry->tag.rnode.backend);
+
+		/*
+		 * It is possible that the relation has been dropped or truncated
+		 * since the fsync request was entered.  Since writeback is just a
+		 * performance optimization, there's no harm in just skipping the
+		 * segment if it turns out not to exist any more.
+		 */
+		seg = _mdfd_getseg(reln, entry->tag.forknum,
+						   entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
+						   false, EXTENSION_RETURN_NULL);
+		if (seg == NULL)
+			continue;
+
+		/*
+		 * Try to write it back.
+		 */
+		errno = FileWriteback(seg->mdfd_vfd);
+
+		/*
+		 * Since this is just a hint to the OS to get the file on disk,
+		 * there's no great harm if it fails.  Of course, failure here may be
+		 * a sign that the eventual fsync will also fail, but that's mdsync's
+		 * problem, not ours.
+		 */
+		if (errno != 0 && !FILE_POSSIBLY_DELETED(errno))
+		{
+			char	   *path;
+
+			path = _mdfd_segpath(reln, entry->tag.forknum,
+								 entry->tag.segno);
+			ereport(LOG,
+					(errcode_for_file_access(),
+				   errmsg("could not write back file \"%s\": %m", path)));
+		}
+	}
+#endif
+}
+
+/*
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5f87543..ccd952a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
+	void		(*smgr_writeback) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
@@ -67,7 +68,7 @@ static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
 		mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
-		mdpreckpt, mdsync, mdpostckpt
+		mdpreckpt, mdwriteback, mdsync, mdpostckpt
 	}
 };
 
@@ -533,6 +534,21 @@ smgrpreckpt(void)
 }
 
 /*
+ *	smgrwriteback() -- Initial writeback during checkpoint.
+ */
+void
+smgrwriteback(void)
+{
+	int			i;
+
+	for (i = 0; i < NSmgr; i++)
+	{
+		if (smgrsw[i].smgr_writeback)
+			(*(smgrsw[i].smgr_writeback)) ();
+	}
+}
+
+/*
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5c910dd..c1eec9c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -182,6 +182,7 @@ static bool check_phony_autocommit(bool *newval, void **extra, GucSource source)
 static bool check_debug_assertions(bool *newval, void **extra, GucSource source);
 static bool check_bonjour(bool *newval, void **extra, GucSource source);
 static bool check_ssl(bool *newval, void **extra, GucSource source);
+static bool check_checkpoint_writeback(bool *newval, void **extra, GucSource source);
 static bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
 static bool check_log_stats(bool *newval, void **extra, GucSource source);
 static bool check_canonical_path(char **newval, void **extra, GucSource source);
@@ -816,6 +817,25 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"checkpoint_writeback",
+#ifdef USE_WRITEBACK
+			PGC_SIGHUP,
+#else
+			PGC_INTERNAL,
+#endif
+			WAL_CHECKPOINTS,
+			gettext_noop("Initiates OS writeback of dirty data at checkpoint start."),
+			gettext_noop("If supported, dirty data for files the checkpoint will fsync is pushed into the OS write queue before the write phase begins.")
+		},
+		&checkpoint_writeback,
+#ifdef USE_WRITEBACK
+		true,
+#else
+		false,
+#endif
+		check_checkpoint_writeback, NULL, NULL
+	},
+	{
 		{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Continues processing past damaged page headers."),
 			gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
@@ -8374,6 +8394,19 @@ check_bonjour(bool *newval, void **extra, GucSource source)
 }
 
 static bool
+check_checkpoint_writeback(bool *newval, void **extra, GucSource source)
+{
+#ifndef USE_WRITEBACK
+	if (*newval)
+	{
+		GUC_check_errmsg("writeback is not supported by this build");
+		return false;
+	}
+#endif
+	return true;
+}
+
+static bool
 check_ssl(bool *newval, void **extra, GucSource source)
 {
 #ifndef USE_SSL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 315db46..bfaf0bc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -180,6 +180,7 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpoint_writeback = true	# false if async writeback not supported
 
 # - Archiving -
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 93622c4..a3fa124 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -186,6 +186,7 @@ extern bool reachedConsistency;
 
 /* these variables are GUC parameters related to XLOG */
 extern int	CheckPointSegments;
+extern bool	checkpoint_writeback;
 extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db84f49..adee3e4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -544,6 +544,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `sync_file_range' function. */
+#undef HAVE_SYNC_FILE_RANGE
+
 /* Define to 1 if you have the `sysconf' function. */
 #undef HAVE_SYSCONF
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index ac45ee6..05cf2f2 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -129,14 +129,23 @@
 
 /*
  * USE_PREFETCH code should be compiled only if we have a way to implement
- * prefetching.  (This is decoupled from USE_POSIX_FADVISE because there
- * might in future be support for alternative low-level prefetch APIs.)
+ * prefetching.  (This is decoupled from HAVE_FILE_SYNC_RANGE because there
+ * might in future be support for alternative low-level writeback APIs.)
  */
 #ifdef USE_POSIX_FADVISE
 #define USE_PREFETCH
 #endif
 
 /*
+ * USE_WRITEBACK code should be compiled only if we have a way to implement
+ * writeback.  (This is decoupled from HAVE_SYNC_FILE_RANGE because there
+ * might in future be support for alternative low-level writeback APIs.)
+ */
+#ifdef HAVE_SYNC_FILE_RANGE
+#define USE_WRITEBACK
+#endif
+
+/*
  * This is the default directory in which AF_UNIX socket files are
  * placed.	Caution: changing this risks breaking your existing client
  * applications, which are likely to continue to look in the old
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 22e7fe8..dbf74c0 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -66,6 +66,7 @@ extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount);
 extern int	FileRead(File file, char *buffer, int amount);
 extern int	FileWrite(File file, char *buffer, int amount);
+extern int	FileWriteback(File file);
 extern int	FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
 extern int	FileTruncate(File file, off_t offset);
@@ -100,6 +101,7 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+extern int	pg_writeback(int fd);
 
 /* Filename components for OpenTemporaryFile */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 46c8402..37c76fb 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,6 +95,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
+extern void smgrwriteback(void);
 extern void smgrsync(void);
 extern void smgrpostckpt(void);
 
@@ -120,6 +121,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
+extern void mdwriteback(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
#2 Greg Smith
greg@2ndQuadrant.com
In reply to: Robert Haas (#1)
Re: checkpoint writeback via sync_file_range

On 1/10/12 9:14 PM, Robert Haas wrote:

> Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.

I hadn't really thought of using it that way. The kernel expects that
when this is called the normal way, you're going to track exactly which
segments you want it to sync. And that data isn't really passed through
the fsync absorption code yet; the list of things to fsync has already
lost that level of detail.

What you're doing here doesn't care though, and I hadn't considered that
SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
docs. Used this way, it's basically fsync without the wait or
guarantee; it just tries to push what's already dirty further ahead of
the write queue than those writes would otherwise be.

One idea I was thinking about here was building a little hash table
inside of the fsync absorb code, tracking how many absorb operations
have happened for whatever the most popular relation files are. The
idea is that we might say "use sync_file_range every time <N> calls for
a relation have come in", just to keep from ever accumulating too many
writes to any one file before trying to nudge some of it out of there.
The bat that keeps hitting me in the head here is that right now, a
single fsync might have a full 1GB of writes to flush out, perhaps
because it extended a table and then wrote more than that to it. And in
everything but a SSD or giant SAN cache situation, 1GB of I/O is just
too much to fsync at a time without the OS choking a little on it.
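
A rough sketch of that counting scheme, with the structure and names
invented here for illustration (nothing like this exists in the attached
patch); the real thing would presumably live in the fsync-absorb hash table:

#define _GNU_SOURCE
#include <fcntl.h>

#define WRITEBACK_NUDGE_THRESHOLD 128	/* absorbed requests before a nudge */

/* Hypothetical per-segment bookkeeping kept alongside the fsync queue. */
typedef struct SegmentAbsorbState
{
	int			fd;			/* open descriptor for the segment file */
	int			absorbed;	/* fsync requests absorbed since the last nudge */
} SegmentAbsorbState;

/* Called each time an fsync request for this segment is absorbed. */
static void
absorb_and_maybe_nudge(SegmentAbsorbState *seg)
{
	if (++seg->absorbed >= WRITEBACK_NUDGE_THRESHOLD)
	{
		/* Best effort: push whatever is dirty so far into write-out. */
		(void) sync_file_range(seg->fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		seg->absorbed = 0;
	}
}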

> I don't know that I have a suitable place to test this, and I'm not
> quite sure what a good test setup would look like either, so while
> I've tested that this appears to issue the right kernel calls, I am
> not sure whether it actually fixes the problem case.

I'll put this into my testing queue after the upcoming CF starts.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#2)
Re: checkpoint writeback via sync_file_range

On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:

> On 1/10/12 9:14 PM, Robert Haas wrote:
>
>> Based on that, I whipped up the attached patch, which,
>> if sync_file_range is available, simply iterates through everything
>> that will eventually be fsync'd before beginning the write phase and
>> tells the Linux kernel to put them all under write-out.
>
> I hadn't really thought of using it that way.  The kernel expects that when
> this is called the normal way, you're going to track exactly which segments
> you want it to sync.  And that data isn't really passed through the fsync
> absorption code yet; the list of things to fsync has already lost that level
> of detail.
>
> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead of the write queue
> than those writes would otherwise be.

I don't think this will help at all, I think it will just make things worse.

The problem comes from hammering the fsyncs one after the other. What
this patch does is initiate all of the fsyncs at the same time, so it
will max out the disks even more because this will hit all disks all
at once.

It does open the door to various other uses, so I think this work will
be useful.

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but a SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

A better idea. Seems like it should be easy enough to keep a counter.

I see some other uses around large writes also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#4 Florian Weimer
fweimer@bfk.de
In reply to: Greg Smith (#2)
Re: checkpoint writeback via sync_file_range

* Greg Smith:

> One idea I was thinking about here was building a little hash table
> inside of the fsync absorb code, tracking how many absorb operations
> have happened for whatever the most popular relation files are. The
> idea is that we might say "use sync_file_range every time <N> calls
> for a relation have come in", just to keep from ever accumulating too
> many writes to any one file before trying to nudge some of it out of
> there. The bat that keeps hitting me in the head here is that right
> now, a single fsync might have a full 1GB of writes to flush out,
> perhaps because it extended a table and then wrote more than that to
> it. And in everything but a SSD or giant SAN cache situation, 1GB of
> I/O is just too much to fsync at a time without the OS choking a
> little on it.

Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
to pretty low values, and it seems to help smooth out the checkpoints.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#3)
Re: checkpoint writeback via sync_file_range

On Wed, Jan 11, 2012 at 9:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> It does open the door to various other uses, so I think this work will
> be useful.

Yes, I think this would allow a better design for the checkpointer.

Checkpoint scan will collect buffers to write for checkpoint and sort
them by fileid, like Koichi/Itagaki already suggested.

We then do all the writes for a particular file, then issue a
background sync_file_range, then sleep a little. Loop. At end of loop,
collect up and close the sync_file_range calls with a
SYNC_FILE_RANGE_WAIT_AFTER.

So we're interleaving the writes and fsyncs throughout the whole
checkpoint, not bursting the fsyncs at the end.
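
A sketch of that loop, with the iteration helpers invented purely for
illustration (only the sync_file_range calls correspond to a real API);
note that sync_file_range by itself gives no durability guarantee, so a
conventional fsync pass would still be needed to close out the checkpoint:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helpers, named only for this sketch. */
extern int	next_checkpoint_file(void);				/* returns -1 when done */
extern void rewind_checkpoint_file_list(void);
extern void write_dirty_buffers_for_file(int fd);	/* buffers sorted by file */

static void
checkpoint_write_phase(void)
{
	int			fd;

	/* Pass 1: write each file's buffers, then start background writeback. */
	while ((fd = next_checkpoint_file()) >= 0)
	{
		write_dirty_buffers_for_file(fd);
		(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		usleep(10000);			/* sleep a little between files */
	}

	/* Pass 2: collect up, waiting for the write-out started above. */
	rewind_checkpoint_file_list();
	while ((fd = next_checkpoint_file()) >= 0)
		(void) sync_file_range(fd, 0, 0,
							   SYNC_FILE_RANGE_WAIT_BEFORE |
							   SYNC_FILE_RANGE_WRITE |
							   SYNC_FILE_RANGE_WAIT_AFTER);
}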

With that design we would just have a continuous checkpoint, rather
than a checkpoint_completion_target of 0.5 or 0.9.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: checkpoint writeback via sync_file_range

On Wednesday, January 11, 2012 03:14:31 AM Robert Haas wrote:

> Greg Smith muttered a while ago about wanting to do something with
> sync_file_range to improve checkpoint behavior on Linux. I thought he
> was talking about trying to sync only the range of blocks known to be
> dirty, which didn't seem like a very exciting idea, but after looking
> at the man page for sync_file_range, I think I understand what he was
> really going for: sync_file_range allows you to hint the Linux kernel
> that you'd like it to clean a certain set of pages. I further recall
> from Greg's previous comments that in the scenarios he's seen,
> checkpoint I/O spikes are caused not so much by the data written out
> by the checkpoint itself but from the other dirty data in the kernel
> buffer cache. Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.

I played around with this before and my problem was that sync_file_range is not
really a hint. It actually starts writeback *directly* and only returns when
the I/O is placed inside the queue (at least that's the way it was back then),
which very quickly leads to it blocking all the time...

Andres

#7 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#3)
Re: checkpoint writeback via sync_file_range

On Wednesday, January 11, 2012 10:28:11 AM Simon Riggs wrote:

> On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>
>> On 1/10/12 9:14 PM, Robert Haas wrote:
>>
>>> Based on that, I whipped up the attached patch, which,
>>> if sync_file_range is available, simply iterates through everything
>>> that will eventually be fsync'd before beginning the write phase and
>>> tells the Linux kernel to put them all under write-out.
>>
>> I hadn't really thought of using it that way. The kernel expects that
>> when this is called the normal way, you're going to track exactly which
>> segments you want it to sync. And that data isn't really passed through
>> the fsync absorption code yet; the list of things to fsync has already
>> lost that level of detail.
>>
>> What you're doing here doesn't care though, and I hadn't considered that
>> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
>> docs. Used this way, it's basically fsync without the wait or guarantee;
>> it just tries to push what's already dirty further ahead of the write
>> queue than those writes would otherwise be.
>
> I don't think this will help at all, I think it will just make things
> worse.
>
> The problem comes from hammering the fsyncs one after the other. What
> this patch does is initiate all of the fsyncs at the same time, so it
> will max out the disks even more because this will hit all disks all
> at once.

The advantage of using sync_file_range that way is that it starts writeout but
doesn't cause queue drains/barriers/whatever to be issued, which can be quite a
significant speed gain. In theory.

Andres

#8 Andres Freund
andres@anarazel.de
In reply to: Florian Weimer (#4)
Re: checkpoint writeback via sync_file_range

On Wednesday, January 11, 2012 10:33:47 AM Florian Weimer wrote:

> * Greg Smith:
>
>> One idea I was thinking about here was building a little hash table
>> inside of the fsync absorb code, tracking how many absorb operations
>> have happened for whatever the most popular relation files are. The
>> idea is that we might say "use sync_file_range every time <N> calls
>> for a relation have come in", just to keep from ever accumulating too
>> many writes to any one file before trying to nudge some of it out of
>> there. The bat that keeps hitting me in the head here is that right
>> now, a single fsync might have a full 1GB of writes to flush out,
>> perhaps because it extended a table and then wrote more than that to
>> it. And in everything but a SSD or giant SAN cache situation, 1GB of
>> I/O is just too much to fsync at a time without the OS choking a
>> little on it.
>
> Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
> to pretty low values, and it seems to help smooth out the checkpoints.

If done correctly (in a much more invasive way) you could issue
sync_file_range only for the areas of the file that the checkpoint actually
needs written, and leave out e.g. hint-bit-only changes. That could help
reduce the cost of checkpoints.
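
A sketch of what such a range-limited call could look like; the helper and
the hard-coded block size are illustrative only, not taken from the patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/* Start write-out of just one contiguous run of blocks, without waiting. */
static int
writeback_block_run(int fd, unsigned int first_blkno, unsigned int nblocks)
{
	off_t		offset = (off_t) first_blkno * BLCKSZ;
	off_t		nbytes = (off_t) nblocks * BLCKSZ;

	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}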

Andres

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#2)
Re: checkpoint writeback via sync_file_range

On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith <greg@2ndquadrant.com> wrote:

> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead of the write queue
> than those writes would otherwise be.

Well, my goal was to make sure they got into the write queue rather
than just sitting in memory while the kernel twiddles its thumbs. My
hope is that the kernel is smart enough that, when you put something
under write-out, the kernel writes it out as quickly as it can without
causing too much degradation in foreground activity. If that turns
out to be an incorrect assumption, we'll need a different approach,
but I thought it might be worth trying something simple first and
seeing what happens.

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but a SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

That's not a bad idea, but there's definitely some potential downside:
you might end up reducing write-combining quite significantly if
you keep pushing things out to files when it isn't really needed yet.
I was aiming to only push things out when we're 100% sure that they're
going to have to be fsync'd, and certainly any already-written buffers
that are in the OS cache at the start of a checkpoint fall into that
category. That having been said, experimental evidence is king.

> I'll put this into my testing queue after the upcoming CF starts.

Thanks!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Greg Smith
greg@2ndQuadrant.com
In reply to: Florian Weimer (#4)
Re: checkpoint writeback via sync_file_range

On 1/11/12 4:33 AM, Florian Weimer wrote:

> Isn't this pretty much like tuning vm.dirty_bytes? We generally set it
> to pretty low values, and it seems to help smooth out the checkpoints.

When I experimented with dropping the actual size of the cache,
checkpoint spikes improved, but things like VACUUM ran terribly slow.
On a typical medium to large server nowadays (let's say 16GB+),
PostgreSQL needs to have gigabytes of write cache for good performance.

What we're aiming for here is to keep the benefits of having that much
write cache, while allowing checkpoint-related work to send increasingly
strong suggestions about ordering what it needs written soon. There's
basically three primary states on Linux to be concerned about here:

Dirty: in the cache via standard write
|
v pdflush does writeback at 5 or 10% dirty || sync_file_range push
|
Writeback
|
v write happens in the background || fsync call
|
Stored on disk

The systems with bad checkpoint problems will typically have gigabytes
"Dirty", which is necessary for good performance. It's very lazy about
pushing things toward "Writeback" though. Getting the oldest portions
of the outstanding writes into the Writeback queue more aggressively
should make the eventual fsync less likely to block.
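
Those first two states can be watched directly while testing; a small
observer, assuming a Linux /proc/meminfo with the usual "Dirty:" and
"Writeback:" lines:

#include <stdio.h>
#include <string.h>

/* Print the kernel's current Dirty and Writeback page-cache totals. */
int
main(void)
{
	char		line[256];
	FILE	   *f = fopen("/proc/meminfo", "r");

	if (f == NULL)
	{
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f) != NULL)
	{
		if (strncmp(line, "Dirty:", 6) == 0 ||
			strncmp(line, "Writeback:", 10) == 0)
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}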

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#11 Greg Smith
greg@2ndQuadrant.com
In reply to: Andres Freund (#6)
Re: checkpoint writeback via sync_file_range

On 1/11/12 7:46 AM, Andres Freund wrote:

> I played around with this before and my problem was that sync_file_range is not
> really a hint. It actually starts writeback *directly* and only returns when
> the I/O is placed inside the queue (at least that's the way it was back then),
> which very quickly leads to it blocking all the time...

Right, you're answering one of Robert's questions here: yes, once
something is pushed toward writeback, it moves toward an actual write
extremely fast. And the writeback queue can fill itself. But we don't
really care if this blocks. There's a checkpointer process, it will be
doing this work, and it has no other responsibilities anymore (as of
9.2, which is why some of these approaches suddenly become practical).
It's going to get blocked waiting for things sometimes, the way it
already does rarely when it writes, and often when it calls fsync.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#12 Andres Freund
andres@anarazel.de
In reply to: Greg Smith (#11)
Re: checkpoint writeback via sync_file_range

On Wednesday, January 11, 2012 03:20:09 PM Greg Smith wrote:

> On 1/11/12 7:46 AM, Andres Freund wrote:
>
>> I played around with this before and my problem was that sync_file_range
>> is not really a hint. It actually starts writeback *directly* and only
>> returns when the I/O is placed inside the queue (at least that's the way
>> it was back then), which very quickly leads to it blocking all the
>> time...
>
> Right, you're answering one of Robert's questions here: yes, once
> something is pushed toward writeback, it moves toward an actual write
> extremely fast. And the writeback queue can fill itself. But we don't
> really care if this blocks. There's a checkpointer process, it will be
> doing this work, and it has no other responsibilities anymore (as of
> 9.2, which is why some of these approaches suddenly become practical).
> It's going to get blocked waiting for things sometimes, the way it
> already does rarely when it writes, and often when it calls fsync.

We do care imo. The heavy pressure of putting it directly in the writeback queue
leads to less efficient I/O because quite often it won't reorder sensibly with
other I/O anymore and the like. At least that was my experience in using it
in another application.
Lots of that changed with Linux 3.2 (near complete rewrite of the writeback
mechanism), so a bit of that might be moot anyway.

I definitely agree that 9.2 opens new possibilities there.

Andres

#13 Greg Smith
greg@2ndQuadrant.com
In reply to: Andres Freund (#12)
Re: checkpoint writeback via sync_file_range

On 1/11/12 9:25 AM, Andres Freund wrote:

> The heavy pressure of putting it directly in the writeback queue
> leads to less efficient I/O because quite often it won't reorder sensibly with
> other I/O anymore and the like. At least that was my experience in using it
> in another application.

Sure, this is one of the things I was cautioning about in the Double
Writes thread, with VACUUM being the worst such case I've measured.

The thing to realize here is that the data we're talking about must be
flushed to disk in the near future. And Linux will happily cache
gigabytes of it. Right now, the database asks for that to be forced to
disk via fsync, which means in chunks that can be large as a gigabyte.

Let's say we have a traditional storage array and there's competing
activity. 10MB/s would be a good random I/O write rate in that
situation. A single fsync that forces 1GB out at that rate will take
*100 seconds*. And I've seen exactly that when trying to--about 80
seconds is my current worst checkpoint stall ever.

And we don't have a latency vs. throughput knob any finer than that. If
one is added, and you turn it too far toward latency, throughput is
going to tank for the reasons you've also seen. Less reordering,
elevator sorting, and write combining. If the database isn't going to
micro-manage the writes, it needs to give the OS room to do that work
for it.

The most popular OS level approach to adjusting for this trade-off seems
to be "limit the cache size". That hasn't worked out very well when
I've tried it, again getting back to not having enough working room for
writes queued to reorganize them usefully. One theory I've considered
is that we might improve the VACUUM side of that using the same
auto-tuning approach that's been applied to two other areas now: scale
the maximum size of the ring buffers based on shared_buffers. I'm not
real confident in that idea though, because ultimately it won't change
the rate at which dirty buffers from VACUUM are evicted--and that's the
source of the bottleneck in that area.

There is one piece of information the database knows, but it isn't
communicating well to the OS yet. It could do a better job of advising
the OS how to prioritize the writes that must happen soon--but not necessarily
right now. Yes, forcing them into write-back will be counterproductive
from a throughput perspective. The longer they sit at the "Dirty" cache
level above that, the better the odds they'll be done efficiently. But
this is the checkpoint process we're talking about here. It's going to
force the information to disk soon regardless. An intermediate step
pushing to write-back should give the OS a bit more room to move around
than fsync does, making the potential for a latency gain here seem quite
real. We'll see how the benchmarking goes.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#14 Jeff Janes
jeff.janes@gmail.com
In reply to: Greg Smith (#13)
Re: checkpoint writeback via sync_file_range

On Thu, Jan 12, 2012 at 7:26 PM, Greg Smith <greg@2ndquadrant.com> wrote:

> On 1/11/12 9:25 AM, Andres Freund wrote:
>
>> The heavy pressure of putting it directly in the writeback queue
>> leads to less efficient I/O because quite often it won't reorder sensibly
>> with other I/O anymore and the like. At least that was my experience in
>> using it in another application.
>
> Sure, this is one of the things I was cautioning about in the Double Writes
> thread, with VACUUM being the worst such case I've measured.
>
> The thing to realize here is that the data we're talking about must be
> flushed to disk in the near future.  And Linux will happily cache gigabytes
> of it.  Right now, the database asks for that to be forced to disk via
> fsync, which means in chunks that can be as large as a gigabyte.
>
> Let's say we have a traditional storage array and there's competing
> activity.  10MB/s would be a good random I/O write rate in that situation.
> A single fsync that forces 1GB out at that rate will take *100 seconds*.
> And I've seen exactly that happen--about 80 seconds is my current
> worst checkpoint stall ever.
>
> And we don't have a latency vs. throughput knob any finer than that.  If one
> is added, and you turn it too far toward latency, throughput is going to
> tank for the reasons you've also seen.  Less reordering, elevator sorting,
> and write combining.  If the database isn't going to micro-manage the
> writes, it needs to give the OS room to do that work for it.

Are there any IO benchmarking tools out there that benchmark the
effects of reordering, elevator sorting, write combining, etc.?

What I've seen is basically either "completely sequential" or
"completely random" with not much in between.

Cheers,

Jeff