Write lifetime hints for NVMe

Started by Dmitry Dolgov · 5 messages
#1 Dmitry Dolgov
9erthalion6@gmail.com
1 attachment(s)

Hi,

Some time ago, support for write lifetime hints for NVMe multi-streaming
was merged into the Linux kernel [1]. In theory this allows data written
together to be placed on media together, so that it can also be erased
together, which minimizes garbage collection, resulting in reduced write
amplification as well as more efficient flash utilization [2]. I couldn't
find any discussion about this on hackers, so I decided to experiment with
the feature a bit. My idea was to test a quite naive approach where all file
descriptors related to temporary files get `RWH_WRITE_LIFE_SHORT` assigned,
and the rest get `RWH_WRITE_LIFE_EXTREME`. The attached patch is a dead simple
POC without any infrastructure around it to enable/disable the hints.

It turns out that it's possible to run benchmarks on some EC2 instance
types (e.g. c5) with a suitable kernel version, since they expose a volume
as an NVMe device:

```
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     vol01cdbc7ec86f17346 Amazon Elastic Block Store               1         0.00 B / 8.59 GB           512 B + 0 B      1.0
```

To get some baseline results I've run several rounds of pgbench on these quite
modest instances (dedicated, EBS-optimized) with a slightly adjusted
`max_wal_size` and otherwise the default configuration:

```
$ pgbench -s 200 -i
$ pgbench -T 600 -c 2 -j 2
```

Analyzing the `strace` output, I can see that during this test there was a
significant number of operations on pg_stat_tmp and xlogtemp files, so I
assume write lifetime hints should have some effect.

As a result I got a latency reduction of about 5-8% (though so far these
numbers are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?

* Has anyone else seen similar results?

* Any suggestions about what the best/worst case scenarios for this kind of
hint might be?
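On the first question, one possible shape for the missing enable/disable
infrastructure is a single mapping from file class to hint, gated by a knob.
This is a hypothetical sketch, not part of the patch; the names are
illustrative, and the numeric values are the kernel's RWH_* constants:

```c
#include <stdint.h>

/* Hypothetical classification of PostgreSQL files by expected lifetime. */
typedef enum FileHintClass
{
	FILE_HINT_TEMP,				/* sort/hash spill files, xlogtemp, ... */
	FILE_HINT_WAL,				/* WAL segments: recycled relatively quickly */
	FILE_HINT_DATA				/* heap/index files: long-lived */
} FileHintClass;

/* Knob to disable hinting; in a real patch this would be a boolean GUC. */
static int	enable_write_hints = 1;

/*
 * Map a file class to a write lifetime hint. The returned value is what
 * would be passed (by pointer, as a uint64_t) to fcntl(F_SET_RW_HINT).
 */
static uint64_t
write_hint_for_class(FileHintClass class)
{
	if (!enable_write_hints)
		return 0;				/* RWH_WRITE_LIFE_NOT_SET */

	switch (class)
	{
		case FILE_HINT_TEMP:
			return 2;			/* RWH_WRITE_LIFE_SHORT */
		case FILE_HINT_WAL:
			return 3;			/* RWH_WRITE_LIFE_MEDIUM */
		case FILE_HINT_DATA:
		default:
			return 5;			/* RWH_WRITE_LIFE_EXTREME */
	}
}
```

With something like this, PathNameOpenFile() and friends could take a file
class instead of threading a hint pointer through every signature.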

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf

Attachments:

nvme_write_lifetime_poc.patch (application/octet-stream)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 71516a9a5a..8823c4515c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1357,7 +1357,14 @@ FileInvalidate(File file)
 File
 PathNameOpenFile(const char *fileName, int fileFlags)
 {
-	return PathNameOpenFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT);
+	RWFWriteLifeHint hint = RWF_WRITE_LIFE_EXTREME;
+	return PathNameOpenFileHint(fileName, fileFlags, &hint);
+}
+
+File
+PathNameOpenFileHint(const char *fileName, int fileFlags, RWFWriteLifeHint *hint)
+{
+	return PathNameOpenFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT, hint);
 }
 
 /*
@@ -1368,11 +1375,12 @@ PathNameOpenFile(const char *fileName, int fileFlags)
  * (which should always be $PGDATA when this code is running).
  */
 File
-PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
+PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode, RWFWriteLifeHint *hint)
 {
-	char	   *fnamecopy;
-	File		file;
-	Vfd		   *vfdP;
+	char			 *fnamecopy;
+	File	   		  file;
+	Vfd		   		 *vfdP;
+	RWFWriteLifeHint defaulthint = RWF_WRITE_LIFE_EXTREME;
 
 	DO_DB(elog(LOG, "PathNameOpenFilePerm: %s %x %o",
 			   fileName, fileFlags, fileMode));
@@ -1407,6 +1415,11 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	DO_DB(elog(LOG, "PathNameOpenFile: success %d",
 			   vfdP->fd));
 
+	if (hint != NULL)
+		fcntl(vfdP->fd, F_SET_RW_HINT, hint);
+	else
+		fcntl(vfdP->fd, F_SET_RW_HINT, &defaulthint);
+
 	Insert(file);
 
 	vfdP->fileName = fnamecopy;
@@ -1577,9 +1590,10 @@ TempTablespacePath(char *path, Oid tablespace)
 static File
 OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 {
-	char		tempdirpath[MAXPGPATH];
-	char		tempfilepath[MAXPGPATH];
-	File		file;
+	char			 tempdirpath[MAXPGPATH];
+	char			 tempfilepath[MAXPGPATH];
+	File			 file;
+	RWFWriteLifeHint hint = RWF_WRITE_LIFE_SHORT;
 
 	TempTablespacePath(tempdirpath, tblspcOid);
 
@@ -1594,8 +1608,9 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	 * Open the file.  Note: we don't use O_EXCL, in case there is an orphaned
 	 * temp file that can be reused.
 	 */
-	file = PathNameOpenFile(tempfilepath,
-							O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	file = PathNameOpenFileHint(tempfilepath,
+								O_RDWR | O_CREAT | O_TRUNC | PG_BINARY,
+								&hint);
 	if (file <= 0)
 	{
 		/*
@@ -1608,8 +1623,8 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 		 */
 		mkdir(tempdirpath, S_IRWXU);
 
-		file = PathNameOpenFile(tempfilepath,
-								O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		file = PathNameOpenFileHint(tempfilepath,
+									O_RDWR | O_CREAT | O_TRUNC | PG_BINARY, &hint);
 		if (file <= 0 && rejectError)
 			elog(ERROR, "could not create temporary file \"%s\": %m",
 				 tempfilepath);
@@ -1634,7 +1649,8 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 File
 PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 {
-	File		file;
+	File			 file;
+	RWFWriteLifeHint hint = RWF_WRITE_LIFE_SHORT;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
@@ -1642,7 +1658,7 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 	 * Open the file.  Note: we don't use O_EXCL, in case there is an orphaned
 	 * temp file that can be reused.
 	 */
-	file = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	file = PathNameOpenFileHint(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY, &hint);
 	if (file <= 0)
 	{
 		if (error_on_failure)
@@ -1672,12 +1688,13 @@ PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
 File
 PathNameOpenTemporaryFile(const char *path)
 {
-	File		file;
+	File			 file;
+	RWFWriteLifeHint hint = RWF_WRITE_LIFE_SHORT;
 
 	ResourceOwnerEnlargeFiles(CurrentResourceOwner);
 
 	/* We open the file read-only. */
-	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	file = PathNameOpenFileHint(path, O_RDONLY | PG_BINARY, &hint);
 
 	/* If no such file, then we don't raise an error. */
 	if (file <= 0 && errno != ENOENT)
@@ -2401,7 +2418,8 @@ OpenTransientFile(const char *fileName, int fileFlags)
 int
 OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 {
-	int			fd;
+	int				 fd;
+	RWFWriteLifeHint hint = RWF_WRITE_LIFE_SHORT;
 
 	DO_DB(elog(LOG, "OpenTransientFile: Allocated %d (%s)",
 			   numAllocatedDescs, fileName));
@@ -2427,6 +2445,7 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 		desc->create_subid = GetCurrentSubTransactionId();
 		numAllocatedDescs++;
 
+		fcntl(fd, F_SET_RW_HINT, &hint);
 		return fd;
 	}
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index db5ca16679..962d0af824 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -41,6 +41,17 @@
 
 #include <dirent.h>
 
+#define F_LINUX_SPECIFIC_BASE 1024
+#define F_SET_RW_HINT         (F_LINUX_SPECIFIC_BASE + 12)
+
+typedef enum RWFWriteLifeHint {
+	RWF_WRITE_LIFE_NOT_SET = 0, // No hint information set
+	RWF_WRITE_LIFE_NONE,        // No hints about write life time
+	RWF_WRITE_LIFE_SHORT,       // Data written has a short life time
+	RWF_WRITE_LIFE_MEDIUM,      // Data written has a medium life time
+	RWF_WRITE_LIFE_LONG,        // Data written has a long life time
+	RWF_WRITE_LIFE_EXTREME,     // Data written has an extremely long life time
+} RWFWriteLifeHint;
 
 /*
  * FileSeek uses the standard UNIX lseek(2) flags.
@@ -64,7 +75,8 @@ extern int	max_safe_fds;
 
 /* Operations on virtual Files --- equivalent to Unix kernel file ops */
 extern File PathNameOpenFile(const char *fileName, int fileFlags);
-extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File PathNameOpenFileHint(const char *fileName, int fileFlags, RWFWriteLifeHint *hint);
+extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode, RWFWriteLifeHint *hint);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dmitry Dolgov (#1)
Re: Write lifetime hints for NVMe

On 01/27/2018 02:20 PM, Dmitry Dolgov wrote:

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```

Aren't those numbers far lower than you'd expect from NVMe storage? I do
have a NVMe drive (Intel 750) in my machine, and I can do thousands of
transactions on it with two clients. Seems a bit suspicious.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#2)
Re: Write lifetime hints for NVMe

On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Aren't those numbers far lower than you'd expect from NVMe storage? I do
have a NVMe drive (Intel 750) in my machine, and I can do thousands of
transactions on it with two clients. Seems a bit suspicious.

Maybe NVMe storage can provide much higher numbers in general, but there are
resource limits imposed by AWS itself. I was using c5.large, which is the
smallest possible instance of the c5 type, so maybe that explains the absolute
numbers - but I can recheck anyway, just in case I missed something.

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dmitry Dolgov (#3)
Re: Write lifetime hints for NVMe

On 01/27/2018 08:06 PM, Dmitry Dolgov wrote:

On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Aren't those numbers far lower than you'd expect from NVMe storage? I do
have a NVMe drive (Intel 750) in my machine, and I can do thousands of
transactions on it with two clients. Seems a bit suspicious.

Maybe NVMe storage can provide much higher numbers in general, but there are
resource limits imposed by AWS itself. I was using c5.large, which is the
smallest possible instance of the c5 type, so maybe that explains the absolute
numbers - but I can recheck anyway, just in case I missed something.

According to [1] the C5 instances don't have actual NVMe devices (say,
storage in a PCIe slot or connected via M.2) but EBS volumes exposed as
NVMe devices. That would certainly explain the low IOPS numbers, as
EBS has built-in throttling. I don't know how many of the NVMe features
this EBS variant supports.

Amazon actually does provide instance types (f1 and i3) with real NVMe
devices. That's what I'd be testing.

I can do some testing on my system with NVMe storage, to see if there
really is any change thanks to the patch.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#4)
Re: Write lifetime hints for NVMe

On 27 January 2018 at 23:53, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Amazon actually does provide instance types (f1 and i3) with real NVMe
devices. That's what I'd be testing.

Yes, indeed, that's a better target for testing, thanks. I'll write back when
I get some results.