pg_fallocate

Started by Mitsumasa KONDO, about 12 years ago, 4 messages
#1 Mitsumasa KONDO
kondo.mitsumasa@gmail.com
1 attachment(s)

Hi,

I'd like to add the fallocate() system call to improve sequential read/write
performance. fallocate() differs from posix_fallocate(), which uses a
zero-fill algorithm to reserve contiguous disk space. fallocate() reserves
contiguous disk space with much less overhead than posix_fallocate().

It will be needed for sorted checkpoints and a faster VACUUM command in
the near future.

For more detailed information, please see the Linux manual.
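
A minimal sketch of the difference described above, assuming Linux with
glibc. The helper name size_after_reserve and its use_posix switch are
hypothetical and not part of the patch; the point is only that
FALLOC_FL_KEEP_SIZE reserves space without changing the apparent file size,
while posix_fallocate() grows the file to the requested length.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* glibc only declares fallocate() with this */
#endif
#include <fcntl.h>
#include <linux/falloc.h>		/* FALLOC_FL_KEEP_SIZE */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not from the patch): create "path", reserve
 * nbytes for it, and return the apparent file size afterwards. Returns -1
 * if any step fails, e.g. when the filesystem does not support fallocate().
 */
static off_t
size_after_reserve(const char *path, off_t nbytes, int use_posix)
{
	struct stat st;
	int			rc;
	int			fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

	if (fd < 0)
		return -1;

	if (use_posix)
		rc = posix_fallocate(fd, 0, nbytes);	/* returns an errno value */
	else
		rc = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, nbytes); /* -1 and errno */

	if (rc != 0 || fstat(fd, &st) != 0)
	{
		close(fd);
		return -1;
	}

	close(fd);
	return st.st_size;
}
```

On a filesystem that supports both calls, the fallocate() variant leaves
st_size at 0 and the posix_fallocate() variant grows it to nbytes.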

I'm off sightseeing in Dublin with Ishii-san now :-)

Regards,

--

Mitsumasa KONDO

NTT Open Source Software

Attachments:

pg_fallocate_v0.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 06f5eb0..340e0fd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3408,7 +3408,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
-
+#if defined(HAVE_FALLOCATE)
+	fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, XLogSegSize);
+#endif
 	/*
 	 * Zero-fill the file.	We have to do this the hard way to ensure that all
 	 * the file space has really been allocated --- on platforms that allow
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index de4d902..afa3d24 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -383,6 +383,21 @@ pg_flush_data(int fd, off_t offset, off_t amount)
 	return 0;
 }
 
+/*
+ * pg_fallocate --- ask the OS to pre-allocate contiguous file segments
+ * on physical disk.
+ *
+ * Not all platforms have fallocate. Some platforms only have posix_fallocate,
+ * but it zero-fills to pre-allocate file segments. That performs poorly
+ * when extending new segments, so we don't use posix_fallocate.
+ */
+int
+pg_fallocate(File file, int flags, off_t offset, off_t nbytes)
+{
+#if defined(HAVE_FALLOCATE)
+	return fallocate(VfdCache[file].fd, flags, offset, nbytes);
+#endif
+}
 
 /*
  * fsync_fname -- fsync a file or directory, handling errors properly
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..fe6f640 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -24,6 +24,7 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
+#include <linux/falloc.h>
 
 #include "miscadmin.h"
 #include "access/xlog.h"
@@ -510,6 +511,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	 * if bufmgr.c had to dump another buffer of the same file to make room
 	 * for the new page's buffer.
 	 */
+
+	if(forknum == 1)
+		pg_fallocate(v->mdfd_vfd, FALLOC_FL_KEEP_SIZE, 0, RELSEG_SIZE);
+
 	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
 		ereport(ERROR,
 				(errcode_for_file_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5eac52d..43d8eaf 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -146,6 +146,9 @@
 /* Define to 1 if you have the <editline/readline.h> header file. */
 #undef HAVE_EDITLINE_READLINE_H
 
+/* Define to 1 if you have the 'fallocate' function. */
+#undef HAVE_FALLOCATE
+
 /* Define to 1 if you have the `fdatasync' function. */
 #undef HAVE_FDATASYNC
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 54db287..b8643fc 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -112,6 +112,9 @@
 /* Define to 1 if you have the <editline/readline.h> header file. */
 /* #undef HAVE_EDITLINE_READLINE_H */
 
+/* Define to 1 if you have the 'fallocate' function. */
+/* #undef HAVE_FALLOCATE */
+
 /* Define to 1 if you have the `fcvt' function. */
 #define HAVE_FCVT 1
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2a60229..5ac1f6a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,6 +113,7 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+extern int	pg_fallocate(File file, int flags, off_t offset, off_t amount);
 extern void fsync_fname(char *fname, bool isdir);
 
 /* Filename components for OpenTemporaryFile */
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Mitsumasa KONDO (#1)
Re: pg_fallocate

On Thu, Oct 31, 2013 at 9:16 AM, Mitsumasa KONDO
<kondo.mitsumasa@gmail.com> wrote:

> I'd like to add the fallocate() system call to improve sequential read/write
> performance. fallocate() differs from posix_fallocate(), which uses a
> zero-fill algorithm to reserve contiguous disk space. fallocate() reserves
> contiguous disk space with much less overhead than posix_fallocate().
>
> It will be needed for sorted checkpoints and a faster VACUUM command in
> the near future.
>
> For more detailed information, please see the Linux manual.
>
> I'm off sightseeing in Dublin with Ishii-san now :-)

Our last attempts to improve performance in this area died in a fire
when it turned out that code that should have been an improvement fell
down over inexplicable ext4 behavior. I think, therefore, that
extensive benchmarking of this or any other proposed approach is
absolutely essential.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Mitsumasa KONDO (#1)
Re: pg_fallocate

On 10/31/13, 9:16 AM, Mitsumasa KONDO wrote:

> I'd like to add the fallocate() system call to improve sequential read/write
> performance. fallocate() differs from posix_fallocate(), which uses a
> zero-fill algorithm to reserve contiguous disk space. fallocate() reserves
> contiguous disk space with much less overhead than posix_fallocate().

Your patch seems to be missing a bit that defines HAVE_FALLOCATE,
probably something in configure.in.

In reply to: Mitsumasa KONDO (#1)
Re: pg_fallocate

On Thu, Oct 31, 2013 at 01:16:44PM +0000, Mitsumasa KONDO wrote:

> --- a/src/backend/storage/file/fd.c
> +++ b/src/backend/storage/file/fd.c
> @@ -383,6 +383,21 @@ pg_flush_data(int fd, off_t offset, off_t amount)
>  	return 0;
>  }
> +/*
> + * pg_fallocate --- ask the OS to pre-allocate contiguous file segments
> + * on physical disk.
> + *
> + * Not all platforms have fallocate. Some platforms only have posix_fallocate,
> + * but it zero-fills to pre-allocate file segments. That performs poorly
> + * when extending new segments, so we don't use posix_fallocate.
> + */
> +int
> +pg_fallocate(File file, int flags, off_t offset, off_t nbytes)
> +{
> +#if defined(HAVE_FALLOCATE)
> +	return fallocate(VfdCache[file].fd, flags, offset, nbytes);
> +#endif
> +}

You should set errno to ENOSYS and return -1 if HAVE_FALLOCATE isn't
defined.
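
One way that suggestion could look, as a minimal sketch. It takes a raw file
descriptor instead of the patch's File/VfdCache lookup so that it compiles
standalone, and the name pg_fallocate_sketch is hypothetical. HAVE_FALLOCATE
is assumed to come from configure, so a plain compile takes the ENOSYS
branch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* glibc only declares fallocate() with this */
#endif
#include <errno.h>
#include <sys/types.h>
#ifdef HAVE_FALLOCATE
#include <fcntl.h>
#endif

/*
 * Hypothetical standalone variant of the patch's wrapper. Every code path
 * returns a value: without HAVE_FALLOCATE it fails the same way a missing
 * system call would, so callers can handle both cases uniformly.
 */
static int
pg_fallocate_sketch(int fd, int flags, off_t offset, off_t nbytes)
{
#ifdef HAVE_FALLOCATE
	return fallocate(fd, flags, offset, nbytes);
#else
	/* Parameters are unused on platforms without fallocate(). */
	(void) fd;
	(void) flags;
	(void) offset;
	(void) nbytes;

	errno = ENOSYS;
	return -1;
#endif
}
```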

> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -24,6 +24,7 @@
>  #include <unistd.h>
>  #include <fcntl.h>
>  #include <sys/file.h>
> +#include <linux/falloc.h>

This would have to be wrapped in #ifdef HAVE_FALLOCATE or
HAVE_LINUX_FALLOC_H; if you want to create a wrapper around fallocate() you
should add PG defines for the flags, too. Otherwise it's probably easier to
just call fallocate() directly inside an #ifdef block as you did in xlog.c.
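
A sketch of the guarded include plus a PostgreSQL-side flag name, for the
wrapper route. HAVE_LINUX_FALLOC_H is the configure symbol suggested above;
PG_FALLOC_KEEP_SIZE is a hypothetical name, and the fallback value is only a
placeholder for platforms where the flag is never passed to a real
fallocate().

```c
/* Only pull in the Linux-specific header where configure found it. */
#ifdef HAVE_LINUX_FALLOC_H
#include <linux/falloc.h>
#endif

/*
 * Hypothetical PG-level flag so that callers such as md.c never reference
 * FALLOC_FL_* directly. Where the Linux flag is unavailable the value is a
 * placeholder and is never handed to the kernel.
 */
#ifdef FALLOC_FL_KEEP_SIZE
#define PG_FALLOC_KEEP_SIZE FALLOC_FL_KEEP_SIZE
#else
#define PG_FALLOC_KEEP_SIZE 0x01
#endif
```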

> @@ -510,6 +511,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>  	 * if bufmgr.c had to dump another buffer of the same file to make room
>  	 * for the new page's buffer.
>  	 */
> +
> +	if(forknum == 1)
> +		pg_fallocate(v->mdfd_vfd, FALLOC_FL_KEEP_SIZE, 0, RELSEG_SIZE);
> +

The return value should be checked: if it's -1 and errno is anything other
than ENOSYS or EOPNOTSUPP, the disk space allocation failed and you must
return an error.
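
The rule above can be sketched as a small classifier. The function name is
hypothetical, and the caller is assumed to save errno immediately after the
fallocate() call.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a fallocate() result must be reported
 * as an error. "Not implemented" and "not supported on this filesystem" are
 * tolerated because preallocation is only an optimization; anything else
 * (for example ENOSPC) means the allocation itself failed.
 */
static bool
fallocate_failure_is_fatal(int rc, int saved_errno)
{
	if (rc == 0)
		return false;			/* allocation succeeded */

	if (saved_errno == ENOSYS || saved_errno == EOPNOTSUPP)
		return false;			/* unsupported here: carry on without it */

	return true;				/* real failure: caller must raise an error */
}
```

In mdextend() a true result would presumably lead to an ereport(ERROR, ...)
like the surrounding FileSeek() check; that wiring is not shown here.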

/ Oskari
