io_uring support
Hi,
I've been following the new Linux IO interface "io_uring" for some time now;
it was introduced relatively recently [1]. The short description says:
Shared application/kernel submission and completion ring pairs, for
supporting fast/efficient IO.
For us the important part is probably that it's asynchronous IO that works
not only with O_DIRECT, but also with buffered access. Plus there are claims
that it's pretty efficient (efficiency was one of the design goals [2]).
The interface consists of submission/completion queues and data structures,
shared between an application and the kernel. To facilitate application
development there is also a nice library, liburing, for using io_uring from
user space [3].
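
To give an idea of the moving parts, here is a minimal sketch of the pattern
liburing enables (my illustration only, assuming an already opened file
descriptor fd, a destination buffer buf and an offset; error handling
omitted):

    #include <liburing.h>

    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iov = { .iov_base = buf, .iov_len = BLCKSZ };

    /* set up the shared submission/completion rings, 64 entries */
    io_uring_queue_init(64, &ring, 0);

    /* grab a submission entry and describe an asynchronous read */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, offset);

    /* hand everything queued so far to the kernel in one syscall */
    io_uring_submit(&ring);

    /* later: wait for a completion; cqe->res is bytes read or -errno */
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
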
Since I haven't found many discussions about async IO in the hackers
archives, out of curiosity I decided to prepare an experimental patch to see
what it would look like to use io_uring in PostgreSQL. So far I've tested
this patch only inside a qemu VM on the latest io_uring branch from the
linux-block tree.
The result is relatively simple, and introduces a new interface
(smgrqueueread, smgrsubmitread and smgrwaitread) to queue any reads we want,
submit the queue to the kernel, and then wait for the results. The simplest
example of how this interface could be used turned out to be buffer
prefetching in pg_prewarm.
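
In caller terms the pattern is (a simplified version of the pg_prewarm loop
from the attached patch):

    /* queue a chunk of reads */
    for (block = start; block <= stop; ++block)
        smgrqueueread(rel->rd_smgr, forkNumber, block, blockbuffer.data);

    /* submit the whole chunk to the kernel at once */
    smgrsubmitread(rel->rd_smgr, forkNumber, block);

    /* reap one completion per queued read */
    for (block = start; block <= stop; ++block)
        (void) smgrwaitread(rel->rd_smgr, forkNumber, block);
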
As a result of this experiment I have a few questions, open points and
requests for the community's experience:
* I guess a proper implementation of async IO is a big deal, but it could
also bring significant performance advantages. Is there any (near-term)
future for this kind of async IO in PostgreSQL? Buffer prefetching is the
simplest example, but taking into account that io_uring supports ordering,
barriers and linked events, there are probably more use cases where it could
be useful.
* Assuming that the answer to the previous question is positive, there could
be different strategies for how to use io_uring. So far I see different
opportunities for waiting. Let's say we have prepared a batch of async IO
operations and submitted it. Then we can e.g.
-> just wait for the batch to be finished
-> wait (in the same syscall as submitting) for previously submitted batches,
then start submitting again, and at the end wait for the leftovers
-> peek whether any events have completed, and get only those without waiting
for the whole batch (in this case it's necessary to make sure the submission
queue doesn't overflow)
So it's an open question what to use and when (the sketch after this list
shows what these options look like in liburing terms).
* Does it make sense to use io_uring for smgrprefetch? Originally I added
io_uring parts into FilePrefetch as well (in the form of preparing and
submitting just one buffer), but I'm not sure this API is suitable.
* What might a data structure look like that describes an IO from the
PostgreSQL perspective? With io_uring we need to somehow identify IO
operations that were completed. For now I'm just using a buffer number (a
richer descriptor is sketched after this list). Btw, this experimental patch
has many limitations, e.g. only one ring is used for everything, which is of
course far from ideal and makes identification even more important.
* There are a few more degrees of freedom that io_uring introduces: how many
rings to use, how many events per ring (which is going to be n for the sqe
and 2*n for the cqe), how many IO operations per event to do (similar to
preadv/pwritev we can provide a vector), and what the balance between the
submission and completion queues should be. I guess it will require a lot of
benchmarking to find good values for these.
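
In liburing terms the three waiting options from above map roughly to the
following (a sketch for illustration only):

    struct io_uring_cqe *cqe;
    int i;

    /* 1) submit, then wait for the whole batch */
    io_uring_submit(&ring);
    for (i = 0; i < batch_size; i++)
    {
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }

    /* 2) submit and wait for previous completions in a single syscall */
    io_uring_submit_and_wait(&ring, batch_size);

    /* 3) peek: reap only what has already completed, without blocking */
    while (io_uring_peek_cqe(&ring, &cqe) == 0)
    {
        /* process cqe->res and the attached user data here */
        io_uring_cqe_seen(&ring, cqe);
    }
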
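As for identification, a richer IO descriptor hung off the sqe's user_data
could look something like this (hypothetical fields, just to show the
mechanism):

    typedef struct PgAioDesc
    {
        RelFileNode  rnode;     /* which relation */
        ForkNumber   forknum;   /* which fork */
        BlockNumber  blocknum;  /* which block */
        struct iovec iov;       /* where the kernel reads into */
        int          result;    /* filled in from cqe->res on completion */
    } PgAioDesc;

    /* attach on submission ... */
    io_uring_sqe_set_data(sqe, desc);

    /* ... and recover on completion */
    desc = io_uring_cqe_get_data(cqe);
    desc->result = cqe->res;
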
[1]: https://github.com/torvalds/linux/commit/38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
[2]: http://kernel.dk/io_uring.pdf
[3]: http://git.kernel.dk/cgit/liburing/
Attachment: v1-0001-io-uring.patch
From a7f51cad2800fa9a26b65e5278e49e0c8fb5f9d6 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Sat, 17 Aug 2019 21:22:34 +0200
Subject: [PATCH v1] io uring
POC for io_uring support, using pg_prewarm as an example.
---
configure | 75 +++++++++++++++++++++
contrib/pg_prewarm/pg_prewarm.c | 39 +++++++++++
src/backend/storage/file/fd.c | 116 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 64 ++++++++++++++++++
src/backend/storage/smgr/smgr.c | 40 +++++++++++
src/backend/utils/misc/guc.c | 10 +++
src/include/pg_config.h.in | 3 +
src/include/storage/fd.h | 12 ++++
src/include/storage/md.h | 4 ++
src/include/storage/smgr.h | 6 ++
10 files changed, 369 insertions(+)
diff --git a/configure b/configure
index 2c98e80c19..bb26e2dd21 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LD
LDFLAGS_SL
LDFLAGS_EX
with_zlib
+with_liburing
with_system_tzdata
with_libxslt
with_libxml
@@ -863,6 +864,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_liburing
with_gnu_ld
enable_largefile
enable_float4_byval
@@ -1569,6 +1571,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --without-liburing do not use liburing
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8302,6 +8305,25 @@ else
fi
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+ :
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=yes
+
+fi
@@ -11795,6 +11817,59 @@ fi
fi
+if test "$with_liburing" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for io_uring_queue_init in -luring" >&5
+$as_echo_n "checking for io_uring_queue_init in -luring... " >&6; }
+if ${ac_cv_luring_init+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-luring $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char io_uring_queue_init ();
+int
+main ()
+{
+return io_uring_queue_init ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_luring_init=yes
+else
+ ac_cv_luring_init=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_luring_init" >&5
+$as_echo "$ac_cv_luring_init" >&6; }
+if test "x$ac_cv_luring_init" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBURING 1
+_ACEOF
+
+ LIBS="-luring $LIBS"
+
+else
+ as_fn_error $? "io_uring library not found
+If you have liburing already installed, see config.log for details on the
+failure. It is possible the compiler isn't looking in the proper directory.
+Use --without-liburing to disable io_uring support." "$LINENO" 5
+fi
+
+fi
+
if test "$enable_spinlocks" = yes; then
$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index f3deb47a97..58f2dc02de 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -33,6 +33,7 @@ typedef enum
{
PREWARM_PREFETCH,
PREWARM_READ,
+ PREWARM_ASYNC_READ,
PREWARM_BUFFER
} PrewarmType;
@@ -84,6 +85,8 @@ pg_prewarm(PG_FUNCTION_ARGS)
ptype = PREWARM_PREFETCH;
else if (strcmp(ttype, "read") == 0)
ptype = PREWARM_READ;
+ else if (strcmp(ttype, "asyncread") == 0)
+ ptype = PREWARM_ASYNC_READ;
else if (strcmp(ttype, "buffer") == 0)
ptype = PREWARM_BUFFER;
else
@@ -182,6 +185,42 @@ pg_prewarm(PG_FUNCTION_ARGS)
++blocks_done;
}
}
+ else if (ptype == PREWARM_ASYNC_READ)
+ {
+#ifdef HAVE_LIBURING
+ int chunk = 0, chunk_size = async_queue_depth - 1;
+ int64 start = 0, stop = 0;
+
+ while (stop <= last_block)
+ {
+ start = first_block + chunk * chunk_size;
+ stop = start + chunk_size;
+
+ for (block = start; block <= stop; ++block)
+ {
+ CHECK_FOR_INTERRUPTS();
+ smgrqueueread(rel->rd_smgr, forkNumber, block, blockbuffer.data);
+ }
+
+ smgrsubmitread(rel->rd_smgr, forkNumber, block);
+
+ for (block = start; block <= stop; ++block)
+ {
+ BlockNumber readBlock;
+
+ CHECK_FOR_INTERRUPTS();
+ readBlock = smgrwaitread(rel->rd_smgr, forkNumber, block);
+ ++blocks_done;
+ }
+
+ chunk++;
+ }
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("async read is not supported by this build")));
+#endif
+ }
else if (ptype == PREWARM_BUFFER)
{
/*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a76112d6cd..859e6039e7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -79,6 +79,10 @@
#include <sys/resource.h> /* for getrlimit */
#endif
+#ifdef HAVE_LIBURING
+#include "liburing.h"
+#endif
+
#include "miscadmin.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -101,6 +105,9 @@
#define PG_FLUSH_DATA_WORKS 1
#endif
+
+int async_queue_depth = 64;
+
/*
* We must leave some file descriptors free for system(), the dynamic loader,
* and other code that tries to open files without consulting fd.c. This
@@ -258,6 +265,9 @@ static Oid *tempTableSpaces = NULL;
static int numTempTableSpaces = -1;
static int nextTempTableSpace = 0;
+#ifdef HAVE_LIBURING
+struct io_uring ring;
+#endif
/*--------------------
*
@@ -801,6 +811,15 @@ InitFileAccess(void)
/* register proc-exit hook to ensure temp files are dropped at exit */
on_proc_exit(AtProcExit_Files, 0);
+
+#ifdef HAVE_LIBURING
+ int returnCode = io_uring_queue_init(async_queue_depth, &ring, 0);
+ if (returnCode < 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("Cannot init io uring async_queue_depth %d, %s",
+ async_queue_depth, strerror(-returnCode))));
+#endif
}
/*
@@ -1912,6 +1931,99 @@ retry:
return returnCode;
}
+int
+FileQueueRead(File file, char *buffer, int amount, off_t offset, uint32 id)
+{
+#ifdef HAVE_LIBURING
+ int returnCode;
+ io_data *data;
+ struct io_uring_sqe *sqe;
+
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileQueueRead: %d (%s) " INT64_FORMAT " %d %p",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ amount, buffer));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ data = (io_data *) palloc(sizeof(io_data));
+ data->id = id;
+ data->ioVector.iov_base = buffer;
+ data->ioVector.iov_len = amount;
+
+ sqe = io_uring_get_sqe(&ring);
+ if (sqe != NULL)
+ {
+ io_uring_prep_readv(sqe, vfdP->fd, &data->ioVector, 1, offset);
+ io_uring_sqe_set_data(sqe, data);
+
+ return 0;
+ }
+ else
+ {
+ ereport(FATAL,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("Cannot get sqe, %s", strerror(-returnCode))));
+ }
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("async read is not supported")));
+#endif
+}
+
+int
+FileSubmitRead(void)
+{
+#ifdef HAVE_LIBURING
+ int returnCode;
+ returnCode = io_uring_submit(&ring);
+ if (returnCode < 0)
+ return returnCode;
+
+ return 0;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("async read is not supported")));
+#endif
+}
+
+io_data *
+FileWaitRead(void)
+{
+#ifdef HAVE_LIBURING
+ int returnCode;
+ struct io_uring_cqe *cqe = NULL;
+
+ returnCode = io_uring_wait_cqe(&ring, &cqe);
+ if (returnCode < 0)
+ {
+ io_data *data = (io_data *) palloc(sizeof(io_data));
+ data->returnCode = returnCode;
+ return data;
+ }
+
+ /* fetch user data and result before marking the cqe seen (it may be reused) */
+ io_data *data = io_uring_cqe_get_data(cqe);
+ data->returnCode = cqe->res; /* bytes read, or a negative errno */
+ io_uring_cqe_seen(&ring, cqe);
+ return data;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("async read is not supported")));
+#endif
+}
+
int
FileWrite(File file, char *buffer, int amount, off_t offset,
uint32 wait_event_info)
@@ -2797,6 +2906,10 @@ static void
AtProcExit_Files(int code, Datum arg)
{
CleanupTempFiles(false, true);
+
+#ifdef HAVE_LIBURING
+ io_uring_queue_exit(&ring);
+#endif
}
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..1b988e051c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -663,6 +663,70 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+/*
+ * mdqueueread() -- Queue a read for the specified block from a relation.
+ */
+void
+mdqueueread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer)
+{
+ off_t seekpos;
+ MdfdVec *v;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ if (FileQueueRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, blocknum) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not queue read for block %u in file \"%s\": %m",
+ blocknum, FilePathName(v->mdfd_vfd))));
+}
+
+/*
+ * mdsubmitread() -- Submit all queued reads to the kernel (blocknum is used
+ * only to locate the segment for error reporting).
+ */
+void
+mdsubmitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ MdfdVec *v;
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ if (FileSubmitRead() < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not submit reads for block %u in file \"%s\": %m",
+ blocknum, FilePathName(v->mdfd_vfd))));
+}
+
+/*
+ * mdwaitread() -- Wait for completion of one queued read, returning the
+ * block number of the read that completed.
+ */
+BlockNumber
+mdwaitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ MdfdVec *v;
+ io_data *data;
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ data = FileWaitRead();
+ if (data->returnCode < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not wait read for block %u in file \"%s\": %m",
+ blocknum, FilePathName(v->mdfd_vfd))));
+ else
+ return (BlockNumber) data->id;
+}
+
/*
* mdwrite() -- Write the supplied block at the appropriate location.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b0d9f21e68..6d73d30db4 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,12 @@ typedef struct f_smgr
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
+ void (*smgr_queue_read) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer);
+ void (*smgr_submit_read) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+ BlockNumber (*smgr_wait_read) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
void (*smgr_write) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
@@ -76,6 +82,9 @@ static const f_smgr smgrsw[] = {
.smgr_extend = mdextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
+ .smgr_queue_read = mdqueueread,
+ .smgr_submit_read = mdsubmitread,
+ .smgr_wait_read = mdwaitread,
.smgr_write = mdwrite,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
@@ -565,6 +574,37 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
}
+/*
+ * smgrqueueread() -- queue a read for a particular block from a relation into
+ * the supplied buffer.
+ */
+void
+smgrqueueread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer)
+{
+ smgrsw[reln->smgr_which].smgr_queue_read(reln, forknum, blocknum, buffer);
+}
+
+/*
+ * smgrsubmitread() -- submit all queued reads for a relation to the kernel
+ * (blocknum is only used for error reporting).
+ */
+void
+smgrsubmitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ smgrsw[reln->smgr_which].smgr_submit_read(reln, forknum, blocknum);
+}
+
+/*
+ * smgrwaitread() -- wait for one queued read to complete, returning the
+ * block number of the completed read.
+ */
+BlockNumber
+smgrwaitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ return smgrsw[reln->smgr_which].smgr_wait_read(reln, forknum, blocknum);
+}
+
/*
* smgrwrite() -- Write the supplied buffer out.
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..956d6cfc02 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2338,6 +2338,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"async_queue_depth", PGC_POSTMASTER, RESOURCES_KERNEL,
+ gettext_noop("Sets the io_uring queue depth used for asynchronous IO."),
+ NULL
+ },
+ &async_queue_depth,
+ 64, 25, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/*
* See also CheckRequiredParameterValues() if this parameter changes
*/
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32..21c682f8f0 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -377,6 +377,9 @@
/* Define to 1 if you have the `z' library (-lz). */
#undef HAVE_LIBZ
+/* Define to 1 if you have the `uring' library (-luring). */
+#undef HAVE_LIBURING
+
/* Define to 1 if the system has the type `locale_t'. */
#undef HAVE_LOCALE_T
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index d2a8c52044..dcc9336f10 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -47,6 +47,7 @@ typedef int File;
/* GUC parameter */
extern PGDLLIMPORT int max_files_per_process;
+extern PGDLLIMPORT int async_queue_depth;
extern PGDLLIMPORT bool data_sync_retry;
/*
@@ -67,6 +68,13 @@ extern int max_safe_fds;
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
#endif
+typedef struct io_data
+{
+ struct iovec ioVector;
+ uint32 id;
+ int returnCode;
+} io_data;
+
/*
* prototypes for functions in fd.c
*/
@@ -78,6 +86,10 @@ extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int FileQueueRead(File file, char *buffer, int amount, off_t offset,
+ uint32 id);
+extern int FileSubmitRead(void);
+extern io_data *FileWaitRead(void);
extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern off_t FileSize(File file);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..1048853232 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,6 +32,10 @@ extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
+extern void mdqueueread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdsubmitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
+extern BlockNumber mdwaitread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index d286c8c7b1..0e51f26460 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -97,6 +97,12 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
+extern void smgrqueueread(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer);
+extern void smgrsubmitread(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern BlockNumber smgrwaitread(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
--
2.21.0
Hi,
On 2019-08-19 20:20:46 +0200, Dmitry Dolgov wrote:
> I've been following the new Linux IO interface "io_uring" for some time
> now; it was introduced relatively recently [1]. The short description
> says:
>
>     Shared application/kernel submission and completion ring pairs, for
>     supporting fast/efficient IO.

Yes, it's quite promising. I also played around with it some. One thing I
particularly like is that it seems somewhat realistic to have an abstraction
that supports both io_uring and Windows' IOCP - personally I don't think we
need support for more than those.

> For us the important part is probably that it's asynchronous IO that
> works not only with O_DIRECT, but also with buffered access.

Note that while the buffered access does allow for some acceleration, it
currently does have quite noticeable CPU overhead.

> Since I haven't found many discussions about async IO in the hackers
> archives, out of curiosity I decided to prepare an experimental patch to
> see what it would look like to use io_uring in PostgreSQL.

Cool!

> So far I've tested this patch only inside a qemu VM on the latest
> io_uring branch from the linux-block tree. The result is relatively
> simple, and introduces a new interface (smgrqueueread, smgrsubmitread and
> smgrwaitread) to queue any reads we want, submit the queue to the kernel,
> and then wait for the results. The simplest example of how this interface
> could be used turned out to be buffer prefetching in pg_prewarm.

Hm. I'm a bit doubtful that that's going in the direction of being the
right interface. I think we'd basically have to insist that all AIO-capable
smgrs use one common AIO layer (note that the UNDO patches add another smgr
implementation). Otherwise I think we'll have a very hard time making them
cooperate. An interface like this would also lead to a lot of duplicated
interfaces, because we'd basically need most of the smgr interface
functions duplicated.
I suspect we'd rather have to build something where the existing functions
grow a parameter controlling synchronicity. If AIO is allowed and supported,
the smgr implementation would initiate the IO, together with a completion
function for it, and return some value allowing the caller to wait for the
result if desirable.
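
For illustration, such an interface could look roughly like this (names and
shapes invented here, just to make the idea concrete):

    typedef struct PgAio PgAio;     /* opaque in-progress IO */

    /* completion callback, run when the kernel reports the IO as done */
    typedef void (*smgr_io_complete) (PgAio *io, int result, void *arg);

    /*
     * With async = false this would behave like today's smgrread() and
     * return NULL; with async = true it would only initiate the IO and
     * return a handle the caller can wait on if it needs the result
     * immediately.
     */
    extern PgAio *smgrstartread(SMgrRelation reln, ForkNumber forknum,
                                BlockNumber blocknum, char *buffer,
                                bool async,
                                smgr_io_complete on_complete, void *arg);

    extern void smgrwaitio(PgAio *io);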

> As a result of this experiment I have a few questions, open points and
> requests for the community's experience:
>
> * I guess a proper implementation of async IO is a big deal, but it could
>   also bring significant performance advantages. Is there any (near-term)
>   future for this kind of async IO in PostgreSQL? Buffer prefetching is
>   the simplest example, but taking into account that io_uring supports
>   ordering, barriers and linked events, there are probably more use cases
>   where it could be useful.

The lowest hanging fruit that I can see - and which I played with - is
making the writeback flushing use async IO. That's particularly
interesting for bgwriter. As it commonly only performs random IO, and
as we need to keep the number of dirty buffers in the kernel small to
avoid huge latency spikes, being able to submit IOs asynchronously can
yield significant benefits.
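
(For concreteness, flushing a batch of dirty buffers could then be a loop
like the following; invented helpers, just a sketch:)

    for (i = 0; i < n_dirty; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_writev(sqe, dirty[i].fd, &dirty[i].iov, 1,
                             dirty[i].offset);
        io_uring_sqe_set_data(sqe, &dirty[i]);
    }
    /* one syscall instead of one pwrite() per buffer */
    io_uring_submit(&ring);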

> * Assuming that the answer to the previous question is positive, there
>   could be different strategies for how to use io_uring. So far I see
>   different opportunities for waiting. Let's say we have prepared a batch
>   of async IO operations and submitted it. Then we can e.g.
>   -> just wait for the batch to be finished
>   -> wait (in the same syscall as submitting) for previously submitted
>      batches, then start submitting again, and at the end wait for the
>      leftovers
>   -> peek whether any events have completed, and get only those without
>      waiting for the whole batch (in this case it's necessary to make
>      sure the submission queue doesn't overflow)
>   So it's an open question what to use and when.

I don't think there's much point in working only with complete batches. I
think we'd lose too much of the benefit by introducing unnecessary
synchronous operations. I think we'd need to design the interface so that
there can constantly be in-progress IOs, that we block when the queue is
full, and that finished IOs are handled using a callback mechanism or such.
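
If I understand that right, the shape would be something like this (a sketch
with invented helpers, using liburing's non-blocking peek):

    while (have_more_io())
    {
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        /* if the submission queue is full, block until an IO completes */
        while ((sqe = io_uring_get_sqe(&ring)) == NULL)
        {
            io_uring_submit(&ring);
            io_uring_wait_cqe(&ring, &cqe);
            dispatch_completion(cqe);   /* runs the IO's callback */
            io_uring_cqe_seen(&ring, cqe);
        }
        prepare_next_io(sqe);

        /* opportunistically reap finished IOs without blocking */
        while (io_uring_peek_cqe(&ring, &cqe) == 0)
        {
            dispatch_completion(cqe);
            io_uring_cqe_seen(&ring, cqe);
        }
    }
    io_uring_submit(&ring);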

> * Does it make sense to use io_uring for smgrprefetch? Originally I added
>   io_uring parts into FilePrefetch as well (in the form of preparing and
>   submitting just one buffer), but I'm not sure this API is suitable.

I have a hard time seeing that being worthwhile, unless we change the way
it's used significantly. I think to really benefit, we'd have to be able to
lock multiple buffers, and have io_uring prefetch directly into buffers.

> * What might a data structure look like that describes an IO from the
>   PostgreSQL perspective? With io_uring we need to somehow identify IO
>   operations that were completed. For now I'm just using a buffer number.

In my hacks I've used the sqe's user_data to point to a struct with
information about the IO.

> Btw, this experimental patch has many limitations, e.g. only one ring is
> used for everything, which is of course far from ideal and makes
> identification even more important.

I think we don't want to use more than one ring. Using several makes it too
complicated to have interdependencies between operations (e.g. waiting for
fsyncs before submitting further writes). I also don't really see why we
would benefit from more?

> * There are a few more degrees of freedom that io_uring introduces: how
>   many rings to use, how many events per ring (which is going to be n for
>   the sqe and 2*n for the cqe), how many IO operations per event to do
>   (similar to preadv/pwritev we can provide a vector), and what the
>   balance between the submission and completion queues should be. I guess
>   it will require a lot of benchmarking to find good values for these.

One thing you didn't mention: A lot of this also requires that we
overhaul the way buffer locking for IOs works. Currently we really can
only have one proper IO in progress at a time, which clearly isn't
sufficient for anything that wants to use AIO.
Greetings,
Andres Freund
On Mon, Aug 19, 2019 at 10:21 PM Andres Freund <andres@anarazel.de> wrote:
> > For us the important part is probably that it's asynchronous IO that
> > works not only with O_DIRECT, but also with buffered access.
>
> Note that while the buffered access does allow for some acceleration, it
> currently does have quite noticeable CPU overhead.

I haven't looked deeply at benchmarks yet; are there any public results that
show this? So far I've seen only [1], but it doesn't say much about CPU
overhead. It could probably also be interesting to check io_uring-bench.

> > So far I've tested this patch only inside a qemu VM on the latest
> > io_uring branch from the linux-block tree. The result is relatively
> > simple, and introduces a new interface (smgrqueueread, smgrsubmitread
> > and smgrwaitread) to queue any reads we want, submit the queue to the
> > kernel, and then wait for the results. The simplest example of how this
> > interface could be used turned out to be buffer prefetching in
> > pg_prewarm.
>
> Hm. I'm a bit doubtful that that's going in the direction of being the
> right interface. I think we'd basically have to insist that all
> AIO-capable smgrs use one common AIO layer (note that the UNDO patches
> add another smgr implementation). Otherwise I think we'll have a very
> hard time making them cooperate. An interface like this would also lead
> to a lot of duplicated interfaces, because we'd basically need most of
> the smgr interface functions duplicated.
>
> I suspect we'd rather have to build something where the existing
> functions grow a parameter controlling synchronicity. If AIO is allowed
> and supported, the smgr implementation would initiate the IO, together
> with a completion function for it, and return some value allowing the
> caller to wait for the result if desirable.

Agreed, all AIO-capable smgrs need to use some common layer. But it seems
hard to implement some async operations only by adding more parameters,
e.g. accumulating AIO operations before submitting them to the kernel.

> > As a result of this experiment I have a few questions, open points and
> > requests for the community's experience:
> >
> > * I guess a proper implementation of async IO is a big deal, but it
> >   could also bring significant performance advantages. Is there any
> >   (near-term) future for this kind of async IO in PostgreSQL? Buffer
> >   prefetching is the simplest example, but taking into account that
> >   io_uring supports ordering, barriers and linked events, there are
> >   probably more use cases where it could be useful.
>
> The lowest hanging fruit that I can see - and which I played with - is
> making the writeback flushing use async IO. That's particularly
> interesting for bgwriter. As it commonly only performs random IO, and as
> we need to keep the number of dirty buffers in the kernel small to avoid
> huge latency spikes, being able to submit IOs asynchronously can yield
> significant benefits.

Yeah, sounds interesting. Are there any results you can already share?
Maybe it's possible to collaborate on this topic?

> > * Assuming that the answer to the previous question is positive, there
> >   could be different strategies for how to use io_uring. So far I see
> >   different opportunities for waiting. Let's say we have prepared a
> >   batch of async IO operations and submitted it. Then we can e.g.
> >   -> just wait for the batch to be finished
> >   -> wait (in the same syscall as submitting) for previously submitted
> >      batches, then start submitting again, and at the end wait for the
> >      leftovers
> >   -> peek whether any events have completed, and get only those without
> >      waiting for the whole batch (in this case it's necessary to make
> >      sure the submission queue doesn't overflow)
> >   So it's an open question what to use and when.
>
> I don't think there's much point in working only with complete batches.
> I think we'd lose too much of the benefit by introducing unnecessary
> synchronous operations. I think we'd need to design the interface so that
> there can constantly be in-progress IOs, that we block when the queue is
> full, and that finished IOs are handled using a callback mechanism or
> such.

What would happen if at a particular moment we suddenly don't have enough IO
to fill the queue? Probably there should be more triggers for blocking.

> > * What might a data structure look like that describes an IO from the
> >   PostgreSQL perspective? With io_uring we need to somehow identify IO
> >   operations that were completed. For now I'm just using a buffer
> >   number.
>
> In my hacks I've used the sqe's user_data to point to a struct with
> information about the IO.

Yes, that's the same approach I'm using too. I'm just not sure what exactly
this "struct with information about the IO" should be; what should it
contain, ideally?

> > Btw, this experimental patch has many limitations, e.g. only one ring
> > is used for everything, which is of course far from ideal and makes
> > identification even more important.
>
> I think we don't want to use more than one ring. Using several makes it
> too complicated to have interdependencies between operations (e.g.
> waiting for fsyncs before submitting further writes). I also don't
> really see why we would benefit from more?

Since the balance between SQEs and CQEs can be important, and there could be
different "sources of AIO" with different submission frequencies, I thought
it could be handy to separate "heavily loaded" rings from general purpose
rings (especially in the case of ordered AIO).

> > * There are a few more degrees of freedom that io_uring introduces: how
> >   many rings to use, how many events per ring (which is going to be n
> >   for the sqe and 2*n for the cqe), how many IO operations per event to
> >   do (similar to preadv/pwritev we can provide a vector), and what the
> >   balance between the submission and completion queues should be. I
> >   guess it will require a lot of benchmarking to find good values for
> >   these.
>
> One thing you didn't mention: A lot of this also requires that we
> overhaul the way buffer locking for IOs works. Currently we really can
> only have one proper IO in progress at a time, which clearly isn't
> sufficient for anything that wants to use AIO.

Yeah, that's correct. My hope is that this could be done in small steps,
e.g. introducing AIO only for some particular cases to see how it would
work.
[1]: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/