O_DIRECT for relations and SLRUs (Prototype)

Started by Michael Paquierabout 7 years ago12 messages

michael@paquier.xyz

about 7 years ago

1 attachment(s)

Hi all,
(Added Kevin in CC)

There have been over the ages discussions about getting better
O_DIRECT support to close the gap with other players in the database
market, but I have not actually seen on those lists a patch which
makes use of O_DIRECT for relations and SLRUs (perhaps I missed it,
anyway that would most likely conflict now).

Attached is a toy patch that I have begun using for tests in this
area. That's nothing really serious at this stage, but you can use
that if you would like to see the impact of O_DIRECT. Of course,
things get significantly slower. The patch is able to compile, pass
regression tests, and looks stable. So that's usable for experiments.
The patch uses a GUC called direct_io, enabled to true to ease
regression testing when applying it.

Note that pg_attribute_aligned() cannot be used as that's not an
option with clang and a couple of other comilers as far as I know, so
the patch uses a simple set of placeholder buffers large enough to be
aligned with the OS pages, which should be 4k for Linux by the way,
and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care
much with such details.

If there is interest for such things, perhaps we could get a patch
sorted out, with some angles of attack like:
- Move to use of page-aligned buffers for relations and SLRUs.
- Split use of O_DIRECT for SLRU and relations into separate GUCs.
- Perhaps other things.
However this is a large and very controversial topic, and of course
more complex than the experiment attached, still this prototype is fun
to play with.

Thanks for reading!
--
Michael

Attachments:

direct-io-v1.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a3ba8cddba 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -120,6 +120,13 @@ typedef enum
 	SLRU_CLOSE_FAILED
 } SlruErrorCause;
 
+/*
+ * Static area used as placeholder for O_DIRECT operations.  Make sure
+ * to keep its size at twice the block size so as it would be aligned
+ * with the OS pages.
+ */
+static char direct_buf[BLCKSZ + BLCKSZ];
+
 static SlruErrorCause slru_errcause;
 static int	slru_errno;
 
@@ -644,6 +651,12 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	int			offset = rpageno * BLCKSZ;
 	char		path[MAXPGPATH];
 	int			fd;
+	int			direct_flag = 0;
+	char	   *aligned_buf = (char *) TYPEALIGN(BLCKSZ, direct_buf);
+
+	/* Use direct I/O if configured as such */
+	if (direct_io)
+		direct_flag |= O_DIRECT | O_SYNC;
 
 	SlruFileName(ctl, path, segno);
 
@@ -654,7 +667,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 	 * where the file doesn't exist, and return zeroes instead.
 	 */
-	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+	fd = OpenTransientFile(path, O_RDWR | PG_BINARY | direct_flag);
 	if (fd < 0)
 	{
 		if (errno != ENOENT || !InRecovery)
@@ -681,7 +694,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
-	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
+	if (read(fd, aligned_buf, BLCKSZ) != BLCKSZ)
 	{
 		pgstat_report_wait_end();
 		slru_errcause = SLRU_READ_FAILED;
@@ -698,6 +711,9 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 		return false;
 	}
 
+	/* copy contents back to the slot */
+	memcpy(shared->page_buffer[slotno], aligned_buf, BLCKSZ);
+
 	return true;
 }
 
@@ -724,6 +740,12 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 	int			offset = rpageno * BLCKSZ;
 	char		path[MAXPGPATH];
 	int			fd = -1;
+	int			direct_flag = 0;
+	char	   *aligned_buf = (char *) TYPEALIGN(BLCKSZ, direct_buf);
+
+	/* Use direct I/O if configured as such */
+	if (direct_io)
+		direct_flag |= O_DIRECT | O_SYNC;
 
 	/*
 	 * Honor the write-WAL-before-data rule, if appropriate, so that we do not
@@ -804,7 +826,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		 * don't use O_EXCL or O_TRUNC or anything like that.
 		 */
 		SlruFileName(ctl, path, segno);
-		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY | direct_flag);
 		if (fd < 0)
 		{
 			slru_errcause = SLRU_OPEN_FAILED;
@@ -840,9 +862,12 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		return false;
 	}
 
+	/* use aligned buffer for O_DIRECT */
+	memcpy(aligned_buf, shared->page_buffer[slotno], BLCKSZ);
+
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
-	if (write(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
+	if (write(fd, aligned_buf, BLCKSZ) != BLCKSZ)
 	{
 		pgstat_report_wait_end();
 		/* if write didn't set errno, assume problem is no disk space */
@@ -858,12 +883,13 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 
 	/*
 	 * If not part of Flush, need to fsync now.  We assume this happens
-	 * infrequently enough that it's not a performance issue.
+	 * infrequently enough that it's not a performance issue.  O_DIRECT
+	 * has already made sure that this happens.
 	 */
 	if (!fdata)
 	{
 		pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
-		if (ctl->do_fsync && pg_fsync(fd))
+		if (!direct_io && ctl->do_fsync && pg_fsync(fd))
 		{
 			pgstat_report_wait_end();
 			slru_errcause = SLRU_FSYNC_FAILED;
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..92bcb57934 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -148,6 +148,9 @@ int			max_safe_fds = 32;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* Use O_DIRECT for operations involving relation files and SLRUs */
+bool		direct_io = true;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c37dd1290b..a927c8bd5a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -159,6 +159,12 @@ static MemoryContext pendingOpsCxt; /* context for the above  */
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+/*
+ * Static area used as placeholder for O_DIRECT operations.  Make sure
+ * to keep its size at twice the block size so as it would be aligned
+ * with the OS pages.
+ */
+static char direct_buf[BLCKSZ + BLCKSZ];
 
 /*** behavior for mdopen & _mdfd_getseg ***/
 /* ereport if segment not present */
@@ -296,6 +302,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
+	int			direct_flag = 0;
 
 	if (isRedo && reln->md_num_open_segs[forkNum] > 0)
 		return;					/* created and opened already... */
@@ -304,7 +311,11 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 
 	path = relpath(reln->smgr_rnode, forkNum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	/* Use direct I/O if configured as such */
+	if (direct_io)
+		direct_flag |= O_DIRECT | O_SYNC;
+
+	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY | direct_flag);
 
 	if (fd < 0)
 	{
@@ -317,7 +328,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
 		 * already, even if isRedo is not set.  (See also mdopen)
 		 */
 		if (isRedo || IsBootstrapProcessingMode())
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | direct_flag);
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -498,12 +509,18 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	char	   *aligned_buf = (char *) TYPEALIGN(BLCKSZ, direct_buf);
 
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum >= mdnblocks(reln, forknum));
 #endif
 
+	/*
+	 * Copy contents into an aligned buffer for direct I/O.
+	 */
+	memcpy(aligned_buf, buffer, BLCKSZ);
+
 	/*
 	 * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
 	 * more --- we mustn't create a block whose number actually is
@@ -522,7 +539,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
+	if ((nbytes = FileWrite(v->mdfd_vfd, aligned_buf, BLCKSZ, seekpos,
+							WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
 	{
 		if (nbytes < 0)
 			ereport(ERROR,
@@ -561,6 +579,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 	MdfdVec    *mdfd;
 	char	   *path;
 	File		fd;
+	int			direct_flag = 0;
 
 	/* No work if already open */
 	if (reln->md_num_open_segs[forknum] > 0)
@@ -568,7 +587,11 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rnode, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	/* Use direct I/O if configured as such */
+	if (direct_io)
+		direct_flag |= O_DIRECT | O_SYNC;
+
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | direct_flag);
 
 	if (fd < 0)
 	{
@@ -579,7 +602,8 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
 		 * substitute for mdcreate() in bootstrap mode only. (See mdcreate)
 		 */
 		if (IsBootstrapProcessingMode())
-			fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+			fd = PathNameOpenFile(path,
+					O_RDWR | O_CREAT | O_EXCL | PG_BINARY | direct_flag);
 		if (fd < 0)
 		{
 			if ((behavior & EXTENSION_RETURN_NULL) &&
@@ -719,6 +743,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	char	   *aligned_buf = (char *) TYPEALIGN(BLCKSZ, direct_buf);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
@@ -733,7 +758,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+	nbytes = FileRead(v->mdfd_vfd, aligned_buf, BLCKSZ, seekpos,
+					  WAIT_EVENT_DATA_FILE_READ);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
 									   reln->smgr_rnode.node.spcNode,
@@ -760,7 +786,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		 * update a block that was later truncated away.
 		 */
 		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
+			MemSet(aligned_buf, 0, BLCKSZ);
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
@@ -768,6 +794,11 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 							blocknum, FilePathName(v->mdfd_vfd),
 							nbytes, BLCKSZ)));
 	}
+
+	/*
+	 * Copy the result from the aligned buffer to the real one.
+	 */
+	memcpy(buffer, aligned_buf, BLCKSZ);
 }
 
 /*
@@ -784,12 +815,18 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	off_t		seekpos;
 	int			nbytes;
 	MdfdVec    *v;
+	char	   *aligned_buf = (char *) TYPEALIGN(BLCKSZ, direct_buf);
 
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
+	/*
+	 * Copy contents into an aligned buffer for direct I/O.
+	 */
+	memcpy(aligned_buf, buffer, BLCKSZ);
+
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
 										 reln->smgr_rnode.node.spcNode,
 										 reln->smgr_rnode.node.dbNode,
@@ -803,7 +840,8 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+	nbytes = FileWrite(v->mdfd_vfd, aligned_buf, BLCKSZ, seekpos,
+					   WAIT_EVENT_DATA_FILE_WRITE);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
@@ -1011,11 +1049,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	{
 		MdfdVec    *v = &reln->md_seg_fds[forknum][segno - 1];
 
-		if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
+		/* nothing to do with direct I/O */
+		if (!direct_io &&
+			FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
 			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not fsync file \"%s\": %m",
 							FilePathName(v->mdfd_vfd))));
+
 		segno--;
 	}
 }
@@ -1414,9 +1455,15 @@ mdpostckpt(void)
 static void
 register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 {
+	return;
+
 	/* Temp relations should never be fsync'd */
 	Assert(!SmgrIsTemp(reln));
 
+	/* Not actually needed with direct I/O */
+	if (direct_io)
+		return;
+
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
@@ -1799,11 +1846,16 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	MdfdVec    *v;
 	int			fd;
 	char	   *fullpath;
+	int			direct_flag = 0;
 
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
+	/* Use direct I/O if configured as such */
+	if (direct_io)
+		direct_flag |= O_DIRECT | O_SYNC;
+
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | direct_flag | oflags);
 
 	pfree(fullpath);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f81e0424e7..a0809ad628 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -631,6 +631,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Background Writer"),
 	/* RESOURCES_ASYNCHRONOUS */
 	gettext_noop("Resource Usage / Asynchronous Behavior"),
+	/* RESOURCES_IO */
+	gettext_noop("Resource Usage / I/O"),
 	/* WAL */
 	gettext_noop("Write-Ahead Log"),
 	/* WAL_SETTINGS */
@@ -1875,6 +1877,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"direct_io", PGC_POSTMASTER, RESOURCES_IO,
+			gettext_noop("Use direct I/O when flushing relations and SLRUs."),
+		},
+		&direct_io,
+		true,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..ecaa347657 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -48,6 +48,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern PGDLLIMPORT bool direct_io;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index a0970b2e1c..17178862a7 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -65,6 +65,7 @@ enum config_group
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
+	RESOURCES_IO,
 	WAL,
 	WAL_SETTINGS,
 	WAL_CHECKPOINTS,

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: Michael Paquier (#1)

Re: O_DIRECT for relations and SLRUs (Prototype)

Hi!

12 янв. 2019 г., в 9:46, Michael Paquier <michael@paquier.xyz> написал(а):

Attached is a toy patch that I have begun using for tests in this
area. That's nothing really serious at this stage, but you can use
that if you would like to see the impact of O_DIRECT. Of course,
things get significantly slower.

Cool!
I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area.

Thanks!

Best regards, Andrey Borodin.

Thomas Munro

thomas.munro@enterprisedb.com

almost 7 years ago

In reply to: Andrey Borodin (#2)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Sun, Jan 13, 2019 at 5:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi!

12 янв. 2019 г., в 9:46, Michael Paquier <michael@paquier.xyz> написал(а):

Attached is a toy patch that I have begun using for tests in this
area. That's nothing really serious at this stage, but you can use
that if you would like to see the impact of O_DIRECT. Of course,
things get significantly slower.

Cool!
I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area.

+1, thanks for sharing the patch. Even though just turning on
O_DIRECT is the trivial part of this project, it's good to encourage
discussion. We may indeed become more sensitive to the quality of
buffer eviction algorithms, but it seems like the main work to regain
lost performance will be the background IO scheduling piece:

1. We need a new "bgreader" process to do read-ahead. I think you'd
want a way to tell it with explicit hints (for example, perhaps
sequential scans would advertise that they're reading sequentially so
that it starts to slurp future blocks into the buffer pool, and
streaming replicas might look ahead in the WAL and tell it what's
coming). In theory this might be better than the heuristics OSes use
to guess our access pattern and pre-fetch into the page cache, since
we have better information (and of course we're skipping a buffer
layer).

2. We need a new kind of bgwriter/syncer that aggressively creates
clean pages so that foreground processes rarely have to evict (since
that is now super slow), but also efficiently finds ranges of dirty
blocks that it can write in big sequential chunks.

3. We probably want SLRUs to use the main buffer pool, instead of
their own mini-pools, so they can benefit from the above.

Whether we need multiple bgreader and bgwriter processes or perhaps a
general IO scheduler process may depend on whether we also want to
switch to async (multiplexing from a single process). Starting simple
with a traditional sync IO and N processes seems OK to me.

--
Thomas Munro
http://www.enterprisedb.com

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Thomas Munro (#3)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:

1. We need a new "bgreader" process to do read-ahead. I think you'd
want a way to tell it with explicit hints (for example, perhaps
sequential scans would advertise that they're reading sequentially so
that it starts to slurp future blocks into the buffer pool, and
streaming replicas might look ahead in the WAL and tell it what's
coming). In theory this might be better than the heuristics OSes use
to guess our access pattern and pre-fetch into the page cache, since
we have better information (and of course we're skipping a buffer
layer).

Yes, that could be interesting mainly for analytics by being able to
snipe better than the OS readahead.

2. We need a new kind of bgwriter/syncer that aggressively creates
clean pages so that foreground processes rarely have to evict (since
that is now super slow), but also efficiently finds ranges of dirty
blocks that it can write in big sequential chunks.

Okay, that's a new idea. A bgwriter able to do syncs in chunks would
be also interesting with O_DIRECT, no?

3. We probably want SLRUs to use the main buffer pool, instead of
their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see
any reference to it.

Whether we need multiple bgreader and bgwriter processes or perhaps a
general IO scheduler process may depend on whether we also want to
switch to async (multiplexing from a single process). Starting simple
with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step?
Or I misunderstood you :)

One of the reasons why I have begun this thread is that since we have
heard about the fsync issues on Linux, I think that there is room
for giving our user base more control of their fate without relying on
the Linux community decisions to potentially eat data and corrupt a
cluster with a page dirty bit cleared without its data actually
flushed. Even the latest kernels are not fixing all the patterns with
open fds across processes, switching the problem from one corner of
the table to another, and there are folks patching the Linux kernel to
make Postgres more reliable from this perspective, and living happily
with this option. As long as the option can be controlled and
defaults to false, it seems to be that we could do something. Even if
the performance is bad, this gives the user control of how he/she
wants things to be done.
--
Michael

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: Michael Paquier (#4)

Re: O_DIRECT for relations and SLRUs (Prototype)

13 янв. 2019 г., в 14:02, Michael Paquier <michael@paquier.xyz> написал(а):

3. We probably want SLRUs to use the main buffer pool, instead of
their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see
any reference to it.

I think it's here /messages/by-id/CAEepm=0o-=d8QPO=YGFiBSqq2p6KOvPVKG3bggZi5Pv4nQw8nw@mail.gmail.com

As long as the option can be controlled and
defaults to false, it seems to be that we could do something. Even if
the performance is bad, this gives the user control of how he/she
wants things to be done.

I like the idea of having this switch, I believe it will make development in this direction easier.
But I think there will be complain from users like "this feature is done wrong" due to really bad performance.

Best regards, Andrey Borodin.

Thomas Munro

thomas.munro@enterprisedb.com

almost 7 years ago

In reply to: Michael Paquier (#4)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Sun, Jan 13, 2019 at 10:02 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:

1. We need a new "bgreader" process to do read-ahead. I think you'd
want a way to tell it with explicit hints (for example, perhaps
sequential scans would advertise that they're reading sequentially so
that it starts to slurp future blocks into the buffer pool, and
streaming replicas might look ahead in the WAL and tell it what's
coming). In theory this might be better than the heuristics OSes use
to guess our access pattern and pre-fetch into the page cache, since
we have better information (and of course we're skipping a buffer
layer).

Yes, that could be interesting mainly for analytics by being able to
snipe better than the OS readahead.

2. We need a new kind of bgwriter/syncer that aggressively creates
clean pages so that foreground processes rarely have to evict (since
that is now super slow), but also efficiently finds ranges of dirty
blocks that it can write in big sequential chunks.

Okay, that's a new idea. A bgwriter able to do syncs in chunks would
be also interesting with O_DIRECT, no?

Well I'm just describing the stuff that the OS is doing for us in
another layer. Evicting dirty buffers currently consists of a
buffered pwrite(), which we can do a huge number of per second (given
enough spare RAM), but with O_DIRECT | O_SYNC we'll be limited by
storage device random IOPS, so workloads that evict dirty buffers in
foreground processes regularly will suffer. bgwriter should make sure
we always find clean buffers without waiting when we need them.

Yeah, I think pwrite() larger than 8KB at a time would be a goal, to
get large IO request sizes all the way down to the storage.

3. We probably want SLRUs to use the main buffer pool, instead of
their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see
any reference to it.

/messages/by-id/20180814213500.GA74618@60f81dc409fc.ant.amazon.com

Whether we need multiple bgreader and bgwriter processes or perhaps a
general IO scheduler process may depend on whether we also want to
switch to async (multiplexing from a single process). Starting simple
with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step?
Or I misunderstood you :)

I just meant that if we take over all the read-ahead and write-behind
work and use classic synchronous IO syscalls like pread()/pwrite(),
we'll probably need multiple processes to do it, depending on how much
IO concurrency the storage layer can take.

--
Thomas Munro
http://www.enterprisedb.com

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 7 years ago

In reply to: Michael Paquier (#4)

RE: O_DIRECT for relations and SLRUs (Prototype)

From: Michael Paquier [mailto:michael@paquier.xyz]

One of the reasons why I have begun this thread is that since we have heard
about the fsync issues on Linux, I think that there is room for giving our
user base more control of their fate without relying on the Linux community
decisions to potentially eat data and corrupt a cluster with a page dirty
bit cleared without its data actually flushed. Even the latest kernels
are not fixing all the patterns with open fds across processes, switching
the problem from one corner of the table to another, and there are folks
patching the Linux kernel to make Postgres more reliable from this
perspective, and living happily with this option. As long as the option
can be controlled and defaults to false, it seems to be that we could do
something. Even if the performance is bad, this gives the user control
of how he/she wants things to be done.

Thank you for starting an interesting topic. We probably want the direct I/O. On a INSERT and UPDATE heavy system with PostgreSQL 9.2, we suffered from occasional high response times due to the Linux page cache activity. Postgres processes competed for the page cache to read/write the data files, write online and archive WAL files, and write the server log files (auto_explain and autovacuum workers emitted a lot of logs.) The user with Oracle experience asked why PostgreSQL doesn't handle database I/O by itself...

And I wonder how useful the direct I/O for low latency devices like the persistent memory. The overhead of the page cache may become relatively higher.

Regards
Takayuki Tsunakawa

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: Michael Paquier (#1)

Re: O_DIRECT for relations and SLRUs (Prototype)

12 янв. 2019 г., в 9:46, Michael Paquier <michael@paquier.xyz> написал(а):

Note that pg_attribute_aligned() cannot be used as that's not an
option with clang and a couple of other comilers as far as I know, so
the patch uses a simple set of placeholder buffers large enough to be
aligned with the OS pages, which should be 4k for Linux by the way,
and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care
much with such details.

Is it possible to avoid those memcopy's by aligning available buffers instead?
I couldn't understand this from the patch and this thread.

Best regards, Andrey Borodin.

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Andrey Borodin (#8)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:

Is it possible to avoid those memcpy's by aligning available buffers
instead? I couldn't understand this from the patch and this thread.

Sure, it had better do that. That's just a lazy implementation.
--
Michael

#10

Maksim Milyutin

milyutinma@gmail.com

almost 7 years ago

In reply to: Michael Paquier (#9)

Re: O_DIRECT for relations and SLRUs (Prototype)

On 1/15/19 11:28 AM, Michael Paquier wrote:

On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:

Is it possible to avoid those memcpy's by aligning available buffers
instead? I couldn't understand this from the patch and this thread.

Sure, it had better do that. That's just a lazy implementation.

Hi!

Could you specify all cases when buffers will not be aligned with BLCKSZ?

AFAIC shared and temp buffers are aligned. And what ones are not?

--
Regards, Maksim Milyutin

#11

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Maksim Milyutin (#10)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Tue, Jan 15, 2019 at 07:40:12PM +0300, Maksim Milyutin wrote:

Could you specify all cases when buffers will not be aligned with BLCKSZ?

AFAIC shared and temp buffers are aligned. And what ones are not?

SLRU buffers are not aligned with the OS pages (aka alignment with
4096 at least). There are also a bunch of code paths where the callers
of mdread() or mdwrite() don't do that, which makes a correct patch
more invasive.
--
Michael

#12

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Thomas Munro (#3)

Re: O_DIRECT for relations and SLRUs (Prototype)

On Sat, Jan 12, 2019 at 4:36 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

1. We need a new "bgreader" process to do read-ahead. I think you'd
want a way to tell it with explicit hints (for example, perhaps
sequential scans would advertise that they're reading sequentially so
that it starts to slurp future blocks into the buffer pool, and
streaming replicas might look ahead in the WAL and tell it what's
coming). In theory this might be better than the heuristics OSes use
to guess our access pattern and pre-fetch into the page cache, since
we have better information (and of course we're skipping a buffer
layer).

Right, like if we're reading the end of relation file 16384, we can
prefetch the beginning of 16384.1, but the OS won't know to do that.

2. We need a new kind of bgwriter/syncer that aggressively creates
clean pages so that foreground processes rarely have to evict (since
that is now super slow), but also efficiently finds ranges of dirty
blocks that it can write in big sequential chunks.

Yeah.

3. We probably want SLRUs to use the main buffer pool, instead of
their own mini-pools, so they can benefit from the above.

Right. I think this is important, and it makes me think that maybe
Michael's patch won't help us much in the end. I believe that the
number of pages that are needed for clog data, at least, can very
significantly depending on workload and machine size, so there's not
one number there that is going to work for everybody, and the
algorithms the SLRU code uses for page management have O(n) stuff in
them, so they don't scale well to large numbers of SLRU buffers
anyway. I think we should try to unify the SLRU stuff with
shared_buffers, and then have a test patch like Michael's (not for
commit) which we can use to see the impact of that, and then try to
reduce that impact with the stuff you mention under #1 and #2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company