possible new option for wal_sync_method

Started by Dan Scales · 7 messages
#1 Dan Scales
scales@vmware.com
1 attachment(s)

When running Postgres on a single ext3 filesystem on Linux, we find that
the attached simple patch gives a significant performance benefit (7-8% in
the numbers below). The patch adds a new option for wal_sync_method, which
is "open_direct". With this option, the WAL is always opened with
O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
O_DIRECT should be correct. All WAL logs are fully allocated before
being used, and the WAL buffers are 8K-aligned, so all direct writes are
guaranteed to complete (reach the device) before the write returns. (See
http://lwn.net/Articles/348739/)
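
[Editor's note: to make the mechanism concrete, here is a minimal standalone
sketch, illustrative only and not part of the patch (the file name and
constants are made up), of the I/O pattern described above: an aligned
O_DIRECT write into a preallocated file.]

#define _GNU_SOURCE		/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192		/* 8K WAL block size, as in the post */

int
main(void)
{
	void   *buf;
	int		fd;

	/* O_DIRECT requires an aligned user buffer ... */
	if (posix_memalign(&buf, BLCKSZ, BLCKSZ) != 0)
		return 1;
	memset(buf, 'x', BLCKSZ);

	/* "waltest" stands in for a fully preallocated WAL segment */
	fd = open("waltest", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* ... and an aligned offset and length */
	if (pwrite(fd, buf, BLCKSZ, 0) != BLCKSZ)
		return 1;

	close(fd);
	free(buf);
	return 0;
}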

The advantage of using O_DIRECT is that no fsync/fdatasync() call is
needed. All of the other wal_sync_method options use fsync/fdatasync(),
either explicitly or implicitly (via the O_SYNC and O_DSYNC open flags).
fsync/fdatasync can be very slow on ext3, because it seems to always
have to wait for the current filesystem meta-data transaction to complete,
even if that meta-data operation is completely unrelated to the file
being fsync'ed. Since there can be many metadata operations happening on
the data files, the fsync of the WAL log can end up waiting on that
unrelated activity. Because O_DIRECT does not do any fsync/fdatasync
operation, it avoids this bottleneck and can finish more quickly on
average. The open_sync and open_dsync options do not have this benefit,
because they do the equivalent of an fsync/fdatasync after every WAL write.

For the open_sync and open_dsync options, O_DIRECT is used for writes
only if the xlog will not need to be consumed by the archiver or
hot standby. I am not keying the open_direct behavior on whether
XLogIsNeeded() is true, because we see a performance gain even when
archiving is enabled (using a simple script that copies and compresses
the log segments). For a 2-processor, 50-warehouse DBT2 run on SLES 11,
I get the following NOTPM results:

wal_sync_method:   fdatasync   open_direct   open_sync

archiving off:     17076       18481         17094
archiving on:      15704       16923         15898

Do folks have any interest in this change, or comments on its
usefulness/correctness? It would be just an extra option for
wal_sync_method that users can try out and has benefits for certain
configurations.

Dan

Attachments:

waldirect.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..a830a01 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -122,6 +122,7 @@ const struct config_enum_entry sync_method_options[] = {
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
 #endif
+	{"open_direct", SYNC_METHOD_OPEN_DIRECT, false},
 	{NULL, 0, false}
 };
 
@@ -1925,7 +1926,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 * fsync more than one file.
 		 */
 		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
+			sync_method != SYNC_METHOD_OPEN_DSYNC &&
+			sync_method != SYNC_METHOD_OPEN_DIRECT)
 		{
 			if (openLogFile >= 0 &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
@@ -8958,6 +8960,15 @@ get_sync_bit(int method)
 		case SYNC_METHOD_OPEN_DSYNC:
 			return OPEN_DATASYNC_FLAG | o_direct_flag;
 #endif
+		case SYNC_METHOD_OPEN_DIRECT:
+			/*
+			 * Open the log with O_DIRECT flag only.  O_DIRECT guarantees
+			 * that data is written to disk when the IO completes if and
+			 * only if the file is fully allocated.  Fortunately, the log
+			 * files are always fully allocated by XLogFileInit() (or are
+			 * recycled from a fully-allocated log).
+			 */
+			return O_DIRECT;
 		default:
 			/* can't happen (unless we are out of sync with option array) */
 			elog(ERROR, "unrecognized wal_sync_method: %d", method);
@@ -9031,6 +9042,7 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
+		case SYNC_METHOD_OPEN_DIRECT:
 			/* write synced it already */
 			break;
 		default:
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..97acde5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -564,3 +564,4 @@
 #------------------------------------------------------------------------------
 
 # Add settings for extensions here
+wal_sync_method = open_direct
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..b888ee7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -83,6 +83,7 @@ typedef struct XLogRecord
 #define SYNC_METHOD_OPEN		2		/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4		/* for O_DSYNC */
+#define SYNC_METHOD_OPEN_DIRECT	5		/* for O_DIRECT */
 extern int	sync_method;
 
 /*
#2 Andres Freund
andres@anarazel.de
In reply to: Dan Scales (#1)
Re: possible new option for wal_sync_method

Hi,

On Thursday, February 16, 2012 06:18:23 PM Dan Scales wrote:

> When running Postgres on a single ext3 filesystem on Linux, we find that
> the attached simple patch gives a significant performance benefit (7-8% in
> the numbers below). The patch adds a new option for wal_sync_method, which
> is "open_direct". With this option, the WAL is always opened with
> O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
> O_DIRECT should be correct. All WAL logs are fully allocated before
> being used, and the WAL buffers are 8K-aligned, so all direct writes are
> guaranteed to complete (reach the device) before the write returns. (See
> http://lwn.net/Articles/348739/)

I don't think that behaviour is safe in the face of write caches in the IO
path. Linux takes care to issue flush/barrier instructions when necessary if
you issue an fsync/fdatasync, but to my knowledge it does not when O_DIRECT is
used (that would suck performance-wise).
I think that behaviour is safe if you have no externally visible write caching
enabled, but that's not exactly easy knowledge to get or document.
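
[Editor's note: a minimal sketch of the safer combination being implied here,
illustrative and not from the patch: pair each O_DIRECT write with an explicit
fdatasync(), since the fdatasync() is what makes the kernel issue the device
cache flush.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * O_DIRECT bypasses the page cache but, on Linux, does not flush a
 * volatile write cache on the device.  The fdatasync() after the write
 * restores that guarantee.  fd, buf, len and offset are assumed to
 * satisfy O_DIRECT's alignment rules.
 */
static int
direct_write_durable(int fd, const void *buf, size_t len, off_t offset)
{
	if (pwrite(fd, buf, len, offset) != (ssize_t) len)
		return -1;
	return fdatasync(fd);	/* issues the flush that bare O_DIRECT skips */
}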

Why should there otherwise be any performance difference between O_DIRECT|
O_SYNC and O_DIRECT in the WAL write case? There is no metadata that needs to
be written, and I have a hard time imagining that the check for whether there
is metadata is that expensive.

I guess a more interesting case would be comparing O_DIRECT|O_SYNC with
O_DIRECT + fdatasync() or even O_DIRECT +
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER).
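
[Editor's note: for reference, a sketch of that sync_file_range() variant,
illustrative only; note the later finding in this thread that this call does
not flush the device write cache.]

#define _GNU_SOURCE		/* for sync_file_range() */
#include <fcntl.h>

/*
 * Write out the given dirty range and wait for completion.  Per the
 * later discussion in this thread, sync_file_range() does NOT issue a
 * device cache flush, so by itself it is not a durability guarantee
 * when a volatile write cache is present.
 */
static int
flush_wal_range(int fd, off_t offset, off_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}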

Any special reason you did that comparison on ext3? Especially with
data=ordered, its behaviour regarding syncs is pretty insane performance-wise.
Ext4 would be a bit more interesting...

Andres

#3 Marti Raudsepp
marti@juffo.org
In reply to: Dan Scales (#1)
Re: possible new option for wal_sync_method

On Thu, Feb 16, 2012 at 19:18, Dan Scales <scales@vmware.com> wrote:

> fsync/fdatasync can be very slow on ext3, because it seems to always
> have to wait for the current filesystem meta-data transaction to complete,
> even if that meta-data operation is completely unrelated to the file
> being fsync'ed.

Use the data=writeback mount option to remove this restriction. This
is actually the suggested setting for PostgreSQL file systems:
http://www.postgresql.org/docs/current/static/wal-intro.html

(Note that this is unsafe for some other applications, so I wouldn't
use it on the root file system)

Regards,
Marti

#4 Josh Berkus
josh@agliodbs.com
In reply to: Dan Scales (#1)
Re: possible new option for wal_sync_method

On 2/16/12 9:18 AM, Dan Scales wrote:

> Do folks have any interest in this change, or comments on its
> usefulness/correctness? It would be just an extra option for
> wal_sync_method that users can try out and has benefits for certain
> configurations.

Does it have any benefit on Ext4/XFS/Btrfs?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#5 Dan Scales
scales@vmware.com
In reply to: Andres Freund (#2)
Re: possible new option for wal_sync_method

Good point, thanks. From the ext3 source code, it looks like
ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
block device, whereas simple direct IO does not. So, that would make
this wal_sync_method option less useful, since, as you say, the user
would have to know if the block device is doing write caching.

For the numbers I reported, I don't think the performance gain comes from
skipping the block device flush. The system being measured is a Fibre
Channel disk array which should have a fully non-volatile cache. And
measurements using systemtap show that blkdev_issue_flush() always completes
in the microsecond range.

I think the overhead is still from the fact that ext3_sync_file() waits
for the current in-flight transaction if there is one (and does an
explicit device flush if there is no transaction to wait for). I do
think there are lots of meta-data operations happening on the data files
(especially for a growing database), so the WAL log commit is waiting for
unrelated data operations. It would be nice if there were a simple file
system operation that just flushed the cache of the block device
containing the filesystem (i.e. did just the blkdev_issue_flush() and
not the other things in ext3_sync_file()).
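
[Editor's note: there is no dedicated syscall for exactly that, but a rough
approximation, sketched here with the caveat that the behaviour varies across
kernel versions, is to fsync() the block device node itself, which on recent
Linux kernels reaches blkdev_issue_flush() with no filesystem journal
involved. The device path is a placeholder.]

#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch: flush the write cache of the device backing the filesystem
 * by fsync()ing the block device node directly.  "/dev/sdb" below is a
 * placeholder for whatever device holds the WAL; on recent kernels this
 * reaches blkdev_issue_flush() without any journal work.
 */
static int
flush_device_cache(const char *devpath)	/* e.g. "/dev/sdb" */
{
	int		fd = open(devpath, O_WRONLY);
	int		rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}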

The ext4_sync_file() code looks fairly similar, so I think it may have
the same problem, though I can't be positive. In that case, this
wal_sync_method option might help ext4 as well.

With respect to sync_file_range(), the Linux code that I'm looking at
doesn't really seem to indicate that there is a device flush (since it
never calls an f_op->fsync_file operation). So sync_file_range() may
not be as useful as thought.

By the way, all the numbers were measured with the "data=writeback,
barrier=1" options for ext3. I don't think I have seen a significant
difference with the DBT2 workload under the ext3 option data=ordered.

I will measure all these numbers again tonight, but with barrier=0, so as
to try to confirm that the write flush itself isn't costing a lot for
this configuration.

Dan

#6 Andres Freund
andres@anarazel.de
In reply to: Dan Scales (#5)
Re: possible new option for wal_sync_method

Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:

> Good point, thanks. From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not. So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.

The experiments I know of which played with disabling write caches nearly
always had the result that write caching was worth the overhead of syncing.

> For the numbers I reported, I don't think the performance gain comes from
> skipping the block device flush. The system being measured is a Fibre
> Channel disk array which should have a fully non-volatile cache. And
> measurements using systemtap show that blkdev_issue_flush() always completes
> in the microsecond range.

Well, I think it has some IO queue implications which could explain some of
the difference. In that regard I think it heavily depends on the kernel
version, as that's an area which has seen loads of pretty radical changes in
nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for). I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations. It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. did just the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).

I think you are right there. I think the metadata issue could be relieved a
lot by growing files in much larger increments than we do currently. I have
seen profiles which indicated that lots of time was spent on increasing the
file size. I would be very interested in seeing how much changes in that
area would benefit real-world benchmarks.
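
[Editor's note: a sketch of the kind of change described here, purely
illustrative; the chunk size and helper are hypothetical, not from any posted
patch. The idea is to extend data files in large preallocated chunks so the
filesystem does one big metadata update instead of many small ones.]

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define EXTEND_CHUNK (16 * 1024 * 1024)		/* hypothetical 16 MB chunk */

/*
 * Grow the file to at least min_size bytes, rounding the allocation up
 * to the next EXTEND_CHUNK boundary.  Returns 0 on success, nonzero on
 * failure (posix_fallocate() itself returns an errno value).
 */
static int
extend_file(int fd, off_t min_size)
{
	off_t		target;
	struct stat st;

	target = ((min_size + EXTEND_CHUNK - 1) / EXTEND_CHUNK) * EXTEND_CHUNK;
	if (fstat(fd, &st) != 0)
		return -1;
	if (st.st_size >= target)
		return 0;				/* already large enough */
	return posix_fallocate(fd, st.st_size, target - st.st_size);
}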

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive. In that case, this
> wal_sync_method option might help ext4 as well.

The journaling code for ext4 is significantly different, so I think it very
well might play a role here - although you're probably right and it won't be
in *_sync_file.

> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls an f_op->fsync_file operation). So sync_file_range() may
> not be as useful as thought.

Hm, need to check that. I thought it invoked that path somewhere.

> By the way, all the numbers were measured with the "data=writeback,
> barrier=1" options for ext3. I don't think I have seen a significant
> difference with the DBT2 workload under the ext3 option data=ordered.

You have not? Interesting again, because I have seen results that differed
by an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.

Got any result so far?

Thanks,

Andres

#7 Dan Scales
scales@vmware.com
In reply to: Andres Freund (#6)
Re: possible new option for wal_sync_method

Hi,

> Got any result so far?

I measured the results with barrier=0, and yes, you are correct -- it seems that most of the benefit of the open_direct wal_sync_method is probably from not doing the barrier operation at the end of fsync():

wal_sync_method:         fdatasync   open_direct   open_sync

no archive, barrier=1:   17309       18507         17138
no archive, barrier=0:   17771       18369         18045
archive,    barrier=1:   15789       16592         15645
archive,    barrier=0:   16616       16785         16547

It took me a while to look through Linux and understand why barrier=1 had
such an effect, even for disks with battery-backed caches. As you pointed
out, the barrier operation not only flushes the disk cache, but also has
some queue implications, particularly for Linux releases below 2.6.37. I've
been using 2.6.32, and in that case the barrier at the end of fsync requires
that all previously-queued operations be finished before the barrier occurs
and flushes the disk cache. This means that each fsync of the WAL log is
likely waiting for completely unrelated in-flight operations on the data
files. That is why getting rid of the fsync of the WAL log is such a good
performance win, even for disks that don't need a disk cache flush (because
the cache is battery-backed). This option will probably have less benefit
for Linux 2.6.37 and above, where barriers are eliminated and operations
are expressed more explicitly in terms of disk cache flushes.

fsync() on ext3 (even for Linux 2.6.37 and above) does still wait for any outstanding meta-data transaction to commit. So, there is still another
reason to put the WAL log and data files on different logical disks (even if backed by the same physical disk).

It does still seem to me that sync_file_range() is unsafe in the case of
non-battery-backed disk write caches, since it doesn't sync the disk cache.
However, if sync_file_range() were being used to optimize checkpoint fsyncs,
then one final fsync() of an unused file on the same block device would do
the trick of flushing the disk cache.
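
[Editor's note: a sketch of that trick, illustrative only; the marker file
name is hypothetical. After the sync_file_range() calls for the checkpoint,
fsync() a dummy file on the same block device so the kernel issues the device
cache flush once.]

#define _GNU_SOURCE		/* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch of the trick described above.  sync_file_range() writes out
 * the data but does not flush the device write cache, so follow it with
 * one fsync() of a dummy file ("flush_marker", hypothetical) living on
 * the same block device to force the cache flush.
 */
static int
checkpoint_range_durable(int data_fd, off_t offset, off_t nbytes)
{
	int		marker_fd;
	int		rc;

	if (sync_file_range(data_fd, offset, nbytes,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER) != 0)
		return -1;

	marker_fd = open("flush_marker", O_WRONLY | O_CREAT, 0600);
	if (marker_fd < 0)
		return -1;
	rc = fsync(marker_fd);		/* forces the device cache flush */
	close(marker_fd);
	return rc;
}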

Dan
