Direct I/O issues

Started by Greg Smith over 19 years ago, 13 messages, lists: hackers, docs
#1 Greg Smith
gsmith@gregsmith.com
hackers

I've been trying to optimize a Linux system where benchmarking suggests
large performance differences between the various wal_sync_method options
(with o_sync being the big winner). I started that by using
src/tools/fsync/test_fsync to get an idea what I was dealing with (and to
spot which drives had write caching turned on). Since those results
didn't match what I was seeing in the benchmarks, I've been browsing the
backend source to figure out why. I noticed test_fsync appears to be,
ahem, out of sync with what the engine is doing.

It looks like V8.1 introduced O_DIRECT writes to the WAL, determined at
compile time by a series of preprocessor tests in
src/backend/access/transam/xlog.c. When O_DIRECT is available,
O_SYNC/O_FSYNC/O_DSYNC writes use it; test_fsync doesn't do that.

I moved the new code (in 8.2 beta 3, lines 61-92 in xlog.c) into
test_fsync; all the flags had the same name so it dropped right in. You
can get the version I made at http://www.westnet.com/~gsmith/test_fsync.c
(fixed a compiler warning, too)

The results I get now look fishy. I'm not sure if I screwed up a step, or
if I'm seeing a real problem. The system here is running RedHat Linux,
RHEL ES 4.0 kernel 2.6.9, and the disk I'm writing to is a standard
7200RPM IDE drive. I turned off write caching with hdparm -W 0.

Here's an excerpt from the stock test_fsync:

Compare one o_sync write to two:
one 16k o_sync write 8.717944
two 8k o_sync writes 17.501980

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 17.018495
write, fdatasync 8.842473
write, fsync, 8.809117

And here's the version I tried to modify to include O_DIRECT support:

Compare one o_sync write to two:
one 16k o_sync write 0.004995
two 8k o_sync writes 0.003027

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.004978
write, fdatasync 8.845498
write, fsync, 8.834037

Obviously the o_sync writes aren't waiting for the disk. Is this a
problem with O_DIRECT under Linux? Or is my code just not correctly
testing this behavior?

Just as a sanity check, I did try this on another system, running SuSE
with drives connected to a cciss SCSI device, and I got exactly the same
results. I'm concerned that Linux users who use O_SYNC because they
notice it's faster will be losing their WAL integrity without being aware
of the problem, especially as the whole O_DIRECT business isn't even
mentioned in the WAL documentation--it really deserves to be brought up in
the wal_sync_method notes at
http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

And while I'm mentioning improvements to that particular documentation
page...the wal_buffers notes there are so sparse they misled me initially.
They suggest only bumping it up for situations with very large
transactions; since I was testing with small ones I left it woefully
undersized initially. I would suggest copying the text from
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html to
here: "When full_page_writes is set and the system is very busy, setting
this value higher will help smooth response times during the period
immediately following each checkpoint." That seems to match what I found
in testing.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#2 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#1)
hackers
Re: [PERFORM] Direct I/O issues

I have applied your test_fsync patch for 8.2. Thanks.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/diff (text/x-diff, +28 -25)

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#1)
hackers
Re: Direct I/O issues

Greg Smith <gsmith@gregsmith.com> writes:

The results I get now look fishy.

There are at least two things wrong with this program:

* It does not respect the alignment requirement for O_DIRECT buffers
(reportedly either 512 or 4096 bytes depending on filesystem).

* It does not check for errors (if it had, you might have realized the
other problem).

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
hackers
Re: [PERFORM] Direct I/O issues

Bruce Momjian <bruce@momjian.us> writes:

I have applied your test_fsync patch for 8.2. Thanks.

... which means test_fsync is now broken. Why did you apply a patch
when the author pointed out that the program isn't working?

regards, tom lane

#5 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
hackers
Re: [PERFORM] Direct I/O issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I have applied your test_fsync patch for 8.2. Thanks.

... which means test_fsync is now broken. Why did you apply a patch
when the author pointed out that the program isn't working?

I thought his code was OK, but the OS had issues. Clearly we need to
update test_fsync.c because it doesn't match the code. I have reverted
the patch but some day we need a fixed version.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#6 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#3)
hackers
Re: Direct I/O issues

On Thu, 23 Nov 2006, Tom Lane wrote:

* It does not check for errors (if it had, you might have realized the
other problem).

All the test_fsync code needs to check for errors better; there have been
multiple occasions where I've run that with questionable input and it
didn't complain, it just happily ran and reported times that were almost
0.

Thanks for the note about alignment; I had seen something about that in
xlog.c but wasn't sure if that was important in this case.

It's very important to the project I'm working on that I get this cleared
up, and I think I'm in a good position to fix it myself now. I just
wanted to report the issue and get some initial feedback on what's wrong.
I'll try to rewrite that code with an eye toward the "Determine optimal
fdatasync/fsync, O_SYNC/O_DSYNC options" to-do item, which is what I'd
really like to have.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#7 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#6)
hackers
Re: Direct I/O issues

Greg Smith wrote:

On Thu, 23 Nov 2006, Tom Lane wrote:

* It does not check for errors (if it had, you might have realized the
other problem).

All the test_fsync code needs to check for errors better; there have been
multiple occasions where I've run that with questionable input and it
didn't complain, it just happily ran and reported times that were almost
0.

Thanks for the note about alignment; I had seen something about that in
xlog.c but wasn't sure if that was important in this case.

It's very important to the project I'm working on that I get this cleared
up, and I think I'm in a good position to fix it myself now. I just
wanted to report the issue and get some initial feedback on what's wrong.
I'll try to rewrite that code with an eye toward the "Determine optimal
fdatasync/fsync, O_SYNC/O_DSYNC options" to-do item, which is what I'd
really like to have.

Please send an updated patch for test_fsync.c so we can get it working
for 8.2.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#8 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#6)
hackers
Re: [PERFORM] Direct I/O issues

Greg Smith wrote:

On Thu, 23 Nov 2006, Tom Lane wrote:

* It does not check for errors (if it had, you might have realized the
other problem).

All the test_fsync code needs to check for errors better; there have been
multiple occasions where I've run that with questionable input and it
didn't complain, it just happily ran and reported times that were almost
0.

Thanks for the note about alignment; I had seen something about that in
xlog.c but wasn't sure if that was important in this case.

It's very important to the project I'm working on that I get this cleared
up, and I think I'm in a good position to fix it myself now. I just
wanted to report the issue and get some initial feedback on what's wrong.
I'll try to rewrite that code with an eye toward the "Determine optimal
fdatasync/fsync, O_SYNC/O_DSYNC options" to-do item, which is what I'd
really like to have.

I have developed a patch that moves the defines into an include file
where they can be used by both the backend and test_fsync.c. I have also
set things up so there is proper alignment for O_DIRECT, and added error
checking.

Not sure if people want this for 8.2. I think we can modify
test_fsync.c anytime but the movement of the defines into an include
file is a backend code change.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/pgpatches/fsync (text/x-diff, +181 -179)

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#8)
hackers
Re: [PERFORM] Direct I/O issues

Bruce Momjian <bruce@momjian.us> writes:

Not sure if people want this for 8.2. I think we can modify
test_fsync.c anytime but the movement of the defines into an include
file is a backend code change.

I think fooling with this on the day before RC1 is an unreasonable risk ...
and I disapprove of moving this code into a widely-used include file
like xlog.h, too.

regards, tom lane

#10 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#9)
hackers
Re: [HACKERS] [PERFORM] Direct I/O issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Not sure if people want this for 8.2. I think we can modify
test_fsync.c anytime but the movement of the defines into an include
file is a backend code change.

I think fooling with this on the day before RC1 is an unreasonable risk ...
and I disapprove of moving this code into a widely-used include file
like xlog.h, too.

OK, you want a separate include or xlog_internal.h? And should I put in
just the test_fsync changes next week so at least we are closer to
having it work for 8.2?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#10)
hackers, docs
Re: [PATCHES] [PERFORM] Direct I/O issues

On Fri, 24 Nov 2006, Bruce Momjian wrote:

OK, I modified test_fsync.c by copying the defines from xlog.c, fixed
the O_DIRECT alignment, and added checks on write()/fsync().

I just tested your new test_fsync as included in 8.2rc1, and it's
working perfectly for me now on Linux. All the O_SYNC writes using
O_DIRECT are reporting realistic timings. I'm happy that this code is
working as it should and appreciate the quick response. I still think the
wal_sync_method documentation deserves an update noting that O_DIRECT is
used when available with the sync write methods.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#12 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#11)
hackers, docs
Re: [HACKERS] [PATCHES] [PERFORM] Direct I/O issues

Greg Smith wrote:

On Fri, 24 Nov 2006, Bruce Momjian wrote:

OK, I modified test_fsync.c by copying the defines from xlog.c, fixed
the O_DIRECT alignment, and added checks on write()/fsync().

I just tested your new test_fsync as included in 8.2rc1, and it's
working perfectly for me now on Linux. All the O_SYNC writes using
O_DIRECT are reporting realistic timings. I'm happy that this code is
working as it should and appreciate the quick response. I still think the
wal_sync_method documentation deserves an update noting that O_DIRECT is
used when available with the sync write methods.

O_DIRECT mention added, and backpatched to 8.2.X.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/diff (text/x-diff, +1 -0)

#13 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#9)
hackers
Re: [PERFORM] Direct I/O issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Not sure if people want this for 8.2. I think we can modify
test_fsync.c anytime but the movement of the defines into an include
file is a backend code change.

I think fooling with this on the day before RC1 is an unreasonable risk ...
and I disapprove of moving this code into a widely-used include file
like xlog.h, too.

fsync method defines moved to /include/access/xlogdefs.h so they can be
used by test_fsync.c.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/diff (text/x-diff, +73 -149)