Allowing WAL fsync to be done via O_SYNC
Based on the tests we did last week, it seems clear that on many
platforms it's a win to sync the WAL log by writing it with open()
option O_SYNC (or O_DSYNC where available) rather than issuing explicit
fsync() (resp. fdatasync()) calls. In theory fsync ought to be faster,
but it seems that too many kernels have inefficient implementations of
fsync.
I think we need to make both O_SYNC and fsync() choices available in
7.1. Two important questions need to be settled:
1. Is a compile-time flag (in config.h.in) good enough, or do we need
to make it configurable via a GUC variable? (A variable would have to
be postmaster-start-time changeable only, so you'd still need a
postmaster restart to change it.)
2. Which way should be the default?
There's also the lesser question of what to call the config symbol
or variable.
My inclination is to go with a compile-time flag named USE_FSYNC_FOR_WAL
and have the default be off (ie, use O_SYNC by default) but I'm not
strongly set on that. Opinions anyone?
In any case the code should automatically prefer O_DSYNC over O_SYNC if
available, and should prefer fdatasync() over fsync() if available;
I doubt we need to provide a knob to alter those choices.
BTW, are there any platforms where O_DSYNC exists but has a different
spelling?
regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010315 09:35] wrote:
BTW, are there any platforms where O_DSYNC exists but has a different
spelling?
Yes, FreeBSD only has: O_FSYNC
it doesn't have O_SYNC nor O_DSYNC.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes:
* Tom Lane <tgl@sss.pgh.pa.us> [010315 09:35] wrote:
BTW, are there any platforms where O_DSYNC exists but has a different
spelling?
Yes, FreeBSD only has: O_FSYNC
it doesn't have O_SYNC nor O_DSYNC.
Okay ... we can fall back to O_FSYNC if we don't see either of the
others. No problem. Any other weird cases out there? I think Andreas
might've muttered something about AIX but I'm not sure now.
regards, tom lane
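A minimal sketch of the fallback chain discussed above (prefer O_DSYNC, then O_SYNC, then FreeBSD's O_FSYNC, and only then explicit fsync()). USE_FSYNC_FOR_WAL is the symbol proposed in the thread; WAL_OPEN_SYNC_FLAG, wal_needs_explicit_fsync() and wal_open() are hypothetical names used only for illustration, not anything from xlog.c.

```c
#include <fcntl.h>

/*
 * Pick the open() flag for synchronous WAL writes at compile time.
 * WAL_OPEN_SYNC_FLAG is a hypothetical name. If the platform offers
 * none of the flags, fall back to explicit fsync() calls.
 */
#if defined(O_DSYNC)
#define WAL_OPEN_SYNC_FLAG O_DSYNC
#elif defined(O_SYNC)
#define WAL_OPEN_SYNC_FLAG O_SYNC
#elif defined(O_FSYNC)
#define WAL_OPEN_SYNC_FLAG O_FSYNC
#else
#define WAL_OPEN_SYNC_FLAG 0
#define USE_FSYNC_FOR_WAL 1
#endif

/* Returns 1 when the caller must still issue fsync() after writing. */
int
wal_needs_explicit_fsync(void)
{
#ifdef USE_FSYNC_FOR_WAL
    return 1;
#else
    return 0;
#endif
}

/* Open a WAL segment for writing with whichever flag was selected. */
int
wal_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | WAL_OPEN_SYNC_FLAG, 0600);
}
```

A caller would write through the descriptor returned by wal_open() and call fsync() only when wal_needs_explicit_fsync() returns 1.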
Peter Eisentraut <peter_e@gmx.net> writes:
As a general rule, if something can be a run time option, as opposed to a
compile time option, then it should be. At the very least you keep the
installation simple and allow for easier experimenting.
I've been mentally working through the code, and see only one reason why
it might be necessary to go with a compile-time choice: suppose we see
that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the
compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on.
If it's a GUC variable then we need a way to prevent the GUC option from
becoming unset (which would disable the fsync() calls, leaving nothing
to replace 'em). Doable, perhaps, but seems kind of ugly ... any
thoughts about that?
regards, tom lane
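One way the run-time variant could guard against that case, sketched under assumptions: the setting is modelled as a string with the values "fsync" and "open_sync", and the assign function refuses "open_sync" when no O_* flag exists. All names here (WalSyncMethod, HAVE_WAL_OPEN_SYNC, wal_sync_method_assign) are hypothetical and not part of the GUC code.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; the thread does not settle on any of these. */
typedef enum { WAL_SYNC_FSYNC, WAL_SYNC_OPEN } WalSyncMethod;

#if defined(O_DSYNC) || defined(O_SYNC) || defined(O_FSYNC)
#define HAVE_WAL_OPEN_SYNC 1
#else
#define HAVE_WAL_OPEN_SYNC 0
#endif

/*
 * Parse a run-time setting. Returns false and leaves *method untouched
 * when the value is unknown, or when it asks for open-sync on a platform
 * that has no such flag. That way fsync() is never switched off without
 * something to replace it.
 */
bool
wal_sync_method_assign(const char *value, WalSyncMethod *method)
{
    if (strcmp(value, "fsync") == 0)
    {
        *method = WAL_SYNC_FSYNC;
        return true;
    }
    if (strcmp(value, "open_sync") == 0 && HAVE_WAL_OPEN_SYNC)
    {
        *method = WAL_SYNC_OPEN;
        return true;
    }
    return false;
}
```

Rejecting the value at assignment time keeps the check in one place instead of spreading #ifdef tests through the write path.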
Tom Lane writes:
I think we need to make both O_SYNC and fsync() choices available in
7.1. Two important questions need to be settled:
1. Is a compile-time flag (in config.h.in) good enough, or do we need
to make it configurable via a GUC variable? (A variable would have to
be postmaster-start-time changeable only, so you'd still need a
postmaster restart to change it.)
As a general rule, if something can be a run time option, as opposed to a
compile time option, then it should be. At the very least you keep the
installation simple and allow for easier experimenting.
There's also the lesser question of what to call the config symbol
or variable.
I suggest "wal_use_fsync" as a GUC variable, assuming the default would be
off. Otherwise "wal_use_open_sync". (Use a general-to-specific naming
scheme to allow for easier grouping. Having defaults be "off"
consistently is more intuitive.)
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Based on the tests we did last week, it seems clear that on many
platforms it's a win to sync the WAL log by writing it with open()
option O_SYNC (or O_DSYNC where available) rather than
issuing explicit fsync() (resp. fdatasync()) calls.
I don't remember big difference in using fsync or O_SYNC in tfsync
tests. Both depend on block size and keeping in mind that fsync
allows us syncing after writing *multiple* blocks I would either
use fsync as default or don't deal with O_SYNC at all.
But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should
use O_DSYNC by default.
(BTW, we didn't compare fdatasync and O_SYNC yet).
Vadim
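The multiple-block argument can be illustrated with a short sketch: several blocks are written and then made durable with a single fsync(), whereas a descriptor opened with O_SYNC would wait for the disk on every write(). BLCKSZ is an assumed block size and write_blocks_then_fsync() is a hypothetical helper, not code from the tfsync tests.

```c
#include <stddef.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed block size, for illustration */

/*
 * Write nblocks blocks from buf and make them durable with ONE fsync()
 * at the end. On an O_SYNC descriptor each write() below would instead
 * wait for stable storage on its own: nblocks waits rather than one.
 * Returns 0 on success, -1 on a short write or any error.
 */
int
write_blocks_then_fsync(int fd, const char *buf, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++)
    {
        if (write(fd, buf + i * BLCKSZ, BLCKSZ) != BLCKSZ)
            return -1;
    }
    return fsync(fd);
}
```

Whether one fsync() actually beats several synchronous writes depends on the kernel, which is the open question in this thread.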
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
... I would either
use fsync as default or don't deal with O_SYNC at all.
But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should
use O_DSYNC by default.
Hm. We could do that reasonably painlessly as a compile-time test in
xlog.c, but I'm not clear on how it would play out as a GUC option.
Peter, what do you think about configuration-dependent defaults for
GUC variables?
regards, tom lane
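A rough sketch of what such a compile-time test might look like, assuming O_DSYNC and O_SYNC expand to plain integer constants that the preprocessor can compare (not guaranteed on every platform). DEFAULT_WAL_USES_OPEN_DSYNC and default_wal_open_flags() are hypothetical names.

```c
#include <fcntl.h>

/*
 * Vadim's rule: default to O_DSYNC only when it is defined AND differs
 * from O_SYNC. Otherwise keep explicit fsync()/fdatasync() as default.
 */
#if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
#define DEFAULT_WAL_USES_OPEN_DSYNC 1
#else
#define DEFAULT_WAL_USES_OPEN_DSYNC 0
#endif

/* Extra open() flags implied by the default; 0 means "sync explicitly". */
int
default_wal_open_flags(void)
{
#if DEFAULT_WAL_USES_OPEN_DSYNC
    return O_DSYNC;
#else
    return 0;
#endif
}
```

A run-time variable could use this result as its boot-time default, which is one possible reading of "configuration-dependent defaults".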
* Tom Lane <tgl@sss.pgh.pa.us> [010315 11:07] wrote:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
... I would either
use fsync as default or don't deal with O_SYNC at all.
But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should
use O_DSYNC by default.
Hm. We could do that reasonably painlessly as a compile-time test in
xlog.c, but I'm not clear on how it would play out as a GUC option.
Peter, what do you think about configuration-dependent defaults for
GUC variables?
Sorry, what's a GUC? :)
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein wrote:
* Tom Lane <tgl@sss.pgh.pa.us> [010315 11:07] wrote:
Peter, what do you think about configuration-dependent defaults for
GUC variables?
Sorry, what's a GUC? :)
Grand Unified Configuration, Peter E.'s baby.
See the thread starting at
http://www.postgresql.org/mhonarc/pgsql-hackers/2000-03/msg00107.html
for details.
(And the search is working.... :-)).
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
* Peter Eisentraut <peter_e@gmx.net> [010315 11:33] wrote:
Alfred Perlstein writes:
Sorry, what's a GUC? :)
Grand Unified Configuration system
It's basically a cute name for the achievement that there's now a single
name space and interface for (almost) all postmaster run time
configuration variables,
Oh, thanks.
Well considering that, a runtime check for doing_sync_wal_writes
== 1 shouldn't be that expensive. Sort of the inverse of -F,
meaning that we're using O_SYNC for WAL writes, we don't need to
fsync it.
Btw, if you guys want to get some speed with WAL, I'd implement a
write-behind process if it was possible to do the O_SYNC writes.
...
And since we're sorta on the topic of IO, I noticed that it looks
like (at least in 7.0.3) that vacuum and certain other routines
read files in reverse order.
The problem (at least in FreeBSD) is that we haven't tuned
the system to detect reverse reading and hence don't do
much readahead. There may be some going on as a function
of the read clustering, but I'm not entirely sure.
I'd suspect that other OSs might have neglected to check
for reverse reading of files as well, but I'm not sure.
Basically, if there was a way to do this another way, or
anticipate the backwards motion and do large reads, it
may add latency, but it should improve performance.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein writes:
Sorry, what's a GUC? :)
Grand Unified Configuration system
It's basically a cute name for the achievement that there's now a single
name space and interface for (almost) all postmaster run time
configuration variables,
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Alfred Perlstein <bright@wintelcom.net> writes:
And since we're sorta on the topic of IO, I noticed that it looks
like (at least in 7.0.3) that vacuum and certain other routines
read files in reverse order.
Vacuum does that because it's trying to push tuples down from the end
into free space in earlier blocks. I don't see much way around that
(nor any good reason to think that it's a critical part of vacuum's
performance anyway). Where else have you seen such behavior?
regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010315 11:45] wrote:
Alfred Perlstein <bright@wintelcom.net> writes:
And since we're sorta on the topic of IO, I noticed that it looks
like (at least in 7.0.3) that vacuum and certain other routines
read files in reverse order.
Vacuum does that because it's trying to push tuples down from the end
into free space in earlier blocks. I don't see much way around that
(nor any good reason to think that it's a critical part of vacuum's
performance anyway). Where else have you seen such behavior?
Just vacuum, but the source is large, and I'm sort of lacking
on database-foo so I guessed that it may be done elsewhere.
You can optimize this out by implementing the read behind yourselves
sorta like this:
struct sglist *
read(fd, len)
{
        if (fd.lastpos - fd.curpos <= THRESHOLD) {
                fd.curpos = fd.lastpos - THRESHOLD;
                len = THRESHOLD;
        }
        return (do_read(fd, len));
}
of course this is entirely wrong, but illustrates what
would/could help.
I would fix FreeBSD, but it's sort of a mess and beyond what
I've got time to do ATM.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
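A runnable take on the read-behind idea, as a simplified variant rather than the pseudocode above: it does not detect the backward pattern, it simply reads a larger chunk that ends where the requested block ends, so a backward scan finds the next few blocks already in the buffer. READ_BEHIND, read_behind_start() and read_behind() are hypothetical names, and the 64 kB chunk size is an assumption.

```c
#include <sys/types.h>
#include <unistd.h>

#define READ_BEHIND (64 * 1024) /* assumed chunk size, not from the thread */

/*
 * Offset at which a READ_BEHIND-sized read must start so that it ENDS
 * at the end of the caller's block. Clamped to the start of the file.
 */
off_t
read_behind_start(off_t block_off, size_t block_len)
{
    off_t end = block_off + (off_t) block_len;

    return end > READ_BEHIND ? end - READ_BEHIND : 0;
}

/*
 * Fill buf (at least READ_BEHIND bytes) with the chunk that ends at the
 * requested block. Stores the chunk's file offset in *chunk_off and
 * returns the number of bytes read, or -1 on error.
 */
ssize_t
read_behind(int fd, off_t block_off, size_t block_len,
            char *buf, off_t *chunk_off)
{
    off_t  start = read_behind_start(block_off, block_len);
    size_t want = (size_t) (block_off + (off_t) block_len - start);

    *chunk_off = start;
    return pread(fd, buf, want, start);
}
```

As noted above, this trades some latency on the first request for fewer, larger reads during the backward scan.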
Based on the tests we did last week, it seems clear that on many
platforms it's a win to sync the WAL log by writing it with open()
option O_SYNC (or O_DSYNC where available) rather than issuing explicit
fsync() (resp. fdatasync()) calls. In theory fsync ought to be faster,
but it seems that too many kernels have inefficient implementations of
fsync.
Can someone explain why configure/platform-specific flags are allowed to
be added at this stage in the release, but my pgmonitor patch was
rejected?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Peter Eisentraut <peter_e@gmx.net> writes:
As a general rule, if something can be a run time option, as opposed to a
compile time option, then it should be. At the very least you keep the
installation simple and allow for easier experimenting.
I've been mentally working through the code, and see only one reason why
it might be necessary to go with a compile-time choice: suppose we see
that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the
compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on.
If it's a GUC variable then we need a way to prevent the GUC option from
becoming unset (which would disable the fsync() calls, leaving nothing
to replace 'em). Doable, perhaps, but seems kind of ugly ... any
thoughts about that?
I don't think having something be a run-time option is always a good idea.
Giving people too many choices is often confusing.
I think we should just check at compile time, and choose O_* if we have
it, and if not, use fsync(). No one will ever do the proper timing
tests to know which is better except us. Also, it seems O_* should be
faster because you are fsync'ing the buffer you just wrote, so there is
no looking around for dirty buffers like fsync().
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Can someone explain why configure/platform-specific flags are allowed to
be added at this stage in the release, but my pgmonitor patch was
rejected?
Possibly just because Marc hasn't stomped on me quite yet ;-)
However, I can actually make a case for this: we are flushing out
performance bugs in a new feature, ie WAL.
regards, tom lane
Based on the tests we did last week, it seems clear that on many
platforms it's a win to sync the WAL log by writing it with open()
option O_SYNC (or O_DSYNC where available) rather than
issuing explicit fsync() (resp. fdatasync()) calls.
I don't remember big difference in using fsync or O_SYNC in tfsync
tests. Both depend on block size and keeping in mind that fsync
allows us syncing after writing *multiple* blocks I would either
use fsync as default or don't deal with O_SYNC at all.
I see what you are saying: the OS may be faster at fsync'ing two
blocks in one operation than at doing two O_SYNC operations.
Seems we should just pick a default and leave the rest for a later
release. Marc wants RC1 tomorrow, I think.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Can someone explain why configure/platform-specific flags are allowed to
be added at this stage in the release, but my pgmonitor patch was
rejected?
Possibly just because Marc hasn't stomped on me quite yet ;-)
However, I can actually make a case for this: we are flushing out
performance bugs in a new feature, ie WAL.
You did a masterful job of making my pgmonitor patch sound like a debug
aid instead of a feature too. :-)
Have you considered a career in law? :-)
--
Bruce Momjian
I've been mentally working through the code, and see only one reason why
it might be necessary to go with a compile-time choice: suppose we see
that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the
compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on.
If it's a GUC variable then we need a way to prevent the GUC option from
becoming unset (which would disable the fsync() calls, leaving nothing
to replace 'em). Doable, perhaps, but seems kind of ugly ... any
thoughts about that?
I don't think having something be a run-time option is always a good idea.
Giving people too many choices is often confusing.
I think we should just check at compile time, and choose O_* if we have
it, and if not, use fsync(). No one will ever do the proper timing
tests to know which is better except us. Also, it seems O_* should be
faster because you are fsync'ing the buffer you just wrote, so there is
no looking around for dirty buffers like fsync().
I later read Vadim's comment that fsync() of two blocks may be faster
than two O_* writes, so I am now confused about the proper solution.
However, I think we need to pick one and make it invisible to the user.
Perhaps a compiler/config.h flag for testing would be a good solution.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I later read Vadim's comment that fsync() of two blocks may be faster
than two O_* writes, so I am now confused about the proper solution.
However, I think we need to pick one and make it invisible to the user.
Perhaps a compiler/config.h flag for testing would be a good solution.
I believe that we don't know enough yet to nail down a hard-wired
decision. Vadim's idea of preferring O_DSYNC if it appears to be
different from O_SYNC is a good first cut, but I think we'd better make
it possible to override that, at least for testing purposes.
So I think it should be configurable at *some* level. I don't much care
whether it's a config.h entry or a GUC variable.
But consider this: we'll be more likely to get some feedback from the
field (allowing us to refine the policy in future releases) if it is a
GUC variable. Not many people will build two versions of the software,
but people might take the trouble to play with a run-time configuration
setting.
regards, tom lane