win32 performance - fsync question

Started by E.Rodichev almost 21 years ago, 93 messages
#1E.Rodichev
er@sai.msu.su

Hi,

looking for a way to increase performance on a Windows XP box, I found
the parameters

#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync

I have no idea how this works on win32. May I try fsync = false, or is it
dangerous? Which wal_sync_method values may I try on WinXP?

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#2Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: E.Rodichev (#1)
Re: win32 performance - fsync question

looking for the way how to increase performance at Windows XP box, I found
the parameters

#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync

I have no idea how it works with win32. May I try fsync = false, or it is
dangerous? Which of wal_sync_method may I try at WinXP?

wal_sync_method does nothing on XP. Turning fsync off will tremendously
increase performance on writes, at the cost of possible data corruption
in the event of an unexpected server power-down.

The main performance difference between win32 and various unix systems
is that fsync() takes much longer on win32 than linux.

Merlin

#3Magnus Hagander
mha@sollentuna.net
In reply to: Merlin Moncure (#2)
Re: win32 performance - fsync question

Hi,

looking for the way how to increase performance at Windows XP
box, I found the parameters

#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync

I have no idea how it works with win32. May I try fsync =
false, or it is dangerous? Which of wal_sync_method may I try
at WinXP?

You can try it, but it is dangerous.
fsync is the correct wal_sync_method.

For some reason the syncing is quite a lot slower on win32. One reason
might be that it flushes metadata about the file as well, which I
believe at least Linux doesn't.

If it wasn't clear already, if you're running antivirus, try
uninstalling it. Note that you may need to uninstall it to get all
performance back, just disabling is often *not* enough as the kernel
driver is still loaded.

Things worth experimenting with (these are all untested, so please
report any successes):
1) Try reformatting with a cluster size of 8Kb (the pg page size), if
you can.
2) Disable the last access time (like noatime on linux). "fsutil
behavior set disablelastaccess 1"
3) Disable 8.3 filenames "fsutil behavior set disable8dot3 1"

2 and 3 may require a reboot.

(2 and 3 can be done on earlier windows through registry settings only,
in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem)

//Magnus

#4Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Magnus Hagander (#3)
Re: win32 performance - fsync question

Things worth experimenting with (these are all untested, so please
report any successes):
1) Try reformatting with a cluster size of 8Kb (the pg page size), if
you can.

What about recompiling pg with a 4k block size? Win32 file cluster
sizes and memory allocation units are both on 4k boundaries.

Merlin

#5Noname
lsunley@mb.sympatico.ca
In reply to: Merlin Moncure (#4)
Re: win32 performance - fsync question

In <4214B68C.8000901@dunslane.net>, on 02/17/05
at 10:21 AM, Andrew Dunstan <andrew@dunslane.net> said:

E.Rodichev wrote:

This problem is addressed by file system (fsck, journalling etc.).
Is it reasonable to handle it directly within application?

In the words of the Duke of Wellington, "If you believe that you'll
believe anything."

Please review past discussions on the mailing lists on this point.

BTW, most journalling file systems do not guarantee file integrity, only
file metadata integrity. In particular, I believe this is true of NTFS
(and whether it even does that has been debated).

So by all means turn off fsync if you want the performance gain *and*
you accept the risk. But if you do, don't come crying later that your
data has been lost or corrupted.

(the results are interesting, though - with fsync off Windows and Linux
are in the same performance ballpark.)

cheers

andrew

In anything I've done, Windows is very slow when you use fsync or the
Windows API equivalent.

If you need the performance, you had better have the machine hooked up to
a UPS (probably a good idea in any case) and set up something that is
triggered by the UPS running down to signal PostgreSQL to do an immediate
shutdown.

--
-----------------------------------------------------------
lsunley@mb.sympatico.ca
-----------------------------------------------------------

#6E.Rodichev
er@sai.msu.su
In reply to: Magnus Hagander (#3)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005, Magnus Hagander wrote:

Hi,

looking for the way how to increase performance at Windows XP
box, I found the parameters

#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync

I have no idea how it works with win32. May I try fsync =
false, or it is dangerous? Which of wal_sync_method may I try
at WinXP?

You can try it, but it is dangerous.
fsync is the correct wal_sync_method.

For some reason the syncing is quite a lot slower on win32. One reason
might be that it flushes metadata about the file as well, which I
believe at least Linux doesn't.

If it wasn't clear already, if you're running antivirus, try
uninstalling it. Note that you may need to uninstall it to get all
performance back, just disabling is often *not* enough as the kernel
driver is still loaded.

No, I don't have any resident disk-related stuff running.

Things worth experimenting with (these are all untested, so please
report any successes):
1) Try reformatting with a cluster size of 8Kb (the pg page size), if
you can.
2) Disable the last access time (like noatime on linux). "fsutil
behavior set disablelastaccess 1"
3) Disable 8.3 filenames "fsutil behavior set disable8dot3 1"

2 and 3 may require a reboot.

(2 and 3 can be done on earlier windows through registry settings only,
in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem)

I've repeated the test under 2 and 3 - no noticeable difference. With
disablelastaccess I got about 10% - 15% better results, but it is not
too significant.

Finally I tried

fsync = false

and got 580-620 tps. So, the short summary:

WinXP  fsync = true    20-28 tps
WinXP  fsync = false   600 tps
Linux                  800 tps

The general question is: does PostgreSQL really need fsync? I suppose it
is a design question, not a platform-specific one. It sounds like the only
scenario where fsync is useful is interprocess communication via an
open file. But PostgreSQL uses IPC for this, so is fsync really
required?

E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#7Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: E.Rodichev (#6)
Re: win32 performance - fsync question

The general question is - does PostgreSQL really need fsync? I suppose it
is a question for design, not platform-specific one. It sounds like only
one scenario, when fsync is useful, is to interprocess communication via
open file. But PostgreSQL utilize IPC for this, so does fsync is really
required?

NO!

Fsync is so that when your computer loses power without warning, you
will have no data loss.

If you turn it off, you run the risk of losing data if you lose power.

Chris

#8E.Rodichev
er@sai.msu.su
In reply to: Christopher Kings-Lynne (#7)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005, Christopher Kings-Lynne wrote:

The general question is - does PostgreSQL really need fsync? I suppose it
is a question for design, not platform-specific one. It sounds like only
one scenario, when fsync is useful, is to interprocess communication via
open file. But PostgreSQL utilize IPC for this, so does fsync is really
required?

NO!

Fsync is so that when your computer loses power without warning, you will
have no data loss.

If you turn it off, you run the risk of losing data if you lose power.

Chris

This problem is addressed by file system (fsck, journalling etc.).
Is it reasonable to handle it directly within application?

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#9D'Arcy J.M. Cain
darcy@druid.net
In reply to: E.Rodichev (#8)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005 17:54:38 +0300 (MSK)
"E.Rodichev" <er@sai.msu.su> wrote:

On Thu, 17 Feb 2005, Christopher Kings-Lynne wrote:

The general question is - does PostgreSQL really need fsync? I suppose it
is a question for design, not platform-specific one. It sounds like only
one scenario, when fsync is useful, is to interprocess communication via
open file. But PostgreSQL utilize IPC for this, so does fsync is really
required?

NO!

Fsync is so that when your computer loses power without warning, you
will have no data loss.

If you turn it off, you run the risk of losing data if you lose
power.

Chris

This problem is addressed by file system (fsck, journalling etc.).
Is it reasonable to handle it directly within application?

NO again!

Fsck only fixes up file system pointers after a crash. If the data did
not make it to the disk, no amount of fscking will put it there.

I'm not positive but I think that journalled file systems also need
fsync to guarantee that the information gets journalled but in any case,
journalling only helps if you have a journalled file system. Not
everyone does.

This is not to say that fsync is always required, just that it solves a
different problem than all those other tools.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

#10Doug McNaught
doug@mcnaught.org
In reply to: E.Rodichev (#8)
Re: win32 performance - fsync question

"E.Rodichev" <er@sai.msu.su> writes:

On Thu, 17 Feb 2005, Christopher Kings-Lynne wrote:

Fsync is so that when your computer loses power without warning, you
will have no data loss.

If you turn it off, you run the risk of losing data if you lose power.

Chris

This problem is addressed by file system (fsck, journalling etc.).
Is it reasonable to handle it directly within application?

No, it's not addressed by the file system. fsync() tells the OS to
make sure the data is on disk. Without that, the OS is free to just
keep the WAL data in memory cache, and a power failure could cause
data from committed transactions to be lost (we don't report commit
success until fsync() tells us the file data is on disk).

-Doug
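
The ordering described in this message (write the record, ask the OS to flush it, and only then report success) can be sketched in a few lines. This is a minimal illustration in Python, not PostgreSQL's actual code; the function name `durable_append` and the log file name are made up for the example.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> int:
    """Append a record and return only after the OS reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = os.write(fd, record)  # may only have reached the OS cache
        os.fsync(fd)                    # ask the OS to push it to the device
    finally:
        os.close(fd)
    # Only at this point would a caller report "committed" to the client.
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "wal.log")  # hypothetical file name
        durable_append(log, b"COMMIT 1\n")
        print(os.path.getsize(log))
```

If the `os.fsync` call is dropped, the function still returns the same value, but the record may exist only in memory when the power fails. Whether the device itself honours the flush depends on the drive and its write-cache settings.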

#11Andrew Dunstan
andrew@dunslane.net
In reply to: E.Rodichev (#8)
Re: win32 performance - fsync question

E.Rodichev wrote:

This problem is addressed by file system (fsck, journalling etc.).
Is it reasonable to handle it directly within application?

In the words of the Duke of Wellington, "If you believe that you'll
believe anything."

Please review past discussions on the mailing lists on this point.

BTW, most journalling file systems do not guarantee file integrity, only
file metadata integrity. In particular, I believe this is true of NTFS
(and whether it even does that has been debated).

So by all means turn off fsync if you want the performance gain *and*
you accept the risk. But if you do, don't come crying later that your
data has been lost or corrupted.

(the results are interesting, though - with fsync off Windows and Linux
are in the same performance ballpark.)

cheers

andrew

#12Magnus Hagander
mha@sollentuna.net
In reply to: Andrew Dunstan (#11)
Re: win32 performance - fsync question

So by all means turn off fsync if you want the performance gain *and*
you accept the risk. But if you do, don't come crying later that your
data has been lost or corrupted.

(the results are interesting, though - with fsync off Windows and Linux
are in the same performance ballpark.)

Yes, this is definitely interesting. It confirms Merlin's signs of I/O
being what kills the win32 version. IPC etc is a bit slower, but not
significantly.

In anything I've done, Windows is very slow when you use fsync or the
Windows API equivalent.

This is what we have discovered. AFAIK, all other major databases or
other similar apps (like exchange or AD) all open files with
FILE_FLAG_WRITE_THROUGH and do *not* use fsync. It might give noticeably
better performance with an O_DIRECT style WAL logging at least. But I'm
unsure if the current code for O_DIRECT works on win32 - I think it
needs some fixing for that. Which might be worth looking at for 8.1.

Not much to do about the bgwriter, the way it is designed it *has* to
fsync during checkpoint. The Other Databases implement their own cache
and write data files directly also, but pg is designed to have the OS
cache helping out. Bypassing it would not be good for performance.

If you need the performance, you had better have the machine hooked up to
a UPS (probably a good idea in any case) and set up something that is
triggered by the UPS running down to signal PostgreSQL to do an immediate
shutdown.

UPS will not help you. UPS does not help you if the OS crashes (hey,
you're on windows, this *does* happen). UPS does not help you if
somebody accidentally pulls the plug between the UPS and the server. UPS
does not help you if your server overheats and shuts down.
Bottom line, there are lots of cases when a UPS does not help. Having
a UPS (preferably redundant UPSes feeding redundant power supplies -
this is not at all expensive today) is certainly a good thing, but it is
*not* a replacement for fsync. On *any* platform.

//Magnus

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#12)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

This is what we have discovered. AFAIK, all other major databases or
other similar apps (like exchange or AD) all open files with
FILE_FLAG_WRITE_THROUGH and do *not* use fsync. It might give noticeably
better performance with an O_DIRECT style WAL logging at least. But I'm
unsure if the current code for O_DIRECT works on win32 - I think it
needs some fixing for that. Which might be worth looking at for 8.1.

Doesn't Windows support O_SYNC (or even better O_DSYNC) flag to open()?
That should be the Posixy spelling of FILE_FLAG_WRITE_THROUGH, if the
latter means what I suppose it does.

Not much to do about the bgwriter, the way it is designed it *has* to
fsync during checkpoint.

Theoretically at least, the fsync during checkpoints should not be a
performance killer. The issue that's at hand here is fsyncing the WAL,
and the reason we need that is (a) to be sure a transaction is committed
when we say it is, and (b) to be sure that WAL writes hit disk before
associated data file updates do (it's write AHEAD log remember). Direct
writes of WAL should be fine.

So: try O_SYNC instead of fsync for WAL, ie, wal_sync_method =
open_sync or open_datasync.

regards, tom lane
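
As a rough illustration of the suggestion in this message, here is a minimal Python sketch of the open_sync / open_datasync idea: the file is opened with a sync flag so that each write() is itself synchronous and no separate fsync() call is made. The helper names `sync_open_flag` and `write_synced` are invented for this example, and the sketch assumes a platform whose `os` module exposes `O_DSYNC` or `O_SYNC`; where neither exists it raises instead of silently writing unsynced.

```python
import os
import tempfile


def sync_open_flag(prefer_datasync: bool = True) -> int:
    """Return O_DSYNC or O_SYNC if this platform's os module defines them, else 0."""
    dsync = getattr(os, "O_DSYNC", 0)
    sync = getattr(os, "O_SYNC", 0)
    if prefer_datasync and dsync:
        return dsync
    return sync or dsync


def write_synced(path: str, data: bytes) -> int:
    """Write through a descriptor opened with a sync flag; no fsync() call."""
    flag = sync_open_flag()
    if not flag:
        raise OSError("no O_SYNC/O_DSYNC available on this platform")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | flag, 0o600)
    try:
        # With the sync flag set, write() returns once the OS reports
        # the data as written through.
        return os.write(fd, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_synced(os.path.join(d, "wal.seg"), b"\0" * 8192))
```

The difference from the fsync approach is where the waiting happens: inside every write() rather than in one explicit call after a batch of writes.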

#14Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#13)
Re: win32 performance - fsync question

This is what we have discovered. AFAIK, all other major databases or
other similar apps (like exchange or AD) all open files with
FILE_FLAG_WRITE_THROUGH and do *not* use fsync. It might give noticeably
better performance with an O_DIRECT style WAL logging at least. But I'm
unsure if the current code for O_DIRECT works on win32 - I think it
needs some fixing for that. Which might be worth looking at for 8.1.

Doesn't Windows support O_SYNC (or even better O_DSYNC) flag to open()?
That should be the Posixy spelling of FILE_FLAG_WRITE_THROUGH, if the
latter means what I suppose it does.

They should, but someone said it didn't work. I haven't followed up on
it, though, so it is quite possible it works. If so, it is definitely
worth trying.

Not much to do about the bgwriter, the way it is designed it *has* to
fsync during checkpoint.

Theoretically at least, the fsync during checkpoints should not be a
performance killer.

If you run a tight benchmark past a checkpoint, it will have an effect
if the fsync takes twice as long as it does on unix. If the checkpoint
happens when other I/O is fairly low then it should not have an effect.

Merlin, was that by any chance you? We've been talking about these
things quite a lot :-)

So: try O_SYNC instead of fsync for WAL, ie, wal_sync_method =
open_sync or open_datasync.

Definitely worth checking out.

//Magnus

#15Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#14)
Re: win32 performance - fsync question

Things worth experimenting with (these are all untested, so please
report any successes):
1) Try reformatting with a cluster size of 8Kb (the pg page size), if
you can.
2) Disable the last access time (like noatime on linux). "fsutil
behavior set disablelastaccess 1"
3) Disable 8.3 filenames "fsutil behavior set disable8dot3 1"

2 and 3 may require a reboot.

(2 and 3 can be done on earlier windows through registry settings only,
in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem)

I've repeated the test under 2 and 3 - no noticeable difference. With
disablelastaccess I got about 10% - 15% better results, but it is not
too significant.

Actually, that's enough to care about in a real world deployment.

Finally I tried

fsync = false

and got 580-620 tps. So, the short summary:

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux 800 tps

This Linux figure should really be compared to the WinXP fsync=false
figure, since you have write caching on. The interesting one to compare
with is the other one you did:

Linux w/o write cache 80-90 tps

Which is still faster than windows, but not as much faster.

The general question is - does PostgreSQL really need fsync? I suppose it
is a question for design, not platform-specific one. It sounds like only
one scenario, when fsync is useful, is to interprocess communication via
open file. But PostgreSQL utilize IPC for this, so does fsync is really
required?

No, fsync is used to make sure your data is committed to disk once you
commit a transaction. IPC is handled through shared memory and named
pipes.

//Magnus

#16Evgeny Rodichev
er@sai.msu.su
In reply to: Andrew Dunstan (#11)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005, Andrew Dunstan wrote:

(the results are interesting, though - with fsync off Windows and Linux are
in the same performance ballpark.)

Some addition:

WinXP  fsync = true    20-28 tps
WinXP  fsync = false   600 tps
Linux  fsync = true    800 tps
Linux  fsync = false   980 tps

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#17Magnus Hagander
mha@sollentuna.net
In reply to: Evgeny Rodichev (#16)
Re: win32 performance - fsync question

Doesn't Windows support O_SYNC (or even better O_DSYNC) flag to open()?
That should be the Posixy spelling of FILE_FLAG_WRITE_THROUGH, if the
latter means what I suppose it does.

They should, but someone said it didn't work. I haven't
followed up on it, though, so it is quite possible it works.
If so, it is definitely worth trying.

Update on that. There is no O_SYNC nor O_DSYNC. They just aren't there.

However, we already have win32_open (in port/open.c) which is used to
open these files. We could probably add code there to check for O_SYNC
and map it to the correct win32 flags for CreateFile (because the
support certainly is there).

To make this happen, is it enough to define O_DSYNC in the win32 port
include file, and then implement it in the open call? Or do I need to
hack xlog.c? The comment claims it's hackery ;-), so I figured I should
verify that before actually testing things.

Oh, and finally. The win32 commands have the following options:
FILE_FLAG_NO_BUFFERING. This disables the cache completely. It also has
lots of limits, like every read and write has to be on a sector boundary
etc. It gives great performance with async I/O, because it bypasses the
memory manager. It appears to be like O_DIRECT on linux?

FILE_FLAG_WRITE_THROUGH:
"
Instructs the system to write through any intermediate cache and go
directly to disk.

If FILE_FLAG_NO_BUFFERING is not also specified, so that system caching
is in effect, then the data is written to the system cache, but is
flushed to disk without delay.

If FILE_FLAG_NO_BUFFERING is also specified, so that system caching is
not in effect, then the data is immediately flushed to disk without
going through the system cache. The operating system also requests a
write-through the hard disk cache to persistent media. However, not all
hardware supports this write-through capability.
"

It seems to me FILE_FLAG_NO_BUFFERING is the same as O_DSYNC. (A
different place in the docs says "Also, the file metadata may still be
cached. To flush the metadata to disk, use the FlushFileBuffers
function.", so it seems it's more DSYNC than SYNC)

//Magnus

#18Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Magnus Hagander (#17)
Re: win32 performance - fsync question

Doesn't Windows support O_SYNC (or even better O_DSYNC) flag to open()?
That should be the Posixy spelling of FILE_FLAG_WRITE_THROUGH, if the
latter means what I suppose it does.

They should, but someone said it didn't work. I haven't followed up on
it, though, so it is quite possible it works. If so, it is definitely
worth trying.

Yes, and the other issue is that FlushFileBuffers() does not play nice
with raid controllers, it actually overrides their write caching so that
you can not get around the fsync performance issue using raid + bbu on
most configurations.

Not much to do about the bgwriter, the way it is designed it *has* to
fsync during checkpoint.

Theoretically at least, the fsync during checkpoints should not be a
performance killer.

I agree: it's the WAL sync that is the problem. I don't mind a slower
sync during checkpoint because that is controllable. However, there is
also the raid issue.

If you run a tight benchmark past a checkpoint, it will have an effect
if the fsync takes twice as long as it does on unix. If the checkpoint
happens when other I/O is fairly low then it should not have an effect.

Merlin, was that by any chance you? We've been talking about these
things quite a lot :-)

So: try O_SYNC instead of fsync for WAL, ie, wal_sync_method =
open_sync or open_datasync.

Definitely worth checking out.

Yeah.

Merlin

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#17)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

Oh, and finally. The win32 commands have the following options:
FILE_FLAG_NO_BUFFERING. This disables the cache completely. It also has
lots of limits, like every read and write has to be on a sector boundary
etc. It gives great performance with async I/O, because it bypasses the
memory manager. It appears to be like O_DIRECT on linux?

FILE_FLAG_WRITE_THROUGH:
"
Instructs the system to write through any intermediate cache and go
directly to disk.

If FILE_FLAG_NO_BUFFERING is not also specified, so that system caching
is in effect, then the data is written to the system cache, but is
flushed to disk without delay.

If FILE_FLAG_NO_BUFFERING is also specified, so that system caching is
not in effect, then the data is immediately flushed to disk without
going through the system cache. The operating system also requests a
write-through the hard disk cache to persistent media. However, not all
hardware supports this write-through capability.
"

AFAICS it would make sense for us to specify both of those flags for WAL
writes.

We could either hack win32_open() to translate O_SYNC to those flags,
or make xlog.c aware of the Windows spellings of the flags. Probably
the former is less painful given that open.c already does wholesale
translations of open() flags.

One point that I no longer recall the reasoning behind is that xlog.c
doesn't think O_SYNC is a preferable default over fsync. We'd certainly
want to hack xlog.c to change its mind about that, at least on Windows;
assuming that the FILE_FLAG way is indeed faster.

regards, tom lane
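
The flag translation proposed in this message could look roughly like the sketch below. It is written in Python purely to show the mapping logic and is not the code in port/open.c. The two FILE_FLAG values are the ones defined in the Windows headers; `HYPOTHETICAL_O_DSYNC` and the function name `create_file_attributes` are invented here, since the thread notes that Windows itself defines no O_SYNC or O_DSYNC.

```python
# Values of the CreateFile flags as defined in the Windows SDK headers.
FILE_FLAG_WRITE_THROUGH = 0x80000000
FILE_FLAG_NO_BUFFERING = 0x20000000

# Hypothetical bit a port layer might reserve for "open synchronously".
# The value is an assumption for illustration only.
HYPOTHETICAL_O_DSYNC = 0x0080


def create_file_attributes(open_flags: int) -> int:
    """Translate POSIX-style open() flags into CreateFile flag bits."""
    attrs = 0
    if open_flags & HYPOTHETICAL_O_DSYNC:
        # Specify both flags for WAL writes, as suggested in the thread.
        attrs |= FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING
    return attrs


if __name__ == "__main__":
    print(hex(create_file_attributes(HYPOTHETICAL_O_DSYNC)))
    print(hex(create_file_attributes(0)))
```

Keeping the translation in one wrapper means callers such as the WAL code can keep using the POSIX spelling and stay unaware of the Windows flags.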

#20Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Evgeny Rodichev (#16)
Re: win32 performance - fsync question

Some addition:

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux fsync = true 800 tps
Linux fsync = false 980 tps

Wow, that's terrible on Windows. If there's a solution, it'd be nice to
backport it...

Chris

#21Evgeny Rodichev
er@sai.msu.su
In reply to: Magnus Hagander (#15)
Re: win32 performance - fsync question

There are two different concerns here.

1. transactions loss because of unexpected power loss and/or system failure
2. inconsistent database state

For many application (1) is fairly acceptable, and (2) is not.

So I'd like to formulate my questions by another way.

- if PostgreSQL is running without fsync, and power loss occur, which kind
of damage is possible? 1, 2, or both?

- it looks like with proper fwrite/fflush policy it is possible to
guarantee that only transactions loss may occur, but database
keeps some consistent state as before (several) last transactions.
Is it true for PostgreSQL?

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#22Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Evgeny Rodichev (#21)
Re: win32 performance - fsync question

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux fsync = true 800 tps
Linux fsync = false 980 tps

Wow, that's terrible on Windows. If there's a solution, it'd be nice to
backport it...

there is. I just rigged up a test benchmark comparing sync methods. I
ran on 2 boxes, my xp workstation on 10k raptor and a win2k server on
3ware raid 5 (also on 10k raptors).

Workstation:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 5.729633 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 0.593322 seconds
did 1000 flushfilebuffers writes in 15.898989 seconds

server:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 16.501076 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 16.104133 seconds
did 1000 flushfilebuffers writes in 18.962439 seconds

server after running super ultra secret dskcache '+p' mode:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 0.256574 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 2.627602 seconds
did 1000 flushfilebuffers writes in 15.290967 seconds

dskcache.exe is required to enable power protect mode (unbypassing raid
controller write cache settings) on win2k.

enjoy.
Merlin
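
For readers who want to try a comparison of this kind themselves, here is a minimal sketch of such a timing loop. It is not the benchmark program used in this message, whose source is not shown in the thread. It is a POSIX-flavoured analogue in Python that times write-plus-fsync against writes on a descriptor opened with `O_DSYNC` (where the platform has it). The 8K write size and the file names are assumptions.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 8192  # assumed write size: one 8K page per write


def time_writes(path: str, count: int, extra_flag: int, call_fsync: bool) -> float:
    """Time `count` sequential writes, optionally followed by fsync each time."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flag, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, BLOCK)
            if call_fsync:
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare_sync_methods(directory: str, count: int = 1000) -> dict:
    """Return elapsed seconds per sync method, keyed by a descriptive name."""
    results = {"fsync": time_writes(os.path.join(directory, "a.dat"), count, 0, True)}
    dsync = getattr(os, "O_DSYNC", 0)  # absent on some platforms
    if dsync:
        results["open_datasync"] = time_writes(
            os.path.join(directory, "b.dat"), count, dsync, False
        )
    return results
```

Calling `compare_sync_methods(some_directory)` returns a dictionary of elapsed times. The numbers mean little unless the drive's write cache settings are known, which is the same caveat that applies to the figures in this thread.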

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#20)
Re: win32 performance - fsync question

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux fsync = true 800 tps
Linux fsync = false 980 tps

Wow, that's terrible on Windows. If there's a solution, it'd be nice to
backport it...

Actually, the number that's way out of line there is the Linux w/fsync
one. I infer that he's got disk write cache enabled and therefore the
transactions aren't really being synced to disk at all.

Any claimed TPS rate exceeding your disk drive's rotation rate is a
red flag.

regards, tom lane
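
The rotation-rate rule of thumb in this message is simple arithmetic, sketched below. The sketch assumes a single client committing one transaction at a time to one spinning disk, so that each synced commit waits at most once per platter rotation. The 7200 rpm figure in the usage lines is an assumed drive speed, since the thread does not say which drive the Linux box used, and the function names are invented for the example.

```python
def max_serial_commits_per_second(rpm: int) -> float:
    """Upper bound on synced commits/sec for one client on one spinning disk.

    Assumes at most one synced WAL write per platter rotation.
    """
    return rpm / 60.0


def looks_suspicious(tps: float, rpm: int) -> bool:
    """True if a claimed TPS rate exceeds the drive's rotations per second."""
    return tps > max_serial_commits_per_second(rpm)


if __name__ == "__main__":
    print(max_serial_commits_per_second(7200))  # 120.0 rotations per second
    print(looks_suspicious(800, 7200))          # the reported Linux fsync figure
    print(looks_suspicious(28, 7200))           # the reported WinXP fsync figure
```

Several concurrent clients whose commits are flushed together can legitimately exceed this bound, so the check is a warning sign for single-client benchmarks rather than a hard limit.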

#24Richard Huxton
dev@archonet.com
In reply to: Evgeny Rodichev (#21)
Re: win32 performance - fsync question

Evgeny Rodichev wrote:

There are two different concerns here.

1. transactions loss because of unexpected power loss and/or system failure
2. inconsistent database state

For many application (1) is fairly acceptable, and (2) is not.

So I'd like to formulate my questions by another way.

- if PostgreSQL is running without fsync, and power loss occur, which kind
of damage is possible? 1, 2, or both?

Both. If 1 can happen then 2 can happen.

- it looks like with proper fwrite/fflush policy it is possible to
guarantee that only transactions loss may occur, but database
keeps some consistent state as before (several) last transactions.
Is it true for PostgreSQL?

No - if fsync is on and the transaction is reported as committed then it
should still be there when the power returns. Provided you don't suffer
hardware failure you should be able to rely on a committed transaction
actually being written to disk. That's what fsync does for you.

--
Richard Huxton
Archonet Ltd

#25Magnus Hagander
mha@sollentuna.net
In reply to: Richard Huxton (#24)
Re: win32 performance - fsync question

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux fsync = true 800 tps
Linux fsync = false 980 tps

Wow, that's terrible on Windows. If there's a solution, it'd be nice to
backport it...

there is. I just rigged up a test benchmark comparing sync methods. I
ran on 2 boxes, my xp workstation on 10k raptor and a win2k server on
3ware raid 5 (also on 10k raptors).

Workstation:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 5.729633 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 0.593322 seconds
did 1000 flushfilebuffers writes in 15.898989 seconds

server:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 16.501076 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 16.104133 seconds
did 1000 flushfilebuffers writes in 18.962439 seconds

server after running super ultra secret dskcache '+p' mode:
did 1000 FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING writes in 0.256574 seconds
did 1000 FILE_FLAG_WRITE_THROUGH writes in 2.627602 seconds
did 1000 flushfilebuffers writes in 15.290967 seconds

dskcache.exe is required to enable power protect mode (unbypassing raid
controller write cache settings) on win2k.

I draw the following conclusions:
1) Using just FILE_FLAG_WRITE_THROUGH is not enough. It sends it out of
the cache, but it returns to the application before the data has hit
disk. AFAIK, that's not good enough for us.

2) Using both, we can get a *significant* speed boost.

Tom, if you look at all the requirements of FILE_FLAG_NO_BUFFERING on
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createfile.asp,
can you say offhand if the WAL code fulfills them?
If it does, we can probably just hack it in win32_open (at least for
testing and a possible backpatch). If not, then we'll need to stuff code
in xlog.c.
(Specifically, I'm most worried about the memory alignment requirement)

//Magnus

#26Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Magnus Hagander (#25)
Re: win32 performance - fsync question

One point that I no longer recall the reasoning behind is that xlog.c
doesn't think O_SYNC is a preferable default over fsync. We'd certainly
want to hack xlog.c to change its mind about that, at least on Windows;
assuming that the FILE_FLAG way is indeed faster.

I also confirmed that the totally un-cached mode in windows
(FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING) will only work if the
amount of data written is some multiple of 512 bytes. Can WAL work
under this restriction?

Merlin

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#25)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

Tom, if you look at all the requirements of FILE_FLAG_NO_BUFFERING on
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
base/createfile.asp, can you say offhand if the WAL code fulfills them?

If I'm reading it right, you are referring to:

File access must begin at byte offsets within the file that are
integer multiples of the volume's sector size.

File access must be for numbers of bytes that are integer multiples
of the volume's sector size. For example, if the sector size is 512
bytes, an application can request reads and writes of 512, 1024, or
2048 bytes, but not of 335, 981, or 7171 bytes.

Buffer addresses for read and write operations should be sector
aligned (aligned on addresses in memory that are integer multiples
of the volume's sector size). Depending on the disk, this
requirement may not be enforced.

1 and 2 should be no problem since we only read or write integral pages
(8K). 3 is a bit bogus IMHO, or even a lot bogus. You can set
ALIGNOF_BUFFER in src/include/pg_config_manual.h to whatever you think
the alignment requirement really needs to be (I'd try 512).

regards, tom lane
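
The three quoted requirements can be checked mechanically before issuing an unbuffered write. The following is only a sketch: `unbuffered_io_ok` is a hypothetical helper (it is not part of PostgreSQL or the Win32 API), and the sector size is passed in by the caller rather than queried from the volume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if an I/O request satisfies the three FILE_FLAG_NO_BUFFERING
 * rules quoted above for the given sector size, otherwise 0.
 */
int
unbuffered_io_ok(uint64_t offset, size_t nbytes, const void *buf,
                 size_t sector_size)
{
    assert(sector_size > 0);

    /* Rule 1: the file offset must be a multiple of the sector size. */
    if (offset % sector_size != 0)
        return 0;

    /* Rule 2: the transfer length must be a multiple of the sector size. */
    if (nbytes == 0 || nbytes % sector_size != 0)
        return 0;

    /* Rule 3: the buffer address must be sector aligned. */
    if ((uintptr_t) buf % sector_size != 0)
        return 0;

    return 1;
}
```

With 8K pages written at page-aligned offsets, rules 1 and 2 hold for a 512-byte sector; only rule 3 depends on how the buffer was allocated, which is what the ALIGNOF_BUFFER suggestion addresses.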

#28Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Tom Lane (#27)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

Tom, if you look at all the requirements of FILE_FLAG_NO_BUFFERING on
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
base/createfile.asp, can you say offhand if the WAL code fulfills them?

If I'm reading it right, you are referring to:

File access must begin at byte offsets within the file that are
integer multiples of the volume's sector size.

File access must be for numbers of bytes that are integer multiples
of the volume's sector size. For example, if the sector size is 512
bytes, an application can request reads and writes of 512, 1024, or
2048 bytes, but not of 335, 981, or 7171 bytes.

Buffer addresses for read and write operations should be sector
aligned (aligned on addresses in memory that are integer multiples
of the volume's sector size). Depending on the disk, this
requirement may not be enforced.

1 and 2 should be no problem since we only read or write integral pages
(8K). 3 is a bit bogus IMHO, or even a lot bogus. You can set
ALIGNOF_BUFFER in src/include/pg_config_manual.h to whatever you think
the alignment requirement really needs to be (I'd try 512).

After multiple runs on different block sizes (a few anomalous results
aside), I didn't see a whole lot of difference between
FILE_FLAG_NO_BUFFERING being on or off for writing performance.
However, with NO_BUFFERING set, the file is not *read* cached at all.
While the performance is not terrible for reads, some careful
consideration would have to be given for using it outside of WAL. For
WAL, though, it seems perfect. If my results are to be believed, we can
expect up to 30 (yes, that's three + zero) times faster sync performance
by ditching FlushFileBuffers (although probably far less in practice).

Applying FILE_FLAG_WRITE_THROUGH to non WAL data files will give similar
speedups to checkpoints, but right now I'm making no assumptions about
the safety issue. I'd like to point out here that using the
FlushFileBuffers() sync approach it was impossible to get my 3ware raid
controller to cache the writes at all. This means that unless we change
the sync method for data files, win32 will always have horrible
checkpoint performance (and I do mean horrible).

My suggestion would be to use FILE_FLAG_NO_BUFFERING |
FILE_FLAG_WRITE_THROUGH for WAL, and FILE_FLAG_WRITE_THROUGH for
everything else. Then it's time to power-fail test etc. and make sure
things work the way they are supposed to.

By the way, by some quirk of fate, 8k seems to be a fairly good choice
of block size. 4k block sizes give slightly lower latency but not
nearly as much throughput.

Merlin

#29Evgeny Rodichev
er@sai.msu.su
In reply to: Tom Lane (#23)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005, Tom Lane wrote:

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

WinXP fsync = true 20-28 tps
WinXP fsync = false 600 tps
Linux fsync = true 800 tps
Linux fsync = false 980 tps

Wow, that's terrible on Windows. If there's a solution, it'd be nice to
backport it...

Actually, the number that's way out of line there is the Linux w/fsync
one. I infer that he's got disk write cache enabled and therefore the
transactions aren't really being synced to disk at all.

Any claimed TPS rate exceeding your disk drive's rotation rate is a
red flag.

Write cache is enabled under Linux by default all the time I make deal
with it (since 1993).

It doesn't interfere with fsync(), as linux kernel uses cache flush for
fsync.

I have 2.6.10 kernel running *without* any additional patches, and without
any specific hdparm settings.

fsync() really works fine as I switch off my notebook everyday 2-3 times,
and never had any data loss :)

Related stuff from dmesg is

hda: cache flushes supported

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Evgeny Rodichev (#29)
Re: win32 performance - fsync question

Evgeny Rodichev <er@sai.msu.su> writes:

Any claimed TPS rate exceeding your disk drive's rotation rate is a
red flag.

Write cache is enabled under Linux by default all the time I make deal
with it (since 1993).

You're playing with fire.

fsync() really works fine as I switch off my notebook everyday 2-3 times,
and never had any data loss :)

Given that it's a notebook, it's possible that the hardware is smart
enough not to power down the disk until the disk is done writing
everything it's cached. Do you care to try some experiments with
pulling out the battery while Postgres is busy making updates?

regards, tom lane
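
The rotation-rate rule of thumb can be written down as arithmetic. This is a minimal sketch under a stated assumption: a single session commits one transaction at a time, and each commit has to wait for its WAL record to reach the platter, so at most about one commit completes per rotation. The function names and the 7200 rpm figure used in the note below are assumptions for illustration; the thread does not state the drive speed.

```c
#include <assert.h>

/*
 * Rough upper bound on serialized, truly synced commits per second for a
 * drive spinning at `rpm`: about one commit per platter rotation.
 */
double
max_synced_commits_per_sec(double rpm)
{
    assert(rpm > 0.0);
    return rpm / 60.0;
}

/* Returns 1 when a claimed TPS figure exceeds that bound, otherwise 0. */
int
tps_is_suspicious(double claimed_tps, double rpm)
{
    return claimed_tps > max_synced_commits_per_sec(rpm);
}
```

For an assumed 7200 rpm drive the bound is 120 commits per second, so the reported 800 tps with fsync on would be flagged while the 20-28 tps Windows figure would not. Concurrent sessions whose commits are flushed together can legitimately exceed the bound, so treat it as a red flag rather than proof.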

#31Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#30)
Re: win32 performance - fsync question

After multiple runs on different block sizes (a few anomalous results
aside), I didn't see a whole lot of difference between
FILE_FLAG_NO_BUFFERING being on or off for writing performance.
However, with NO_BUFFERING set, the file is not *read* cached at all.
While the performance is not terrible for reads, some careful
consideration would have to be given for using it outside of WAL. For
WAL, though, it seems perfect. If my results are to be believed, we can
expect up to 30 (yes, that's three + zero) times faster sync performance
by ditching FlushFileBuffers (although probably far less in practice).

Yes, for WAL it won't blow away read-cache stuff, since we normally
don't expect to read the data that's in WAL.

Is there actually a reason why we don't use O_DIRECT on Unix? From what
I can tell, O_SYNC does the write through but also puts it in the cache,
whereas O_DIRECT doesn't "waste cache" on it?

I was thinking of using O_DIRECT as the "compatibility flag" for the
combination of FILE_FLAG_WRITE_THROUGH and NO_BUFFERING, and using
O_SYNC for just the WRITE_THROUGH. Reasonable?

//Magnus
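
The mapping proposed here could look roughly like the sketch below. Everything prefixed `SKETCH_` or `sketch_` is a stand-in defined for this example: the two FILE_FLAG values are the ones in the Win32 headers, but the open-flag bits are arbitrary and the function is not the actual win32_open code.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary open()-style flag bits, chosen only for this sketch. */
#define SKETCH_O_SYNC    0x0080
#define SKETCH_O_DIRECT  0x0100

/* Same numeric values as the Win32 CreateFile flags of the same name. */
#define SKETCH_FILE_FLAG_WRITE_THROUGH 0x80000000u
#define SKETCH_FILE_FLAG_NO_BUFFERING  0x20000000u

/*
 * Translate the sync-related open flags into CreateFile attribute bits:
 * O_SYNC   -> WRITE_THROUGH only
 * O_DIRECT -> WRITE_THROUGH | NO_BUFFERING
 */
uint32_t
sketch_sync_attrs(int open_flags)
{
    uint32_t attrs = 0;

    /* Reject flags this sketch does not know how to handle. */
    assert((open_flags & (SKETCH_O_SYNC | SKETCH_O_DIRECT)) == open_flags);

    if (open_flags & SKETCH_O_SYNC)
        attrs |= SKETCH_FILE_FLAG_WRITE_THROUGH;
    if (open_flags & SKETCH_O_DIRECT)
        attrs |= SKETCH_FILE_FLAG_WRITE_THROUGH | SKETCH_FILE_FLAG_NO_BUFFERING;

    return attrs;
}
```

Whether O_DIRECT on Unix really has the same durability meaning as this flag combination is exactly the open question in the thread, so the table is a naming convenience, not a semantic guarantee.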

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#31)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

Is there actually a reason why we don't use O_DIRECT on Unix?

Portability, or rather the complete lack of it. Stuff that isn't in the
Single Unix Spec is a hard sell.

regards, tom lane

#33Oliver Jowett
oliver@opencloud.com
In reply to: Evgeny Rodichev (#29)
Re: win32 performance - fsync question

Evgeny Rodichev wrote:

Write cache is enabled under Linux by default all the time I make deal
with it (since 1993).

It doesn't interfere with fsync(), as linux kernel uses cache flush for
fsync.

The problem is that most IDE drives lie (or perhaps you could say the
specification is ambiguous) about completion of the cache-flush command
-- they say "Yeah, I've flushed" when they have not actually written the
data to the media and have no provision for making sure it will get
there in the event of power failure.

So Linux is indeed doing a cache flush on fsync, but the hardware is not
behaving as expected. By turning off the write-cache on the disk via
hdparm, you manage to get the hardware to behave better. The kernel is
caching anyway, so the loss of the drive's write cache doesn't make a
big difference.

There was some work done for better IDE write-barrier support (related
to TCQ/SATA support?) in the kernel, but I'm not sure how far that has
progressed.

-O

#34Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Oliver Jowett (#33)
Re: win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

Is there actually a reason why we don't use O_DIRECT on Unix?

Portability, or rather the complete lack of it. Stuff that isn't in the
Single Unix Spec is a hard sell.

Well, how about this (ok, maybe I'm way out in left field):
Change fsync option from on/off to on/off/O_SYNC. On win32 we treat
O_SYNC as opened with FILE_FLAG_WRITE_THROUGH. When we are in O_SYNC
mode, all files, WAL or otherwise, are assumed to be synced when written
and are therefore not synced during pg_fsync(). WAL syncing may of
course be overridden using alternate sync methods in postgresql.conf.

I suspect that this will drastically alter windows performance,
especially on raid systems. What is TBD is the safety aspect. What I
like about this is that now we are not dealing with a win32-only hack; any
unix system now has another performance setting to play with. We also
don't touch the O_DIRECT flag (on win32: FILE_FLAG_WRITE_THROUGH |
FILE_FLAG_NO_BUFFERING) leaving that can of worms for another day.

Under normal situations, we would expect O_SYNCing everything all the
time to slow stuff down, especially during checkpoints, but it might
actually help on a caching raid controller. On win32, it will help
because the performance of fsync() sucks so horribly, even on raid.

Merlin
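
As a rough illustration of the on/off/O_SYNC idea, the setting could be modeled as a three-state value that decides both the flags used when opening a file and whether a later sync call has any work left to do. This is a sketch of the proposal only; the enum and both functions are hypothetical names, not existing PostgreSQL code.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical three-state replacement for the boolean fsync option. */
typedef enum
{
    SKETCH_FSYNC_OFF,       /* never sync */
    SKETCH_FSYNC_ON,        /* write normally, fsync() explicitly */
    SKETCH_FSYNC_OPEN_SYNC  /* open everything with O_SYNC */
} SketchFsyncMode;

/* Extra open() flags implied by the mode. */
int
sketch_open_flags(SketchFsyncMode mode)
{
    return (mode == SKETCH_FSYNC_OPEN_SYNC) ? O_SYNC : 0;
}

/* Whether a pg_fsync()-style call still has to issue a real fsync(). */
int
sketch_needs_fsync(SketchFsyncMode mode)
{
    switch (mode)
    {
        case SKETCH_FSYNC_OFF:
            return 0;           /* syncing disabled entirely */
        case SKETCH_FSYNC_ON:
            return 1;           /* classic write-then-fsync */
        case SKETCH_FSYNC_OPEN_SYNC:
            return 0;           /* writes were already synchronous */
    }
    assert(!"unknown fsync mode");
    return 0;
}
```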

#35Greg Stark
gsstark@mit.edu
In reply to: Oliver Jowett (#33)
Re: win32 performance - fsync question

Oliver Jowett <oliver@opencloud.com> writes:

So Linux is indeed doing a cache flush on fsync

Actually I think the root of the problem was precisely that Linux does not
issue any sort of cache flush commands to drives on fsync. There was some talk
on linux-kernel of how they could take advantage of new ATA features
planned on new SATA drives coming out now to solve this. But they didn't seem
to think it was urgent or worth the performance hit of doing a complete cache
flush.

--
greg

#36Oliver Jowett
oliver@opencloud.com
In reply to: Greg Stark (#35)
Re: win32 performance - fsync question

Greg Stark wrote:

Oliver Jowett <oliver@opencloud.com> writes:

So Linux is indeed doing a cache flush on fsync

Actually I think the root of the problem was precisely that Linux does not
issue any sort of cache flush commands to drives on fsync. There was some talk
on linux-kernel of how they could take advantage of new ATA features
planned on new SATA drives coming out now to solve this. But they didn't seem
to think it was urgent or worth the performance hit of doing a complete cache
flush.

Oh, ok. I haven't really kept up to date with it; I just run with
write-cache disabled on my IDE drives as a matter of course.

I did see this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0304.1/0471.html

which implies you're never going to get an implementation that is safe
across all IDE hardware :(

-O

#37Evgeny Rodichev
er@sai.msu.su
In reply to: Tom Lane (#30)
Re: win32 performance - fsync question

On Thu, 17 Feb 2005, Tom Lane wrote:

Evgeny Rodichev <er@sai.msu.su> writes:

Any claimed TPS rate exceeding your disk drive's rotation rate is a
red flag.

Write cache is enabled under Linux by default all the time I make deal
with it (since 1993).

You're playing with fire.

Yes. I'm lucky in this play :)

More seriously, we (with Oleg Bartunov) investigated many platforms/OS
for commercial, scientific and other applications during past 10-12
years. I suppose, virtually all excluding modern mainframes.

For reliability Linux + PostgreSQL was found the best one (including the
environment with very frequent unexpected power-off, as at some astronomical
observatories at high mountains).

Hence, I'm lucky :)

fsync() really works fine as I switch off my notebook everyday 2-3 times,
and never had any data loss :)

Given that it's a notebook, it's possible that the hardware is smart
enough not to power down the disk until the disk is done writing
everything it's cached. Do you care to try some experiments with
pulling out the battery while Postgres is busy making updates?

Yes, you are exactly right. All modern HDDs (not entry level ones) have
a huge cache (at device, not at controller), and provide the safe hardware
flush of cache *after* power off (thanks to capacitors). My HDD has 16MB cache,
and it is the reason for excellent performance.

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#38Evgeny Rodichev
er@sai.msu.su
In reply to: Oliver Jowett (#33)
Re: win32 performance - fsync question

On Fri, 18 Feb 2005, Oliver Jowett wrote:

Evgeny Rodichev wrote:

Write cache is enabled under Linux by default all the time I make deal
with it (since 1993).

It doesn't interfere with fsync(), as linux kernel uses cache flush for
fsync.

The problem is that most IDE drives lie (or perhaps you could say the
specification is ambiguous) about completion of the cache-flush command --
they say "Yeah, I've flushed" when they have not actually written the data to
the media and have no provision for making sure it will get there in the
event of power failure.

Yes, I agree. But in my real SA practice I've met 50-100 times the situation
when HDD were unexpectedly physically corrupted (the heads touch a surface),
without possibility to restore. And I never met any corruption because of
possible "hardware lie".

So Linux is indeed doing a cache flush on fsync, but the hardware is not
behaving as expected. By turning off the write-cache on the disk via hdparm,
you manage to get the hardware to behave better. The kernel is caching
anyway, so the loss of the drive's write cache doesn't make a big difference.

Again, in practice, it is different. FreeBSD had a "true" flush (at least
2-3 years ago, not sure about the modern versions), and for write-intensive
applications it was a bit slower (comparing with linux), but it never was
more reliable (since 1996, at least).

Another practical example is Google :) Isn't it reliable?

There was some work done for better IDE write-barrier support (related to
TCQ/SATA support?) in the kernel, but I'm not sure how far that has
progressed.

Yes, but IMHO it is not stable enough at the moment.

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#39Evgeny Rodichev
er@sai.msu.su
In reply to: Greg Stark (#35)
Re: win32 performance - fsync question

On Fri, 17 Feb 2005, Greg Stark wrote:

Oliver Jowett <oliver@opencloud.com> writes:

So Linux is indeed doing a cache flush on fsync

Actually I think the root of the problem was precisely that Linux does not
issue any sort of cache flush commands to drives on fsync.

No, it does. Let's try the simplest test:

for (i = 0; i < LEN; i++) {
write (fd, buf, 512);
if (sync) fsync (fd);
}

with sync = 0 and 1, and you'll see the difference.
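
A self-contained version of this test might look like the sketch below. It assumes a POSIX system; `timed_writes` and its parameters are names made up for the example, and LEN from the loop above becomes the `count` argument.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write `count` 512-byte blocks to `path`, calling fsync() after every
 * write when do_sync is nonzero. Returns elapsed seconds, or -1.0 on error.
 */
double
timed_writes(const char *path, int count, int do_sync)
{
    char            buf[512];
    struct timespec t0, t1;
    int             fd, i;

    assert(path != NULL && count >= 0);
    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < count; i++)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf) ||
            (do_sync && fsync(fd) != 0))
        {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    return (double) (t1.tv_sec - t0.tv_sec) +
           (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing `timed_writes(path, 1000, 0)` with `timed_writes(path, 1000, 1)` shows how much time fsync() adds. A difference shows that fsync() does some work; it does not by itself show that the drive wrote the data to the platter rather than into its own cache.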

There was some talk
on linux-kernel of how they could take advantage of new ATA features
planned on new SATA drives coming out now to solve this. But they didn't seem
to think it was urgent or worth the performance hit of doing a complete cache
flush.

It was a bit different topic.

Regards,
E.R.
_________________________________________________________________________
Evgeny Rodichev Sternberg Astronomical Institute
email: er@sai.msu.su Moscow State University
Phone: 007 (095) 939 2383
Fax: 007 (095) 932 8841 http://www.sai.msu.su/~er

#40Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Magnus Hagander (#12)
Re: win32 performance - fsync question

""Magnus Hagander"" <mha@sollentuna.net>
news:6BCB9D8A16AC4241919521715F4D8BCE4768DD@algol.sollentuna.se...

This is what we have discovered. AFAIK, all other major databases or
other similar apps (like exchange or AD) all open files with
FILE_FLAG_WRITE_THROUGH and do *not* use fsync. It might give noticeably
better performance with an O_DIRECT style WAL logging at least. But I'm
unsure if the current code for O_DIRECT works on win32 - I think it
needs some fixing for that. Which might be worth looking at for 8.1.

UPS will not help you. UPS does not help you if the OS crashes (hey,
you're on windows, this *does* happen). UPS does not help you if
somebody accidentally pulls the plug between the UPS and the server. UPS
does not help you if your server overheats and shuts down.
Bottom line, there are lots of cases when an UPS does not help. Having
an UPS (preferably redundant UPSes feeding redundant power supplies -
this is not at all expensive today) is certainly a good thing, but it is
*not* a replacement for fsync. On *any* platform.

//Magnus

Oracle9 and SQL Server 2000 use this flag. Some comments on the
lost-data-concern about FILE_FLAG_WRITE_THROUGH:

(1) Assume you just use ordinary SCSI disks with write back cache on -
you will lose your data if the server suddenly loses power;
you will *not* lose your data when the OS crashes, the server resets or
whatever, as long as the server has power;
This has been verified with Oracle9 and SQL Server 2000.

(2) Turn off write back cache in disks, you will not lose data, but you
will see your performance decreased;

(3) If you use some advanced expensive disks like the battery-equipped
ones, then you can safely enable write back cache;

So UPS is useful for ordinary SCSI disks when write back cache is enabled,
but make sure you don't let the unfortunate "somebody accidentally pulls
the plug between the UPS and the server" thing happen.

#41Greg Stark
gsstark@mit.edu
In reply to: Evgeny Rodichev (#39)
Re: win32 performance - fsync question

Evgeny Rodichev <er@sai.msu.su> writes:

No, it does. Let's try the simplest test:

for (i = 0; i < LEN; i++) {
write (fd, buf, 512);
if (sync) fsync (fd);
}

with sync = 0 and 1, and you'll see the difference.

Uh, I'm sure you'll see a difference, one will be limited by the i/o
throughput the IDE interface is capable of, the other will be limited purely
by the memory bandwidth and kernel syscall latency.

Try it with sync=1 and write caching disabled on your IDE drive and you should
see an even larger difference.

However, no filesystem and ide driver combination in linux 2.4 and afaik none
in 2.6 either issue any special ATA commands to force the drive to

There was some talk on linux-kernel of how they could take advantage
of new ATA features planned on new SATA drives coming out now to solve
this. But they didn't seem to think it was urgent or worth the performance
hit of doing a complete cache flush.

It was a bit different topic.

Well no way to tell if we're talking about the same threads. But in the
discussion I saw it was clear they were talking about adding an interface to
drivers so that filesystems could issue cache flushes when necessary to guarantee
filesystem integrity. They still didn't seem to get that users cared about
their data too, not just filesystem integrity.

--
greg

#42Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#32)
Re: win32 performance - fsync question

Tom Lane wrote:

Portability, or rather the complete lack of it. Stuff that isn't in the
Single Unix Spec is a hard sell.

O_DIRECT is reasonably common among modern Unixen (it is supported by
Linux, FreeBSD, and probably a couple of the commercial variants like
AIX or IRIX); it should also be reasonably easy to check for support at
configure time. It's on my TODO list to take a gander at adding support
for O_DIRECT for WAL, I just haven't gotten around to it yet.

-Neil

#43Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Neil Conway (#42)
Re: win32 performance - fsync question

One point that I no longer recall the reasoning behind is that xlog.c
doesn't think O_SYNC is a preferable default over fsync.

For larger (>8k) transactions O_SYNC|O_DIRECT is only good with the recent
pending patch to group WAL writes together. The fsync method gives the OS a
chance to do the grouping. (Of course it does not matter if you have small
tx < 8k WAL)

Andreas

#44Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Zeugswetter Andreas DAZ SD (#43)
Re: win32 performance - fsync question

Magnus prepared a trivial patch which added the O_SYNC flag for windows
and mapped it to FILE_FLAG_WRITE_THROUGH in win32_open.c. We pg_benched
it and here are the results of our test on my WinXP workstation on a 10k
raptor:

Settings were pgbench -t 100 -c 10.

fsync = off:
~ 280 tps

fsync on, WAL=fsync:
~ 35 tps

fsync on, WAL=open_sync write cache policy on:
~ 240 tps

fsync on, WAL=open_sync write cache policy off:
~ 80 tps

80 tps, btw, is about the results I'd expect from linux on this
hardware. Also, the open_sync method plays much nicer with RAID
devices, but it would need some more rigorous testing before I'd
personally certify it as safe. As an aside, it doesn't look like the
open_sync can be trusted with write caching policy on the disk (the
default), and that's worth noting.

Merlin

#45Magnus Hagander
mha@sollentuna.net
In reply to: Merlin Moncure (#44)
1 attachment(s)
Re: win32 performance - fsync question

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

//Magnus

Attachments:

o_sync.patch (application/octet-stream; name=o_sync.patch)
Index: include/port.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/port.h,v
retrieving revision 1.69
diff -c -r1.69 port.h
*** include/port.h	6 Jan 2005 00:59:25 -0000	1.69
--- include/port.h	17 Feb 2005 21:20:41 -0000
***************
*** 174,180 ****

  #if defined(WIN32) && !defined(__CYGWIN__)

! /* open() replacement to allow delete of held files */
  #ifndef WIN32_CLIENT_ONLY
  extern int	win32_open(const char *, int,...);

--- 174,181 ----

  #if defined(WIN32) && !defined(__CYGWIN__)

! /* open() replacement to allow delete of held files and passing
!  * of special options. */
  #ifndef WIN32_CLIENT_ONLY
  extern int	win32_open(const char *, int,...);

Index: include/port/win32.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/port/win32.h,v
retrieving revision 1.42
diff -c -r1.42 win32.h
*** include/port/win32.h	26 Dec 2004 19:20:33 -0000	1.42
--- include/port/win32.h	17 Feb 2005 21:19:31 -0000
***************
*** 184,189 ****
--- 184,197 ----
  #define lstat(path, sb)	stat((path), (sb))

  /*
+  * Supplement to <fcntl.h>.
+  * This is the same value as _O_NOINHERIT in the MS header file. This is
+  * to ensure that we don't collide with a future definition. It means
+  * we cannot use _O_NOINHERIT ourselves.
+  */
+ #define O_SYNC 0x0080
+
+ /*
   * Supplement to <errno.h>.
   */
  #undef EAGAIN
Index: port/open.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/port/open.c,v
retrieving revision 1.7
diff -c -r1.7 open.c
*** port/open.c	31 Dec 2004 22:03:53 -0000	1.7
--- port/open.c	17 Feb 2005 21:40:12 -0000
***************
*** 13,18 ****
--- 13,19 ----

  #ifdef WIN32

+ #include <postgres.h>
  #include <windows.h>
  #include <fcntl.h>
  #include <errno.h>
***************
*** 62,68 ****
  	/* Check that we can handle the request */
  	assert((fileFlags & ((O_RDONLY | O_WRONLY | O_RDWR) | O_APPEND |
  						 (O_RANDOM | O_SEQUENTIAL | O_TEMPORARY) |
! 						 _O_SHORT_LIVED |
  	  (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) == fileFlags);

  	sa.nLength = sizeof(sa);
--- 63,69 ----
  	/* Check that we can handle the request */
  	assert((fileFlags & ((O_RDONLY | O_WRONLY | O_RDWR) | O_APPEND |
  						 (O_RANDOM | O_SEQUENTIAL | O_TEMPORARY) |
! 						 _O_SHORT_LIVED | O_SYNC |
  	  (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) == fileFlags);

  	sa.nLength = sizeof(sa);
***************
*** 81,87 ****
  				 ((fileFlags & O_RANDOM) ? FILE_FLAG_RANDOM_ACCESS : 0) |
  		   ((fileFlags & O_SEQUENTIAL) ? FILE_FLAG_SEQUENTIAL_SCAN : 0) |
  		  ((fileFlags & _O_SHORT_LIVED) ? FILE_ATTRIBUTE_TEMPORARY : 0) |
! 			 ((fileFlags & O_TEMPORARY) ? FILE_FLAG_DELETE_ON_CLOSE : 0),
  						NULL)) == INVALID_HANDLE_VALUE)
  	{
  		switch (GetLastError())
--- 82,89 ----
  				 ((fileFlags & O_RANDOM) ? FILE_FLAG_RANDOM_ACCESS : 0) |
  		   ((fileFlags & O_SEQUENTIAL) ? FILE_FLAG_SEQUENTIAL_SCAN : 0) |
  		  ((fileFlags & _O_SHORT_LIVED) ? FILE_ATTRIBUTE_TEMPORARY : 0) |
! 			 ((fileFlags & O_TEMPORARY) ? FILE_FLAG_DELETE_ON_CLOSE : 0)|
! 					((fileFlags & O_SYNC) ? FILE_FLAG_WRITE_THROUGH : 0),
  						NULL)) == INVALID_HANDLE_VALUE)
  	{
  		switch (GetLastError())
#46Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#45)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

For testing, I have built and uploaded binaries from the 8.0 stable
branch with this patch applied. They are available from
http://www.hagander.net/pgsql/. Install the 8.0.1 version first (from
MSI or manually, your choice), then replace postmaster.exe and
postgres.exe with the ones in the ZIP file. If you're running as a
service, make sure to stop the service first.

To make sure it uses the new code, change wal_sync_method to open_sync
in postgresql.conf and restart the service.

The kind of testing we need help with is "pulling the plug reliability
testing". For this, make sure you have write caching turned off (it's on
the disk's properties page in the Device Manager), run a bunch of
transactions on the db and then pull the plug of the machine in the
middle. It should come up with all acknowledged transactions still
applied, and all others not.

//Magnus

#47Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#46)
Re: win32 performance - fsync question

Portability, or rather the complete lack of it. Stuff that isn't in the
Single Unix Spec is a hard sell.

O_DIRECT is reasonably common among modern Unixen (it is supported by
Linux, FreeBSD, and probably a couple of the commercial variants like
AIX or IRIX); it should also be reasonably easy to check for support at
configure time. It's on my TODO list to take a gander at adding support
for O_DIRECT for WAL, I just haven't gotten around to it yet.

Let me know when you do, and if you need some pointers on the win32
parts of it :-) I'll happily leave the main changes alone.

//Magnus

#48Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#47)
Re: win32 performance - fsync question

One point that I no longer recall the reasoning behind is that xlog.c
doesn't think O_SYNC is a preferable default over fsync.

For larger (>8k) transactions O_SYNC|O_DIRECT is only good with the recent
pending patch to group WAL writes together. The fsync method gives the OS a
chance to do the grouping. (Of course it does not matter if you have small
tx < 8k WAL)

This would be true for fdatasync() but not for fsync(), I think.

On win32 (which started this discussion), fsync will sync the directory
entry as well, which will lead to *at least* two seeks on the disk.
Writing two blocks after each other to an O_SYNC opened file should give
exactly two seeks.

Of course, this only moves the breakpoint up to n blocks, where n > 2 (3
or 4 depending on how many seeks the filesystem will require).

//Magnus

#49Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Magnus Hagander (#48)
Re: win32 performance - fsync question

One point that I no longer recall the reasoning behind is that xlog.c
doesn't think O_SYNC is a preferable default over fsync.

For larger (>8k) transactions O_SYNC|O_DIRECT is only good with the recent
pending patch to group WAL writes together. The fsync method gives the OS a
chance to do the grouping. (Of course it does not matter if you have small
tx < 8k WAL)

This would be true for fdatasync() but not for fsync(), I think.

No, it is only worse with fsync, since that adds a mandatory seek.

On win32 (which started this discussion), fsync will sync the directory
entry as well, which will lead to *at least* two seeks on the disk.
Writing two blocks after each other to an O_SYNC opened file should give
exactly two seeks.

I think you are making the following untenable assumptions.
1. there is no other outstanding IO on that drive that the OS happily
inserts between your two 8k writes
2. the rotational delay is negligible
3. the per call overhead is negligible

You will at least wait until the heads reach the write position again,
since you will not be able to supply the next 8k in time for the drive to
continue writing (with the single backend large tx I was referring to).

If you doubt what I am saying do dd blocksize tests on a raw device.
The results are, that up to ~256kb blocksize you can increase the drive
performance on a drive that does not have a powerfailsafe cache, and
does not lie about write success.

Andreas
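
The block-size effect described above can be approximated with a very simple model. The model is an assumption of this sketch, not something measured in the thread: every synchronous write waits one full platter rotation before the heads are back at the write position, and then transfers its data at a fixed media rate. The function name and all parameters are made up for the example.

```c
#include <assert.h>

/*
 * Estimated throughput in bytes/sec when each synchronous write of
 * block_bytes costs one full rotation plus the raw transfer time.
 */
double
sync_write_throughput(double block_bytes, double rpm,
                      double media_bytes_per_sec)
{
    double rotation_sec;
    double transfer_sec;

    assert(block_bytes > 0.0 && rpm > 0.0 && media_bytes_per_sec > 0.0);

    rotation_sec = 60.0 / rpm;                        /* wait for the heads */
    transfer_sec = block_bytes / media_bytes_per_sec; /* actual writing */

    return block_bytes / (rotation_sec + transfer_sec);
}
```

Under this model an 8K block spends almost all of its time waiting for the rotation, so throughput keeps rising as the block grows and only levels off once the transfer time dominates. That is the qualitative shape a dd block-size test on a raw device would be expected to show; the exact crossover depends on the drive.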

#50Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Zeugswetter Andreas DAZ SD (#49)
Re: win32 performance - fsync question

On win32 (which started this discussion), fsync will sync the directory
entry as well, which will lead to *at least* two seeks on the disk.
Writing two blocks after each other to an O_SYNC opened file should give
exactly two seeks.

I think you are making the following untenable assumptions.
1. there is no other outstanding IO on that drive that the OS happily
inserts between your two 8k writes
2. the rotational delay is negligible
3. the per call overhead is negligible

You will at least wait until the heads reach the write position again,
since you will not be able to supply the next 8k in time for the drive to
continue writing (with the single backend large tx I was referring to).

If you doubt what I am saying do dd blocksize tests on a raw device.
The results are, that up to ~256kb blocksize you can increase the drive
performance on a drive that does not have a powerfailsafe cache, and
does not lie about write success.

On win32 with standard hardware, WAL O_SYNC gives about a 2-3x performance
improvement according to pgbench. This is in part because fsync() on win32
is the 'nuclear option', syncing metadata, which slows things down
considerably. Not sure about unix, but the win32 O_DIRECT equivalent
disables the read cache and also gives slightly faster write performance
(presumably from removing the overhead of the cache manager).

The other issue is high performance RAID controllers. With dedicated
memory and processor, a good RAID controller with a BBU might perform
significantly better with everything sent right to the controller, all
the time. On win32, fsync() bypasses the RAID write cache, killing the
performance gain from moving to a caching RAID controller.

Merlin

#51Magnus Hagander
mha@sollentuna.net
In reply to: Merlin Moncure (#50)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Magnus prepared a trivial patch which added the O_SYNC flag for
windows and mapped it to FILE_FLAG_WRITE_THROUGH in win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some
more reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn
off write cache). And it's certainly a significant
performance increase - it brings the speed almost up to the
same as linux.
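
A minimal sketch of the kind of mapping described above, with hypothetical names throughout. It is not the code from the patch; it only illustrates translating a POSIX-style O_SYNC bit into the Win32 FILE_FLAG_WRITE_THROUGH attribute, which CreateFile() accepts in its dwFlagsAndAttributes argument. SKETCH_O_SYNC is an arbitrary illustrative bit, and the write-through constant is redefined with the same numeric value as the Win32 one so the sketch compiles without windows.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical open() flag bit standing in for O_SYNC on a platform
 * whose C runtime does not define it. The value is arbitrary. */
#define SKETCH_O_SYNC 0x0080

/* Same numeric value as Win32's FILE_FLAG_WRITE_THROUGH; redefined here
 * only so the sketch builds without <windows.h>. */
#define SKETCH_FILE_FLAG_WRITE_THROUGH UINT32_C(0x80000000)

/*
 * Translate POSIX-style open flags into the attribute bits that would
 * be passed to CreateFile(). Only the O_SYNC case is handled here.
 */
static uint32_t
sketch_win32_attrs_from_open_flags(int open_flags)
{
    uint32_t    attrs = 0;

    assert(open_flags >= 0);
    if (open_flags & SKETCH_O_SYNC)
        attrs |= SKETCH_FILE_FLAG_WRITE_THROUGH;   /* writes go through the OS cache */
    return attrs;
}
```

Whether write-through also forces data past the drive's own cache is exactly what the pull-the-plug tests in this message try to establish; the flag mapping alone says nothing about it.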

I have now run a bunch of pull-the-plug testing on this patch (literally
pulling the plug, yes. to the point of some of my co-workers thinking
I'm crazy)

My results are:
First, baseline:
* Linux, with fsync (default), write-cache disabled: no data corruption
* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had
* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory

but the data in the database was consistent.

Almost all runs showed a line along these lines:
2005-02-24 11:22:41 LOG: record with zero length at 0/A450548

In the final test, the BIOS decided the disk was giving up and
reassigned it as 0Mb.. Required two extra cold boots, then it was back
up to 20Gb. Still no data loss.

My tests were three clients doing lots of inserts and updates, some in
transactions, some bare. In some tests, I kicked in a manual vacuum while
at it. Then I yanked the power cord, rebooted, manually started pg, and
verified that the data in the db came up with the same values the clients
reported as last committed. I also ran vacuum verbose on all tables
after it was back up to see if there were any warnings.

Test machine is a 1GHz Celeron, 256Mb RAM and a Maxtor IDE disk.

It'd of course be good if others could also test, but I'm getting the
feeling that this patch at least doesn't make things worse than before
:-) And it's *a lot* faster.

//Magnus

#52Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Magnus Hagander (#51)
Re: [pgsql-hackers-win32] win32 performance - fsync question

In the final test, the BIOS decided the disk was giving up and
reassigned it as 0Mb.. Required two extra cold boots, then it was back
up to 20Gb. Still no data loss.

I think it would be fun to re-run these tests with MySQL...

Chris

#53Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Magnus Hagander (#51)
Re: [pgsql-hackers-win32] win32 performance - fsync question

My results are:
Fisrt, baseline:
* Linux, with fsync (default), write-cache disabled: no data corruption
* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had
* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory

In case anyone is wondering, you can turn off write caching on FreeBSD,
for a terrible performance loss...

http://freebsd.active-venture.com/handbook/configtuning-disk.html#AEN8015

Chris

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#51)
Re: [pgsql-hackers-win32] win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

My results are:
Fisrt, baseline:
* Linux, with fsync (default), write-cache disabled: no data corruption
* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had

That makes sense.

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in the cases
where it theoretically should occur. Seems like this is an indication
that your test was insufficiently severe, or that there is something
going on we don't understand.

regards, tom lane

#55Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#54)
Re: [pgsql-hackers-win32] win32 performance - fsync question

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in
the cases where it theoretically should occur. Seems like
this is an indication that your test was insufficiently
severe, or that there is something going on we don't understand.

The Windows driver knows about the write cache, and at least fsync()
pushes through the write cache even if it's there. This seems to
indicate that O_SYNC at least partially does this as well. This is why
there is no performance difference at all on fsync() with write cache on
or off.

I don't know if this is true for all IDE disks. Could be that my disk is
particularly well-behaved.

//Magnus

#56Noname
pgsql@mohawksoft.com
In reply to: Tom Lane (#54)
Re: [pgsql-hackers-win32] win32 performance - fsync

"Magnus Hagander" <mha@sollentuna.net> writes:

My results are:
Fisrt, baseline:
* Linux, with fsync (default), write-cache disabled: no data corruption
* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had

That makes sense.

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in the cases
where it theoretically should occur. Seems like this is an indication
that your test was insufficiently severe, or that there is something
going on we don't understand.

I was thinking about that. A few years back, Microsoft had some serious
issues with write caching drives. They were taken to task for losing data
if Windows shut down too fast, especially on drives with a large cache.

MS is big enough and bad enough to get all the info they need from the
various drive makers to know how to handle write cache flushing. Even the
stuff that isn't documented.

If anyone has a very good debugger and/or emulator or even a logic
analyzer, it would be interesting to see if MS sends commands to the
drives after a disk write or a set of disk writes.

Also, I would like to see this test performed on NTFS and FAT32, and see
if you are more likely to lose data on FAT32.

#57Greg Stark
gsstark@mit.edu
In reply to: Magnus Hagander (#51)
Re: [pgsql-hackers-win32] win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had

Are you verifying that all the data that was committed was actually stored? Or
just verifying that the database works properly after rebooting?

I'm a bit surprised that the write-cache led to a corrupt database, and not
merely lost transactions. I had the impression that drives still handled the
writes in the order received.

You may find that if you check this case again that the "usually no data
corruption" is actually "usually lost transactions but no corruption".

--
greg

#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#57)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Greg Stark <gsstark@mit.edu> writes:

I'm a bit surprised that the write-cache lead to a corrupt database, and not
merely lost transactions. I had the impression that drives still handled the
writes in the order received.

There'd be little point in having a cache if they did, I should think.
I thought the point of the cache was to allow the disk to schedule I/O
in an order that minimizes seek time (ie, such a disk has got its own
elevator queue or similar).

You may find that if you check this case again that the "usually no data
corruption" is actually "usually lost transactions but no corruption".

That's a good point, but it seems difficult to be sure of the last
reportedly-committed transaction in a powerfail situation. Maybe if
you drive the test from a client on another machine?

regards, tom lane

#59Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#58)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Tom Lane <tgl@sss.pgh.pa.us> writes:

Greg Stark <gsstark@mit.edu> writes:

I'm a bit surprised that the write-cache lead to a corrupt database, and not
merely lost transactions. I had the impression that drives still handled the
writes in the order received.

There'd be little point in having a cache if they did, I should think.
I thought the point of the cache was to allow the disk to schedule I/O
in an order that minimizes seek time (ie, such a disk has got its own
elevator queue or similar).

If that were the case then SCSI drives that ship with write caching disabled
and using tagged command queuing instead would perform poorly.

I think the main motivation for write caching on IDE drives is that the IDE
protocol forces commands to be issued synchronously. So you can't send a
second command until the first command has completed. Without write caching
that limits the write bandwidth tremendously. Write caching is being used here
as a poor man's tcq.

--
greg

#60Magnus Hagander
mha@sollentuna.net
In reply to: Greg Stark (#59)
Re: [pgsql-hackers-win32] win32 performance - fsync question

You may find that if you check this case again that the "usually no data
corruption" is actually "usually lost transactions but no corruption".

That's a good point, but it seems difficult to be sure of the last
reportedly-committed transaction in a powerfail situation. Maybe if
you drive the test from a client on another machine?

FYI, that's what I did. Test client ran across the network to the
server, so it could output on the console which transaction was last
reported committed.

In a couple of cases, the server came up with a transaction the client
had *not* reported as committed. But I think that can be explained by
the commit message not reaching the client over the network before power
went out.
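
The three outcomes discussed here can be told apart with a small comparison. The sketch below is only an illustration under stated assumptions: each client numbers its transactions with an increasing counter, the client logs the last number for which it received a commit acknowledgement, and after recovery the highest number present in the database is read back. The names recovery_outcome and classify_recovery are hypothetical.

```c
#include <assert.h>

typedef enum
{
    RECOVERY_MATCH,             /* database agrees with the client's log */
    RECOVERY_LOST_COMMITS,      /* acknowledged commits are missing after restart */
    RECOVERY_UNREPORTED_COMMIT  /* database is ahead: the ack never reached the client */
} recovery_outcome;

/*
 * last_reported: highest transaction number the client saw acknowledged.
 * last_in_db:    highest transaction number found after crash recovery.
 */
static recovery_outcome
classify_recovery(long last_reported, long last_in_db)
{
    assert(last_reported >= 0 && last_in_db >= 0);

    if (last_in_db < last_reported)
        return RECOVERY_LOST_COMMITS;
    if (last_in_db > last_reported)
        return RECOVERY_UNREPORTED_COMMIT;
    return RECOVERY_MATCH;
}
```

Only RECOVERY_LOST_COMMITS indicates a durability problem. RECOVERY_UNREPORTED_COMMIT corresponds to the case explained above, where the commit message did not reach the client before power went out.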

//Magnus

#61Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#60)
Re: [pgsql-hackers-win32] win32 performance - fsync question

* Linux, with fsync (default), write-cache enabled: usually no data
corruption, but two runs which had

Are you verifying that all the data that was committed was
actually stored? Or
just verifying that the database works properly after rebooting?

I verified the data.

I'm a bit surprised that the write-cache lead to a corrupt
database, and not
merely lost transactions. I had the impression that drives
still handled the
writes in the order received.

In this case, it was lost transactions, not data corruption. I should have
been more careful: I had copy/pasted the "no data corruption" wording and
should have specified what was lost.

A couple of the latest transactions were gone, but the database came up
in a consistent state, if a bit old.

Since Linux wasn't the stuff I actually was testing, I didn't run very
many tests on it though.

//Magnus

#62Greg Stark
gsstark@mit.edu
In reply to: Magnus Hagander (#61)
Re: [pgsql-hackers-win32] win32 performance - fsync question

"Magnus Hagander" <mha@sollentuna.net> writes:

I'm a bit surprised that the write-cache lead to a corrupt database, and
not merely lost transactions. I had the impression that drives still
handled the writes in the order received.

In this case, it was lost transactions, not data corruption.
...
A couple of the latest transactions were gone, but the database came up
in a consistent state, if a bit old.

That's interesting. It would be very interesting to know how reliably this is
true. It could potentially vary depending on the drive firmware.

I can't see any painless way to package up this kind of test for people to run
though. Powercycling machines repeatedly really isn't fun and takes a long
time. And testing this on vmware doesn't buy us anything.

--
greg

#63Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Greg Stark (#62)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Are you verifying that all the data that was committed was actually stored? Or
just verifying that the database works properly after rebooting?

I verified the data.

Does pg startup increase the xid by some amount (say 1000 xids) after a crash?
Otherwise I think you would also need to roll back a range of xids after
the crash, to see whether you lose data by reusing and rolling back xids.

The risk is data pages reaching the disk before WAL, because the disk rearranges
writes. I think you would not notice such corruption (with pg_dump) unless you
do the range rollback.
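
The ordering rule at stake can be stated as a small check. This is a generic sketch of the write-ahead rule, not PostgreSQL's actual code; the function names and the use of plain 64-bit integers for log positions are assumptions made for illustration. A data page may only go to disk once the log is durable at least up to the position of the last record that modified that page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if writing the page cannot get ahead of the durable log. */
static bool
sketch_page_write_is_safe(uint64_t page_lsn, uint64_t wal_durable_upto)
{
    return page_lsn <= wal_durable_upto;
}

/*
 * Count pages whose on-disk image would be newer than the durable log,
 * i.e. the situation a reordering drive cache could produce.
 */
static size_t
sketch_count_unsafe_pages(const uint64_t *page_lsns, size_t n,
                          uint64_t wal_durable_upto)
{
    size_t      unsafe = 0;

    assert(page_lsns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!sketch_page_write_is_safe(page_lsns[i], wal_durable_upto))
            unsafe++;
    }
    return unsafe;
}
```

If a drive reports writes as complete and then persists them out of order, a page counted as unsafe here can survive a power failure while the log records describing it do not.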

Andreas

#64Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#45)
Re: win32 performance - fsync question

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

---------------------------------------------------------------------------

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

//Magnus

Content-Description: o_sync.patch

[ Attachment, skipping... ]


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#65Michael Paesold
mpaesold@gmx.at
In reply to: Bruce Momjian (#64)
Re: win32 performance - fsync question

Bruce Momjian wrote:

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

---------------------------------------------------------------

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

The original patch did not have any documentation. Have you added some?
Since this has to be configured in GUC (wal_sync_method), the implications
should be documented somewhere, no?

Best Regards,
Michael Paesold

#66Magnus Hagander
mha@sollentuna.net
In reply to: Michael Paesold (#65)
Re: [pgsql-hackers-win32] win32 performance - fsync question

I'd like to see this one also considered for 8.0.x, though I'd certainly
like to see some more testing as well. Perhaps it's suitable for the
"8.0.x with extended testing" that is planned for the ARC replacement
code?

It does make a huge difference on win32. While we definitely don't want
to risk data, a 60% speedup in write intensive apps is a *lot*.

//Magnus

-----Original Message-----
From: pgsql-hackers-win32-owner@postgresql.org
[mailto:pgsql-hackers-win32-owner@postgresql.org] On Behalf Of
Bruce Momjian
Sent: den 27 februari 2005 01:54
To: Magnus Hagander
Cc: Tom Lane; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [pgsql-hackers-win32] [HACKERS] win32 performance
- fsync question

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

---------------------------------------------------------------
------------

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

//Magnus


#67Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#66)
Re: win32 performance - fsync question

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

---------------------------------------------------------------

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

The original patch did not have any documentation. Have you added some?
Since this has to be configured in GUC (wal_sync_method), the implications
should be documented somewhere, no?

The patch just implements behaviour that was already documented (for
unix) on a new platform (win32). The documentation in general appears to
have very little information on what to pick there, though ;-)

//Magnus

#68Michael Paesold
mpaesold@gmx.at
In reply to: Magnus Hagander (#67)
Re: win32 performance - fsync question

Magnus Hagander wrote:

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

[snip]

Michael Paesold wrote:

The original patch did not have any documentation. Have you
added some? Since this has to be configured in GUC (wal_sync_method),
the implications should be documented somewhere, no?

The patch just implements behaviour that was already documented (for
unix) on a new platform (win32). The documentation in general appears to
have very little information on what to pick there, though ;-)

Reading your mails about the pull-the-plug tests, I see that at least with
write caching enabled, fsync is more secure on win32 than open_sync. I.e.
one should disable write caching for use with open_sync. Also open_sync
seems to perform much better. All that information would be nice to have in
the docs.

Best Regards,
Michael Paesold

#69Dave Page
dpage@vale-housing.co.uk
In reply to: Michael Paesold (#68)
Re: [pgsql-hackers-win32] win32 performance - fsync question

-----Original Message-----
From: pgsql-hackers-win32-owner@postgresql.org on behalf of Bruce Momjian
Sent: Sun 2/27/2005 12:54 AM
To: Magnus Hagander
Cc: Tom Lane; pgsql-hackers@postgresql.org; pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [pgsql-hackers-win32] [HACKERS] win32 performance - fsync question

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

I think it would be good to backpatch it given proper testing - the changes are relatively minor, and they do give a significant performance boost.

Regards, Dave

#70Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#66)
1 attachment(s)
Changing the default wal_sync_method to open_sync for Win32?

Magnus Hagander wrote:

I'd like to see this one also considered for 8.0.x, though I'd certainly
like to see some more testing as well. Perhaps it's suitable for the
"8.0.x with extended testing" that is planned for the ARC replacement
code?

It does make a huge difference on win32. While we definitely don't want
to risk data, a 60% speedup in write intensive apps is a *lot*.

While this patch has been applied to CVS HEAD, there are still two open
issues:

1. Should it be the default wal_sync_method for Win32?

Right now we do:

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
#define DEFAULT_SYNC_FLAGBIT OPEN_DATASYNC_FLAG
#else
#if defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD_STR "fdatasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
#define DEFAULT_SYNC_FLAGBIT 0
#else
#define DEFAULT_SYNC_METHOD_STR "fsync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
#define DEFAULT_SYNC_FLAGBIT 0
#endif   /* HAVE_FDATASYNC */
#endif   /* OPEN_DATASYNC_FLAG */

Basically we do open_datasync -> fdatasync -> fsync. This is
empirically what we found to be fastest on most operating systems, and
we default to the first one that exists on the operating system.

Notice we never default to open_sync. However, on Win32, Magnus got a
60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

2. Another question is what to do with 8.0.X? Do we backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.
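
The proposed selection order from item 1 can be sketched as a runtime function, which is easier to experiment with than the preprocessor chain. This is an illustration only: the real choice is made at compile time in xlog.c, and the function names and boolean parameters below are hypothetical stand-ins for WIN32, OPEN_DATASYNC_FLAG and HAVE_FDATASYNC.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Pick the default wal_sync_method name: Win32 first (the proposal),
 * then open_datasync -> fdatasync -> fsync, taking the first available.
 */
static const char *
sketch_default_sync_method(bool is_win32, bool have_open_datasync,
                           bool have_fdatasync)
{
    if (is_win32)
        return "open_sync";
    if (have_open_datasync)
        return "open_datasync";
    if (have_fdatasync)
        return "fdatasync";
    return "fsync";
}

/*
 * Both open_sync and open_datasync sync through a flag passed to open(),
 * which is why the chain uses SYNC_METHOD_OPEN for either of them.
 */
static bool
sketch_is_open_flag_method(const char *method)
{
    assert(method != NULL);
    return strncmp(method, "open_", 5) == 0;
}
```

Putting the Win32 test first keeps the existing order untouched for every other platform, which is what the proposal intends.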

---------------------------------------------------------------------------

//Magnus

-----Original Message-----
From: pgsql-hackers-win32-owner@postgresql.org
[mailto:pgsql-hackers-win32-owner@postgresql.org] On Behalf Of
Bruce Momjian
Sent: den 27 februari 2005 01:54
To: Magnus Hagander
Cc: Tom Lane; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [pgsql-hackers-win32] [HACKERS] win32 performance
- fsync question

Patch applied. Thanks.

I assume this is not appropriate for 8.0.X.

---------------------------------------------------------------
------------

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

Attached is this trivial patch. As Merlin says, it needs some more
reliability testing. But the numbers are at least reasonable - it
*seems* like it's doing the right thing (as long as you turn off write
cache). And it's certainly a significant performance increase - it
brings the speed almost up to the same as linux.

//Magnus

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/osync (text/plain)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.181
diff -c -c -r1.181 xlog.c
*** src/backend/access/transam/xlog.c	12 Feb 2005 23:53:37 -0000	1.181
--- src/backend/access/transam/xlog.c	17 Mar 2005 04:07:44 -0000
***************
*** 69,78 ****
  #endif
  #endif
  
  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
  #define DEFAULT_SYNC_METHOD		   SYNC_METHOD_OPEN
! #define DEFAULT_SYNC_FLAGBIT	   OPEN_DATASYNC_FLAG
  #else
  #if defined(HAVE_FDATASYNC)
  #define DEFAULT_SYNC_METHOD_STR   "fdatasync"
--- 69,83 ----
  #endif
  #endif
  
+ #if defined(WIN32)	/* Fastest on Win32 using FILE_FLAG_WRITE_THROUGH */
+ #define DEFAULT_SYNC_METHOD_STR    "open_sync"
+ #define DEFAULT_SYNC_METHOD		   SYNC_METHOD_OPEN
+ #define DEFAULT_SYNC_FLAGBIT	   OPEN_SYNC_FLAG
+ #else
  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
  #define DEFAULT_SYNC_METHOD		   SYNC_METHOD_OPEN
! #define DEFAULT_SYNC_FLAGBIT	   OPEN_SYNC_FLAG
  #else
  #if defined(HAVE_FDATASYNC)
  #define DEFAULT_SYNC_METHOD_STR   "fdatasync"
***************
*** 84,89 ****
--- 89,95 ----
  #define DEFAULT_SYNC_FLAGBIT	  0
  #endif
  #endif
+ #endif
  
  
  /* User-settable parameters */
#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#70)
Re: Changing the default wal_sync_method to open_sync for Win32?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Notice we never default to open_sync. However, on Win32, Magnus got a
60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

... and breaks open_datasync for all other platforms ...

regards, tom lane

#72Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Michael Paesold (#68)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Michael Paesold wrote:

Magnus Hagander wrote:

Magnus Hagander wrote:

Magnus prepared a trivial patch which added the O_SYNC flag
for windows and mapped it to FILE_FLAG_WRITE_THROUGH in
win32_open.c.

[snip]

Michael Paesold wrote:

The original patch did not have any documentation. Have you
added some? Since this has to be configured in GUC (wal_sync_method),
the implications should be documented somewhere, no?

The patch just implements behaviour that was already documented (for
unix) on a new platform (win32). The documentation in general appears to
have very little information on what to pick there, though ;-)

Reading your mails about the pull-the-plug tests, I see that at least with
write caching enabled, fsync is more secure on win32 than open_sync. I.e.
one should disable write caching for use with open_sync. Also open_sync
seems to perform much better. All that information would be nice to have in
the docs.

Michael, I am not sure why you come to the conclusion that open_sync
requires turning off the disk write cache. I saw nothing to indicate
that in the thread:

http://archives.postgresql.org/pgsql-hackers-win32/2005-02/msg00035.php

I read the following:

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010" (log file 0,
segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in
the cases where it theoretically should occur. Seems like
this is an indication that your test was insufficiently
severe, or that there is something going on we don't understand.

The Windows driver knows about the write cache, and at least fsync()
pushes through the write cache even if it's there. This seems to
indicate that O_SYNC at least partially does this as well. This is why
there is no performance difference at all on fsync() with write cache on
or off.

I don't know if this is true for all IDE disks. Could be that my disk is
particularly well-behaved.

This indicated to me that open_sync did not require any changes beyond
what our current fsync already needs.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#73Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#71)
1 attachment(s)
Re: Changing the default wal_sync_method to open_sync for

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Notice we never default to open_sync. However, on Win32, Magnus got a
60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

... and breaks open_datasync for all other platforms ...

Oh, fixed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/osync (text/plain)
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.181
diff -c -c -r1.181 xlog.c
*** src/backend/access/transam/xlog.c	12 Feb 2005 23:53:37 -0000	1.181
--- src/backend/access/transam/xlog.c	17 Mar 2005 04:31:32 -0000
***************
*** 69,74 ****
--- 69,79 ----
  #endif
  #endif
  
+ #if defined(WIN32)	/* Fastest on Win32 using FILE_FLAG_WRITE_THROUGH */
+ #define DEFAULT_SYNC_METHOD_STR    "open_sync"
+ #define DEFAULT_SYNC_METHOD		   SYNC_METHOD_OPEN
+ #define DEFAULT_SYNC_FLAGBIT	   OPEN_SYNC_FLAG
+ #else
  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
  #define DEFAULT_SYNC_METHOD		   SYNC_METHOD_OPEN
***************
*** 84,89 ****
--- 89,95 ----
  #define DEFAULT_SYNC_FLAGBIT	  0
  #endif
  #endif
+ #endif
  
  
  /* User-settable parameters */
#74Marc G. Fournier
scrappy@postgresql.org
In reply to: Bruce Momjian (#70)
Re: Changing the default wal_sync_method to open_sync for

On Wed, 16 Mar 2005, Bruce Momjian wrote:

Magnus Hagander wrote:

I'd like to see this one also considered for 8.0.x, though I'd certainly
like to see some more testing as well. Perhaps it's suitable for the
"8.0.x with extended testing" that is planned for the ARC replacement
code?

It does make a huge difference on win32. While we definitely don't want
to risk data, a 60% speedup in write intensive apps is a *lot*.

Notice we never default to open_sync. However, on Win32, Magnus got a
60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

Considering how stable an Operating System Windows *isn't*, I think the
first thing Magnus states very much goes against making this the default:
"While we definitely don't want to risk data..." ...

Setting something like this that increases the risk to data should never
be 'the default behaviour', but a conscious decision on the part of the
administrator of the individual system ... and even then, with a good
skull-n-cross bones warning around it so that they understand the risks
...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

#75Michael Paesold
mpaesold@gmx.at
In reply to: Bruce Momjian (#72)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Bruce Momjian wrote:

Michael Paesold wrote:

Magnus Hagander wrote:

[snip]

Michael, I am not sure why you come to the conclusion that open_sync
requires turning off the disk write cache. I saw nothing to indicate
that in the thread:

I was just seeing his error message below...

http://archives.postgresql.org/pgsql-hackers-win32/2005-02/msg00035.php

I read the following:

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data corruption. Once I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010"

(log file

0, segment 16): No such file or directory
but the data in the database was consistent.

A missing xlog file does not strike me as "very safe". Perhaps someone can
explain what happened, but I would not feel good about this. Again this note
(from Tom Lane) in combination with the above error would tell me, we don't
fully understand the risk here.

It disturbs me that you couldn't produce data corruption in
the cases where it theoretically should occur. Seems like
this is an indication that your test was insufficiently
severe, or that there is something going on we don't understand.

The Windows driver knows about the write cache, and at least fsync()
pushes through the write cache even if it's there. This seems to
indicate that O_SYNC at least partially does this as well. This is why
there is no performance difference at all on fsync() with write cache on
or off.

I don't know if this is true for all IDE disks. Could be that my disk is
particularly well-behaved.

This indicated to me that open_sync did not require any additional
changes than our current fsync.

We both based our understanding on the same evidence. It seems we just have
a different level of paranoia. ;-)

Best Regards,
Michael Paesold

#76Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#71)
Re: Changing the default wal_sync_method to open_sync for Win32?

I'd like to see this one also considered for 8.0.x, though I'd
certainly like to see some more testing as well. Perhaps it's
suitable for the "8.0.x with extended testing" that is planned for
the ARC replacement code?

It does make a huge difference on win32. While we definitely don't
want to risk data, a 60% speedup in write intensive apps

is a *lot*.

Notice we never default to open_sync. However, on Win32,

Magnus got a

60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

Considering how stable an Operating System Windows *isn't*, I

The difference has nothing to do with the stability of the OS. It has to
do with whether we ignore the *hardware* write cache or not. Both methods
ignore the OS write cache.

think the first thing Magnus states very much goes against
making this the default:
"While we definitely don't want to risk data..." ...

Setting something like this that increases the risk to data
should never be 'the default behaviour', but a conscious
decision on the part of the administrator of the individual
system ... and even then, with a good skull-n-cross bones
warning around it so that they understand the risks ...

The same level of "risk due to write cache" exists on all other
operating systems already. (Actually, I only know it exists on linux,
but it sure looks like it exists on most others looking at performance
figures)

//Magnus

#77Magnus Hagander
mha@sollentuna.net
In reply to: Michael Paesold (#75)
Re: [pgsql-hackers-win32] win32 performance - fsync question

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data

corruption. Once

I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010"

(log file

0, segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in the
cases where it theoretically should occur. Seems like this is an
indication that your test was insufficiently severe, or

that there

is something going on we don't understand.

The Windows driver knows about the write cache, and at

least fsync()

pushes through the write cache even if it's there. This seems to
indicate that O_SYNC at least partially does this as well. This is
why there is no performance difference at all on fsync() with write
cache on or off.

I don't know if this is true for all IDE disks. Could be

that my disk

is particularly well-behaved.

This indicated to me that open_sync did not require any
additional changes than our current fsync.

fsync and open_sync both write through the write cache in the operating
system. Only fsync=off turns this off.

fsync also writes through the hardware write cache. o_sync does not.
This is what causes the large slowdown with write cache enabled,
*including* most battery backed write cache systems (pretty much making
the write-cache a waste of money). This may be a good thing on IDE
systems (for admins that don't know how to remove the little check in
the box for "enable write caching on the disk" that MS provides, which
*explicitly* warns that you may lose data if you enabled it), but it's a
very bad thing for anything higher end.

fsync also syncs the directory metadata. o_sync only cares about the
files contents. (This is what causes the large slowdown with write cache
*disabled*, because it requires multiple writes on multiple disk
locations for each fsync).

Basically, fsync hurts people who configure their box correctly, or who
use things like SCSI disks. o_sync hurts people who configure their
machine in an unsafe way.

//Magnus
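
For readers who want to see the two styles side by side, here is a minimal POSIX sketch of the distinction described above: an ordinary write followed by an explicit flush call, versus a file opened with a synchronous-write flag. The function names and file handling are illustrative assumptions, not PostgreSQL code. On Win32 the thread maps the first style to _commit() and the second to FILE_FLAG_WRITE_THROUGH. The sketch cannot show whether the drive's hardware write cache is bypassed; that depends on the OS, the driver, and the disk.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for POSIX declarations (fsync, O_DSYNC) */
#endif

#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption for portability: fall back to O_SYNC where O_DSYNC is missing. */
#ifndef O_DSYNC
#define O_DSYNC O_SYNC
#endif

/* Write the whole buffer; returns 0 on success, -1 on any error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Shared body: open with an optional extra flag, write, optionally fsync. */
static int write_file(const char *path, const char *buf, size_t len,
                      int extra_flags, int call_fsync)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = write_all(fd, buf, len);
    if (rc == 0 && call_fsync && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* "fsync" style: buffered write, then an explicit flush request. */
int write_then_fsync(const char *path, const char *buf, size_t len)
{
    return write_file(path, buf, len, 0, 1);
}

/* "open_datasync" style: the open flag makes each write() synchronous. */
int write_open_dsync(const char *path, const char *buf, size_t len)
{
    return write_file(path, buf, len, O_DSYNC, 0);
}
```

Both functions return 0 on success, so timing a loop of calls to each one on the same disk is one rough way to compare the two methods on a given machine.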

#78Dave Page
dpage@vale-housing.co.uk
In reply to: Magnus Hagander (#76)
Re: Changing the default wal_sync_method to open_sync for Win32?

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
Sent: 17 March 2005 04:20
To: Magnus Hagander
Cc: Tom Lane; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: [HACKERS] Changing the default wal_sync_method to
open_sync for Win32?

1. Should it be the default wal_sync_method for Win32?

Yes.

2. Another question is what to do with 8.0.X? Do we
backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.

This question was asked earlier, and iirc, a few people said yes, and
no-one said no. I'm most definitely in the yes camp.

Regards, Dave.

#79Dave Page
dpage@vale-housing.co.uk
In reply to: Marc G. Fournier (#74)
Re: Changing the default wal_sync_method to open_sync for

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Marc
G. Fournier
Sent: 17 March 2005 05:32
To: Bruce Momjian
Cc: Magnus Hagander; Tom Lane; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [HACKERS] Changing the default wal_sync_method
to open_sync for

Considering how stable an Operating System Windows *isn't*

And the last time you ran a 2K/2K3 box was?....

No offence, but my 15 or so servers in that category have had less
downtime in total over the last 12 months than a certain cluster of
FreeBSD boxes I can think of, excluding network provider related
problems of course.

Regards, Dave.

#80Kenneth Marshall
ktm@it.is.rice.edu
In reply to: Bruce Momjian (#70)
Re: Changing the default wal_sync_method to open_sync for Win32?

On Wed, Mar 16, 2005 at 11:20:12PM -0500, Bruce Momjian wrote:

Basically we do open_datasync -> fdatasync -> fsync. This is
empirically what we found to be fastest on most operating systems, and
we default to the first one that exists on the operating system.

Notice we never default to open_sync. However, on Win32, Magnus got a
60% speedup by using open_sync, implemented using
FILE_FLAG_WRITE_THROUGH. Now, because this is the fastest on Win32, I
think we should default to open_sync on Win32. The attached patch
implements this.

2. Another question is what to do with 8.0.X? Do we backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.

My preference would be to back-patch to 8.0.1. I have some projects
where the performance difference will decide whether or not they go
with MSSQL or PostgreSQL.

Ken
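
The preference order quoted above (open_datasync, then fdatasync, then fsync, taking the first one the platform offers) can be sketched as a compile-time cascade. This is only an illustration: the SKETCH_ macro and the function name are made up here, and HAVE_FDATASYNC is assumed to come from a configure-style probe rather than from any system header. The O_DSYNC != O_SYNC test follows the same idea as the xlog.c patch in this thread, which only treats open_datasync as a separate method when the two flags differ.

```c
#include <fcntl.h>

/*
 * Pick the default at compile time: the first method in the preference
 * order that this platform provides.
 */
#if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
#define SKETCH_DEFAULT_SYNC_METHOD "open_datasync"
#elif defined(HAVE_FDATASYNC)       /* assumed configure-style macro */
#define SKETCH_DEFAULT_SYNC_METHOD "fdatasync"
#else
#define SKETCH_DEFAULT_SYNC_METHOD "fsync"
#endif

/* Hypothetical accessor so callers do not depend on the macro directly. */
const char *default_sync_method(void)
{
    return SKETCH_DEFAULT_SYNC_METHOD;
}
```

Which branch is taken depends entirely on the build platform, which is why the default "varies across platforms" in postgresql.conf.sample.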

#81Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Dave Page (#78)
Re: Changing the default wal_sync_method to open_sync for

Dave Page wrote:

2. Another question is what to do with 8.0.X? Do we
backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.

This question was asked earlier, and iirc, a few people said yes, and
no-one said no. I'm most definitely in the yes camp.

I have backpatched O_SYNC for Win32 to 8.0.X. Everyone seems to agree
it should be supported by wal_sync_method. --- the "default" issue
still needs discussion.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#82Marc G. Fournier
scrappy@postgresql.org
In reply to: Dave Page (#79)
Re: Changing the default wal_sync_method to open_sync for

On Thu, 17 Mar 2005, Dave Page wrote:

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Marc
G. Fournier
Sent: 17 March 2005 05:32
To: Bruce Momjian
Cc: Magnus Hagander; Tom Lane; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [HACKERS] Changing the default wal_sync_method
to open_sync for

Considering how stable an Operating System Windows *isn't*

And the last time you ran a 2K/2K3 box was?....

Actually, I will give Microsoft credit for their work on XP, I have been
impressed with it ...

Personally, I don't like the idea of making 'less data security' the
default on any platform ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

#83Marc G. Fournier
scrappy@postgresql.org
In reply to: Bruce Momjian (#81)
Re: Changing the default wal_sync_method to open_sync for

On Thu, 17 Mar 2005, Bruce Momjian wrote:

Dave Page wrote:

2. Another question is what to do with 8.0.X? Do we
backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.

This question was asked earlier, and iirc, a few people said yes, and
no-one said no. I'm most definitely in the yes camp.

I have backpatched O_SYNC for Win32 to 8.0.X. Everyone seems to agree
it should be supported by wal_sync_method. --- the "default" issue
still needs discussion.

Even with Magnus' explanation that we're talking Hardware, and not OS risk
issues, I still think that the default should be the "least risky", with
the other options being well explained from both a risk/performance
standpoint, so that it's a conscious decision on the admin's side ...

Any 'risk of data loss' has always been taboo, making the default
behaviour be to increase that risk seems to be a step backwards to me ..
having the option, fine ... effectively forcing that option is what I'm
against (and, by forcing, I mean how many ppl "change from the default"?)

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

#84Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#77)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Magnus Hagander wrote:

This indicated to me that open_sync did not require any
additional changes than our current fsync.

fsync and open_sync both write through the write cache in the operating
system. Only fsync=off turns this off.

fsync also writes through the hardware write cache. o_sync does not.
This is what causes the large slowdown with write cache enabled,
*including* most battery backed write cache systems (pretty much making
the write-cache a waste of money). This may be a good thing on IDE
systems (for admins that don't know how to remove the little check in
the box for "enable write caching on the disk" that MS provides, which
*explicitly* warns that you may lose data if you enabled it), but it's a
very bad thing for anything higher end.

I found the checkbox on XP looking at "Properties" for the drive, then
choosing "Hardware", the drive, "Properties", and "Policies".

fsync also syncs the directory metadata. o_sync only cares about the
files contents. (This is what causes the large slowdown with write cache
*disabled*, because it requires multiple writes on multiple disk
locations for each fsync).

Basically, fsync hurts people who configure their box correctly, or who
use things like SCSI disks. o_sync hurts people who configure their
machine in an unsafe way.

So, it seems that Win32 open_sync is exactly the same as our
"wal_sync_method = open_datasync" on Unix (it needs to be renamed), and
"wal_sync_method = fsync" on Win32 is something we don't have that
writes through the disk write cache even if it is enabled.

I have developed the following patch which renames our wal_sync_method
Win32 support from open_sync to open_datasync:

ftp://candle.pha.pa.us/pub/postgresql/mypatches

One issue with this patch is that if applied it would make open_datasync
the default sync method on Win32 because we prefer open_datasync over
all other sync methods. If we don't want to do that, I think we should
still do the rename for accuracy and add a !WIN32 test to prevent
open_datasync from being the default.

However, I do prefer this patch and let Win32 have the same write cache
issues as Unix, for consistency.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#84)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, I do prefer this patch and let Win32 have the same write cache
issues as Unix, for consistency.

I agree that the open flag is more nearly O_DSYNC than O_SYNC.

ISTM Windows' idea of fsync is quite different from Unix's and therefore
we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such? We already have precedent
that not all wal_sync_method values are available on all platforms.

I'm not taking a position on which the default should be ...

regards, tom lane

#86Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Marc G. Fournier (#83)
Re: Changing the default wal_sync_method to open_sync for

Marc G. Fournier wrote:

On Thu, 17 Mar 2005, Bruce Momjian wrote:

Dave Page wrote:

2. Another question is what to do with 8.0.X? Do we
backpatch this for
Win32 performance? Can we test it enough to know it will work well?
8.0.2 is going to have a more rigorous testing cycle because of the
buffer manager changes.

This question was asked earlier, and iirc, a few people said yes, and
no-one said no. I'm most definitely in the yes camp.

I have backpatched O_SYNC for Win32 to 8.0.X. Everyone seems to agree
it should be supported by wal_sync_method. --- the "default" issue
still needs discussion.

Even with Magnus' explanation that we're talking Hardware, and not OS risk
issues, I still think that the default should be the "least risky", with
the other options being well explained from both a risk/performance
standpoint, so that it's a conscious decision on the admin's side ...

Any 'risk of data loss' has always been taboo, making the default
behaviour be to increase that risk seems to be a step backwards to me ..
having the option, fine ... effectively forcing that option is what I'm
against (and, by forcing, I mean how many ppl "change from the default"?)

I understand that logic. Please see my posting that their fsync is
something we don't have on Unix.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#87Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#85)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, I do prefer this patch and let Win32 have the same write cache
issues as Unix, for consistency.

I agree that the open flag is more nearly O_DSYNC than O_SYNC.

ISTM Windows' idea of fsync is quite different from Unix's and therefore
we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such? We already have precedent
that not all wal_sync_method values are available on all platforms.

I'm not taking a position on which the default should be ...

Yes, I am thinking that too. I hesitated because it adds yet another
sync method, and we have to document it works only on Win32, but I see
no better solution.

I am going to let the Win32 users mostly vote on what the default should
be.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#88Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#87)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such? We already have precedent
that not all wal_sync_method values are available on all platforms.

Yes, I am thinking that too. I hesitated because it adds yet another
sync method, and we have to document it works only on Win32, but I see
no better solution.

It occurs to me that it'd probably be a good idea if the error message
for an unsupported wal_sync_method value explicitly listed the allowed
values for the platform. If there's no objection I'll try to make
that happen. (I'm not sure if it's trivial or not: I think the GUC
framework is a bit restrictive about custom error messages from GUC
assign hooks...)

regards, tom lane
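
As a rough sketch of the error-message idea above, the list of accepted wal_sync_method values could be built from the same platform macros that decide which methods exist, then reused both for validation and for the hint text. Everything here is hypothetical: the function names are invented for illustration, HAVE_FDATASYNC is assumed to be a configure-style macro, FSYNC_IS_WRITE_THROUGH is borrowed from the patch later in this thread, and plain strcmp() stands in for the case-insensitive comparison the real code uses. It does not touch the GUC framework at all.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Methods accepted on this platform, chosen at compile time. */
static const char *const sync_methods[] = {
#ifdef FSYNC_IS_WRITE_THROUGH       /* Win32: fsync() is really _commit() */
    "fsync_writethrough",
#else
    "fsync",
#endif
#ifdef HAVE_FDATASYNC               /* assumed configure-style macro */
    "fdatasync",
#endif
#ifdef O_SYNC
    "open_sync",
#endif
#ifdef O_DSYNC
    "open_datasync",
#endif
};

#define N_SYNC_METHODS (sizeof(sync_methods) / sizeof(sync_methods[0]))

/* Returns 1 if the name is accepted on this platform, 0 otherwise. */
int sync_method_supported(const char *name)
{
    size_t i;

    for (i = 0; i < N_SYNC_METHODS; i++)
        if (strcmp(name, sync_methods[i]) == 0)
            return 1;
    return 0;
}

/* Returns a comma-separated list of accepted names for an error hint. */
const char *sync_method_hint(void)
{
    static char hint[128];
    size_t used = 0;
    size_t i;

    hint[0] = '\0';
    for (i = 0; i < N_SYNC_METHODS; i++)
    {
        int n = snprintf(hint + used, sizeof(hint) - used, "%s%s",
                         i > 0 ? ", " : "", sync_methods[i]);

        if (n < 0 || (size_t) n >= sizeof(hint) - used)
            break;                  /* stop rather than overrun the buffer */
        used += (size_t) n;
    }
    return hint;
}
```

An assign hook could then reject an unknown value and report sync_method_hint() alongside the error, so the message always matches what the build actually supports.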

#89Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#88)
Re: [pgsql-hackers-win32] win32 performance - fsync question

The default should clearly be the safest method.

Personally, I would disable anything but the safest method for all
database files that are not read-only.

IMO-YMMV.
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
Sent: Thursday, March 17, 2005 10:53 AM
To: Tom Lane
Cc: Magnus Hagander; Michael Paesold; pgsql-hackers@postgresql.org;
pgsql-hackers-win32@postgresql.org; Merlin Moncure
Subject: Re: [pgsql-hackers-win32] [HACKERS] win32 performance - fsync
question

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

However, I do prefer this patch and let Win32 have the same write

cache

issues as Unix, for consistency.

I agree that the open flag is more nearly O_DSYNC than O_SYNC.

ISTM Windows' idea of fsync is quite different from Unix's and

therefore

we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such? We already have precedent
that not all wal_sync_method values are available on all platforms.

I'm not taking a position on which the default should be ...

Yes, I am thinking that too. I hesitated because it adds yet another
sync method, and we have to document it works only on Win32, but I see
no better solution.

I am going to let the Win32 users mostly vote on what the default should
be.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania
19073


#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dann Corbit (#89)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

ISTM Windows' idea of fsync is quite different from Unix's and therefore
we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such?

Ah, I remember now. On Win32 our fsync is:
#define fsync(a) _commit(a)
I am wondering if we should call the new mode open_commit or
open_writethrough. Our typical rule is to tie it to the API call, which
should suggest open_commit.

fsync_writethrough, perhaps. I don't see any "open" about it.

regards, tom lane

#91Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#90)
Re: [pgsql-hackers-win32] win32 performance - fsync question

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

ISTM Windows' idea of fsync is quite different from Unix's and therefore
we should name the wal_sync_method that invokes it something different
than fsync. "write_through" or some such?

Ah, I remember now. On Win32 our fsync is:
#define fsync(a) _commit(a)
I am wondering if we should call the new mode open_commit or
open_writethrough. Our typical rule is to tie it to the API call, which
should suggest open_commit.

fsync_writethrough, perhaps. I don't see any "open" about it.

Sorry, yeah, got confused.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#92Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Marc G. Fournier (#83)
Re: Changing the default wal_sync_method to open_sync for

Even with Magnus' explanation that we're talking Hardware, and not OS
risk issues, I still think that the default should be the "least risky",
with the other options being well explained from both a risk/performance
standpoint, so that it's a conscious decision on the admin's side ...

Any 'risk of data loss' has always been taboo, making the default
behaviour be to increase that risk seems to be a step backwards to me ..
having the option, fine ... effectively forcing that option is what I'm
against (and, by forcing, I mean how many ppl "change from the default"?)

But doesn't making it the default just make it identical to the default
freebsd configuration? ie. Identical risk?

Chris

#93Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Magnus Hagander (#77)
1 attachment(s)
Re: [HACKERS] win32 performance - fsync question

I have applied the following patch to CVS HEAD and 8.0.X that changes
the Win32 O_SYNC flag to O_DSYNC, because this is the actual behavior of
the flag. This is now the default wal fsync method on Win32 because we
prefer O_DSYNC to fsync().

And second, it changes Win32 fsync to a new wal sync method called
fsync_writethrough which is the old Win32 fsync behavior, which uses
_commit().

---------------------------------------------------------------------------

Magnus Hagander wrote:

* Win32, with fsync, write-cache disabled: no data corruption
* Win32, with fsync, write-cache enabled: no data corruption
* Win32, with osync, write cache disabled: no data corruption
* Win32, with osync, write cache enabled: no data

corruption. Once

I
got:
2005-02-24 12:19:54 LOG: could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/000000010000000000000010"

(log file

0, segment 16): No such file or directory
but the data in the database was consistent.

It disturbs me that you couldn't produce data corruption in the
cases where it theoretically should occur. Seems like this is an
indication that your test was insufficiently severe, or

that there

is something going on we don't understand.

The Windows driver knows about the write cache, and at

least fsync()

pushes through the write cache even if it's there. This seems to
indicate that O_SYNC at least partially does this as well. This is
why there is no performance difference at all on fsync() with write
cache on or off.

I don't know if this is true for all IDE disks. Could be

that my disk

is particularly well-behaved.

This indicated to me that open_sync did not require any
additional changes than our current fsync.

fsync and open_sync both write through the write cache in the operating
system. Only fsync=off turns this off.

fsync also writes through the hardware write cache. o_sync does not.
This is what causes the large slowdown with write cache enabled,
*including* most battery backed write cache systems (pretty much making
the write-cache a waste of money). This may be a good thing on IDE
systems (for admins that don't know how to remove the little check in
the box for "enable write caching on the disk" that MS provides, which
*explicitly* warns that you may lose data if you enabled it), but it's a
very bad thing for anything higher end.

fsync also syncs the directory metadata. o_sync only cares about the
files contents. (This is what causes the large slowdown with write cache
*disabled*, because it requires multiple writes on multiple disk
locations for each fsync).

Basically, fsync hurts people who configure their box correctly, or who
use things like SCSI disks. o_sync hurts people who configure their
machine in an unsafe way.

//Magnus


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/pgpatches/osync (text/plain)
Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.310
diff -c -c -r1.310 runtime.sgml
*** doc/src/sgml/runtime.sgml	19 Mar 2005 23:27:04 -0000	1.310
--- doc/src/sgml/runtime.sgml	24 Mar 2005 04:27:11 -0000
***************
*** 1587,1592 ****
--- 1587,1593 ----
          values are
          <literal>fsync</> (call <function>fsync()</> at each commit),
          <literal>fdatasync</> (call <function>fdatasync()</> at each commit),
+         <literal>fsync_writethrough</> (call <function>_commit()</> at each commit on Windows),
          <literal>open_sync</> (write WAL files with <function>open()</> option <symbol>O_SYNC</>), and
          <literal>open_datasync</> (write WAL files with <function>open()</> option <symbol>O_DSYNC</>).
          Not all of these choices are available on all platforms.
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.181
diff -c -c -r1.181 xlog.c
*** src/backend/access/transam/xlog.c	12 Feb 2005 23:53:37 -0000	1.181
--- src/backend/access/transam/xlog.c	24 Mar 2005 04:27:15 -0000
***************
*** 63,70 ****
  #endif
  #endif
  
  #if defined(OPEN_SYNC_FLAG)
! #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
  #define OPEN_DATASYNC_FLAG	  O_DSYNC
  #endif
  #endif
--- 63,75 ----
  #endif
  #endif
  
+ #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG	  O_DSYNC
! #endif
! #else /* !defined(OPEN_SYNC_FLAG) */
! /* Win32 only has O_DSYNC */
  #define OPEN_DATASYNC_FLAG	  O_DSYNC
  #endif
  #endif
***************
*** 79,85 ****
--- 84,94 ----
  #define DEFAULT_SYNC_METHOD		  SYNC_METHOD_FDATASYNC
  #define DEFAULT_SYNC_FLAGBIT	  0
  #else
+ #ifndef FSYNC_IS_WRITE_THROUGH
  #define DEFAULT_SYNC_METHOD_STR   "fsync"
+ #else
+ #define DEFAULT_SYNC_METHOD_STR   "fsync_writethrough"
+ #endif
  #define DEFAULT_SYNC_METHOD		  SYNC_METHOD_FSYNC
  #define DEFAULT_SYNC_FLAGBIT	  0
  #endif
***************
*** 5154,5160 ****
--- 5163,5174 ----
  	int			new_sync_method;
  	int			new_sync_bit;
  
+ #ifndef FSYNC_IS_WRITE_THROUGH
  	if (pg_strcasecmp(method, "fsync") == 0)
+ #else
+ 	/* Win32 fsync() == _commit(), which writes through a write cache */
+ 	if (pg_strcasecmp(method, "fsync_writethrough") == 0)
+ #endif
  	{
  		new_sync_method = SYNC_METHOD_FSYNC;
  		new_sync_bit = 0;
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.137
diff -c -c -r1.137 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample	19 Mar 2005 23:27:07 -0000	1.137
--- src/backend/utils/misc/postgresql.conf.sample	24 Mar 2005 04:27:18 -0000
***************
*** 114,120 ****
  
  #fsync = true			# turns forced synchronization on or off
  #wal_sync_method = fsync	# the default varies across platforms:
! 				# fsync, fdatasync, open_sync, or open_datasync
  #wal_buffers = 8		# min 4, 8KB each
  #commit_delay = 0		# range 0-100000, in microseconds
  #commit_siblings = 5		# range 1-1000
--- 114,121 ----
  
  #fsync = true			# turns forced synchronization on or off
  #wal_sync_method = fsync	# the default varies across platforms:
! 				# fsync, fdatasync, fsync_writethrough,
! 				# open_sync, open_datasync
  #wal_buffers = 8		# min 4, 8KB each
  #commit_delay = 0		# range 0-100000, in microseconds
  #commit_siblings = 5		# range 1-1000
Index: src/include/port/win32.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/port/win32.h,v
retrieving revision 1.43
diff -c -c -r1.43 win32.h
*** src/include/port/win32.h	27 Feb 2005 00:53:29 -0000	1.43
--- src/include/port/win32.h	24 Mar 2005 04:27:19 -0000
***************
*** 17,22 ****
--- 17,23 ----
  
  
  #define fsync(a)	_commit(a)
+ #define FSYNC_IS_WRITE_THROUGH
  #define ftruncate(a,b)	chsize(a,b)
  
  #define USES_WINSOCK
***************
*** 189,195 ****
   * to ensure that we don't collide with a future definition. It means
   * we cannot use _O_NOINHERIT ourselves.
   */
! #define O_SYNC 0x0080
  
  /*
   * Supplement to <errno.h>.
--- 190,196 ----
   * to ensure that we don't collide with a future definition. It means
   * we cannot use _O_NOINHERIT ourselves.
   */
! #define O_DSYNC 0x0080
  
  /*
   * Supplement to <errno.h>.
Index: src/port/open.c
===================================================================
RCS file: /cvsroot/pgsql/src/port/open.c,v
retrieving revision 1.8
diff -c -c -r1.8 open.c
*** src/port/open.c	27 Feb 2005 00:53:29 -0000	1.8
--- src/port/open.c	24 Mar 2005 04:27:19 -0000
***************
*** 63,69 ****
  	/* Check that we can handle the request */
  	assert((fileFlags & ((O_RDONLY | O_WRONLY | O_RDWR) | O_APPEND |
  						 (O_RANDOM | O_SEQUENTIAL | O_TEMPORARY) |
! 						 _O_SHORT_LIVED | O_SYNC |
  	  (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) == fileFlags);
  
  	sa.nLength = sizeof(sa);
--- 63,69 ----
  	/* Check that we can handle the request */
  	assert((fileFlags & ((O_RDONLY | O_WRONLY | O_RDWR) | O_APPEND |
  						 (O_RANDOM | O_SEQUENTIAL | O_TEMPORARY) |
! 						 _O_SHORT_LIVED | O_DSYNC |
  	  (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) == fileFlags);
  
  	sa.nLength = sizeof(sa);
***************
*** 83,89 ****
  		   ((fileFlags & O_SEQUENTIAL) ? FILE_FLAG_SEQUENTIAL_SCAN : 0) |
  		  ((fileFlags & _O_SHORT_LIVED) ? FILE_ATTRIBUTE_TEMPORARY : 0) |
  			 ((fileFlags & O_TEMPORARY) ? FILE_FLAG_DELETE_ON_CLOSE : 0)|
! 					((fileFlags & O_SYNC) ? FILE_FLAG_WRITE_THROUGH : 0),
  						NULL)) == INVALID_HANDLE_VALUE)
  	{
  		switch (GetLastError())
--- 83,89 ----
  		   ((fileFlags & O_SEQUENTIAL) ? FILE_FLAG_SEQUENTIAL_SCAN : 0) |
  		  ((fileFlags & _O_SHORT_LIVED) ? FILE_ATTRIBUTE_TEMPORARY : 0) |
  			 ((fileFlags & O_TEMPORARY) ? FILE_FLAG_DELETE_ON_CLOSE : 0)|
! 					((fileFlags & O_DSYNC) ? FILE_FLAG_WRITE_THROUGH : 0),
  						NULL)) == INVALID_HANDLE_VALUE)
  	{
  		switch (GetLastError())