AW: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space
This seems odd. As near as I can tell, O_SYNC is simply a command to do
fsync implicitly during each write call. It cannot save any I/O unless
I'm missing something significant. Where is the performance difference
coming from?
Yes, odd, but it's very reproducible here.
I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
AIX has O_DSYNC (which is _FDATASYNC) too, but I assumed O_SYNC
would be more portable. Now that we have two, maybe it is more widespread
than I thought.
I attach my modified version of Andreas' program. Note I do
not believe his assertion that close() implies fsync() --- on the machines I've
used, it demonstrably does not sync.
Ok, I am not sure, but do we essentially need it to sync? The OS surely isn't
supposed to notice only after closing the file that it ran out of disk space.
Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
I attach my modified version of Andreas' program. Note I do not
believe his assertion that close() implies fsync() --- on the
machines I've used, it demonstrably does not sync.
Ok, I am not sure, but do we essentially need it to sync? The OS surely isn't
supposed to notice only after closing the file that it ran out of disk space.
I believe that out-of-space would be reported during the writes, anyway,
so that's not the issue.
The point of fsync'ing after the prewrite is to ensure that the indirect
blocks are down on disk. If you trust fdatasync (or O_DSYNC) to write
indirect blocks then it's not necessary --- but I'm pretty sure I heard
somewhere that some versions of fdatasync fail to guarantee that.
In any case, the real point of the prewrite is to move work out of the
transaction commit path, and so we're better off if we can sync the
indirect blocks during prewrite.
I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
AIX has O_DSYNC (which is _FDATASYNC) too, but I assumed O_SYNC
Oh? What speeds do you get if you use that?
regards, tom lane
I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
(defined to do the equivalent of fdatasync()), and got truly
fascinating results. Apparently, on this platform these flags change
the kernel's buffering behavior! Observe:
Solaris 2.6 fascinates even more!!!
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out
real 0m21.40s
user 0m0.02s
sys 0m0.60s
bash-2.02# gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
bash-2.02# time a.out
real 0m4.242s
user 0m0.000s
sys 0m0.450s
It's hard to believe... Writing with DSYNC takes the same time as
file initialization - ~2 sec.
Also, there is no difference if using 64k blocks.
INIT_WRITE + OSYNC gives 52 sec for 8k blocks and 5.7 sec
for 256k ones, but INIT_WRITE + DSYNC doesn't depend on block
size.
Modern IDE drive? -:))
Probably we should change the code to use O_DSYNC if it is defined, even
without changing XLogWrite to write more than 1 block at once (if requested)?
As for O_SYNC:
bash-2.02# gcc -Wall -O -DINIT_WRITE tfsync.c
bash-2.02# time a.out
real 0m54.786s
user 0m0.010s
sys 0m10.820s
bash-2.02# gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
bash-2.02# time a.out
real 0m52.406s
user 0m0.020s
sys 0m0.650s
Not a big win. Does Solaris have a more optimized search for dirty blocks
than Tom's HP and Andreas' platform?
Vadim
$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out
real 1m15.11s
user 0m0.04s
sys 0m32.76s
Note the large amount of system time here, and the fact that the extra
time in INIT_WRITE is all system time. I have previously
observed that fsync() on HPUX 10.20 appears to iterate through every
kernel disk buffer belonging to the file, presumably checking their
dirtybits one by one. The INIT_WRITE form loses because each fsync in
the second loop has to iterate through a full 16Mb worth of buffers,
whereas without INIT_WRITE there will only be as many buffers as the
amount of file we've filled so far. (On this platform, it'd probably
be a win to use log segments smaller than 16Mb...) It's interesting
that there's no visible I/O cost here for the extra write pass ---
the extra I/O must be completely overlapped with the extra system time.
Tom, could you run this test for different block sizes?
Up to 32*8k?
Just curious when you get something close to
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.outreal 0m21.40s
user 0m0.02s
sys 0m0.60s
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
Tom, could you run this test for different block sizes?
Up to 32*8k?
You mean changing the amount written per write(), while holding the
total file size constant, right?
Yes. Currently XLogWrite writes 8k blocks one by one. From what I've seen
on Solaris we can use O_DSYNC there without changing XLogWrite to
write() more than 1 block (if > 1 block is available for writing).
But on other platforms write(BLOCKS_TO_WRITE * 8k) + fsync() will probably
be faster than BLOCKS_TO_WRITE * write(8k) (for a file opened with O_DSYNC)
if BLOCKS_TO_WRITE > 1.
I just wonder at what BLOCKS_TO_WRITE we'll see the same times for both
approaches.
Okay, I changed the program to
char zbuffer[8192 * BLOCKS];
(all else the same)
and on HPUX 10.20 I get
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
$ time a.out
real 1m18.48s
user 0m0.04s
sys 0m34.69s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=4 tfsync.c
$ time a.out
real 0m35.10s
user 0m0.01s
sys 0m9.08s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=8 tfsync.c
$ time a.out
real 0m29.75s
user 0m0.01s
sys 0m5.23s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=32 tfsync.c
$ time a.out
real 0m22.77s
user 0m0.01s
sys 0m1.80s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=64 tfsync.c
$ time a.out
real 0m22.08s
user 0m0.01s
sys 0m1.25s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=1 tfsync.c
$ time a.out
real 0m20.64s
user 0m0.02s
sys 0m0.67s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=4 tfsync.c
$ time a.out
real 0m20.72s
user 0m0.01s
sys 0m0.57s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=32 tfsync.c
$ time a.out
real 0m20.59s
user 0m0.01s
sys 0m0.61s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=64 tfsync.c
$ time a.out
real 0m20.86s
user 0m0.01s
sys 0m0.69s
So I also see that there is no benefit to writing more than one block at
a time with ODSYNC. And even at half a meg per write, DSYNC is slower
than ODSYNC with 8K per write! Note the fairly high system-time
consumption for DSYNC, too. I think this is not so much a matter of a
really good ODSYNC implementation, as a really bad DSYNC one ...
regards, tom lane
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
^^^^^^^^^^^
You should use -DUSE_OSYNC to test O_SYNC.
So you've tested N * write() + fsync(), exactly what I asked -:)
So I also see that there is no benefit to writing more than
one block at a time with ODSYNC. And even at half a meg per write,
DSYNC is slower than ODSYNC with 8K per write! Note the fairly high
system-time consumption for DSYNC, too. I think this is not so much
a matter of a really good ODSYNC implementation, as a really bad DSYNC
one ...
So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync(). Though, we haven't tested write() +
fdatasync() yet...
Vadim
More numbers, these from a Powerbook G3 laptop running Linux 2.2:
[tgl@g3 tmp]$ uname -a
Linux g3 2.2.18-4hpmac #1 Thu Dec 21 15:16:15 MST 2000 ppc unknown
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m32.418s
user 0m0.020s
sys 0m14.020s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=4 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m10.894s
user 0m0.000s
sys 0m4.030s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=8 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m7.211s
user 0m0.000s
sys 0m2.200s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=32 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m4.441s
user 0m0.020s
sys 0m0.870s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=64 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m4.488s
user 0m0.000s
sys 0m0.640s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=1 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m3.725s
user 0m0.000s
sys 0m0.310s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=4 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m3.785s
user 0m0.000s
sys 0m0.290s
[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=64 tfsync.c
[tgl@g3 tmp]$ time ./a.out
real 0m3.753s
user 0m0.010s
sys 0m0.300s
Starting to look like we should just use ODSYNC where available, and
forget about dumping more per write ...
regards, tom lane
Starting to look like we should just use ODSYNC where available, and
forget about dumping more per write ...
I'll run these tests on RedHat 7.0 tomorrow.
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
^^^^^^^^^^^
You should use -DUSE_OSYNC to test O_SYNC.
Ooops ... let's hear it for cut-and-paste, and for sharp-eyed readers!
Just for completeness, here are the results for O_SYNC:
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=1 tfsync.c
$ time a.out
real 0m43.44s
user 0m0.02s
sys 0m0.74s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=4 tfsync.c
$ time a.out
real 0m26.38s
user 0m0.01s
sys 0m0.59s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=8 tfsync.c
$ time a.out
real 0m23.86s
user 0m0.01s
sys 0m0.59s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=64 tfsync.c
$ time a.out
real 0m22.93s
user 0m0.01s
sys 0m0.66s
Better than fsync(), but still not up to O_DSYNC.
So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync(). Though, we haven't tested write() +
fdatasync() yet...
Good point, we should check fdatasync() too --- although I have no
machines where it's different from fsync().
regards, tom lane
So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync(). Though, we haven't tested write() +
fdatasync() yet...
Good point, we should check fdatasync() too --- although I have no
machines where it's different from fsync().
I've tested it on Solaris - no better than O_DSYNC (expected, taking
into account that the O_DSYNC results don't depend on block counts).
Ok, I've made changes in xlog.c and run tests: 50 clients inserted
(int4, text[1-256]) into 50 tables,
-B 16384, -wal_buffers 256, -wal_files 0.
FSYNC: 257tps
O_DSYNC: 333tps
Just(?) 30% faster, -:(
But I had no ability to place the log on a separate disk yet...
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
Ok, I've made changes in xlog.c and run tests:
Could you send me your diffs?
regards, tom lane
On Saturday 10 March 2001 08:41, Tom Lane wrote:
More numbers, these from a Powerbook G3 laptop running Linux 2.2:
Eeegghhh. Sorry... But where did you get O_DSYNC on Linux?
Maybe here?
bits/fcntl.h: # define O_DSYNC O_SYNC
There is no O_DSYNC in the kernel at all... Even in the latest 2.4.x.
--
Sincerely Yours,
Denis Perchine
----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------
Denis Perchine <dyp@perchine.com> writes:
On Saturday 10 March 2001 08:41, Tom Lane wrote:
More numbers, these from a Powerbook G3 laptop running Linux 2.2:
Eeegghhh. Sorry... But where did you get O_DSYNC on Linux?????
bits/fcntl.h: # define O_DSYNC O_SYNC
Hm, must be. Okay, so those two sets of numbers should be taken as
fsync() and O_SYNC respectively. Still the conclusion seems pretty
clear: the open() options are way more efficient than calling fsync()
separately.
regards, tom lane
Ok, I've made changes in xlog.c and run tests:
Could you send me your diffs?
Sorry, Monday only.
Vadim
Ok, I've made changes in xlog.c and run tests: 50 clients inserted
(int4, text[1-256]) into 50 tables,
-B 16384, -wal_buffers 256, -wal_files 0.
FSYNC: 257tps
O_DSYNC: 333tps
Just(?) 30% faster, -:(
First of all, if you ask me, that is one hell of an improvement :-)
It shows that the WAL write was actually the bottleneck in this particular case.
The bottleneck may now have shifted to some other resource.
It would probably also be good to actually write more than one
page with one call instead of the current "for (;XLByteLT...)" loop
in XLogWrite. The reasoning is that for each call to write(), the OS
can take your timeslice away, allowing other backends to work
and thus reposition the disk head (for selects).
And here are my measurements with tfsync.c:
zeu@a82101002:~> xlc -O2 tfsync.c -DINIT_WRITE -DUSE_ODSYNC -DBUFFERS=1 -o tfsync
zeu@a82101002:~> time tfsync
real 0m26.174s
user 0m0.040s
sys 0m2.920s
zeu@a82101002:~> xlc -O2 tfsync.c -DINIT_WRITE -DUSE_ODSYNC -DBUFFERS=8 -o tfsync
zeu@a82101002:~> time tfsync
real 0m8.950s
user 0m0.010s
sys 0m2.020s
Andreas
PS: to Tom: on AIX, O_SYNC and O_DSYNC do not make a difference with tfsync.c;
both are comparable to your O_DSYNC measurements. Maybe this is because of the
jfs journal, where only one write to the journal is necessary for all fs work (inode...).
FSYNC: 257tps
O_DSYNC: 333tps
Just(?) 30% faster, -:(
First of all, if you ask me, that is one hell of an improvement :-)
Of course -:) But the tfsync tests were more promising -:)
Probably we should update XLogWrite to write() more than 1 block,
but Tom should apply his patches first (btw, did you implement
"log file size" condition for checkpoints, Tom?).
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
Probably we should update XLogWrite to write() more than 1 block,
but Tom should apply his patches first (btw, did you implement
"log file size" condition for checkpoints, Tom?).
Yes I did. There's a variable now to specify a checkpoint every N
log segments --- I figured that was good enough resolution, and it
allowed the test to be made only when we're rolling over to a new
segment, so it's not in a time-critical path.
If you're happy with what I did so far, I'll go ahead and commit.
regards, tom lane