AW: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

Started by Zeugswetter Andreas SB almost 25 years ago, 17 messages
#1 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at

This seems odd. As near as I can tell, O_SYNC is simply a command to do
fsync implicitly during each write call. It cannot save any I/O unless
I'm missing something significant. Where is the performance difference
coming from?

Yes, odd, but certainly very reproducible here.

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC

AIX has O_DSYNC (which is _FDATASYNC) too, but I assumed O_SYNC
would be more portable. Now we have two, maybe it is more widespread
than I thought.

I attach my modified version of Andreas' program. Note I do
not believe his assertion that close() implies fsync() --- on the machines I've
used, it demonstrably does not sync.

Ok, I am not sure, but essentially do we need it to sync? The OS sure isn't
supposed to notice after closing the file, that it ran out of disk space.

Andreas

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#1)
Re: AW: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

I attach my modified version of Andreas' program. Note I do not
believe his assertion that close() implies fsync() --- on the
machines I've used, it demonstrably does not sync.

Ok, I am not sure, but essentially do we need it to sync? The OS sure isn't
supposed to notice after closing the file, that it ran out of disk space.

I believe that out-of-space would be reported during the writes, anyway,
so that's not the issue.

The point of fsync'ing after the prewrite is to ensure that the indirect
blocks are down on disk. If you trust fdatasync (or O_DSYNC) to write
indirect blocks then it's not necessary --- but I'm pretty sure I heard
somewhere that some versions of fdatasync fail to guarantee that.

In any case, the real point of the prewrite is to move work out of the
transaction commit path, and so we're better off if we can sync the
indirect blocks during prewrite.

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC

AIX has O_DSYNC (which is _FDATASYNC) too, but I assumed O_SYNC

Oh? What speeds do you get if you use that?

regards, tom lane
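
A minimal sketch of the prewrite step described above, for readers following along. It is not the code from xlog.c or tfsync.c: the function name `prewrite_segment`, the 8k buffer, and the caller-supplied segment size are assumptions made for illustration. The idea is to zero-fill the whole segment and fsync() it once up front, so that running out of space is expected to surface as a write() error here, and the file's indirect blocks are already on disk before any commit has to wait on the log.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: zero-fill a new log segment and force it to disk.
 * Returns 0 on success, -1 on any error (including out of disk space). */
int prewrite_segment(const char *path, size_t segsize)
{
    char zbuffer[8192];
    size_t done = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(zbuffer, 0, sizeof(zbuffer));

    while (done < segsize)
    {
        size_t want = segsize - done;
        ssize_t n;

        if (want > sizeof(zbuffer))
            want = sizeof(zbuffer);
        n = write(fd, zbuffer, want);
        if (n <= 0)
        {
            /* Per the discussion, out-of-space should normally show up here */
            close(fd);
            unlink(path);
            return -1;
        }
        done += (size_t) n;
    }

    /* One full fsync so that data and file metadata are both flushed */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would invoke something like `prewrite_segment(path, 16 * 1024 * 1024)` when creating a segment, well before the first commit record is written to it.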

#3 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Zeugswetter Andreas SB (#1)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
(defined to do the equivalent of fdatasync()), and got truly
fascinating results. Apparently, on this platform these flags change
the kernel's buffering behavior! Observe:

Solaris 2.6 fascinates even more!!!

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out

real 0m21.40s
user 0m0.02s
sys 0m0.60s

bash-2.02# gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
bash-2.02# time a.out

real 0m4.242s
user 0m0.000s
sys 0m0.450s

It's hard to believe... Writing with DSYNC takes the same time as
file initialization - ~2 sec.
Also, there is no difference if using 64k blocks.
INIT_WRITE + OSYNC gives 52 sec for 8k blocks and 5.7 sec
for 256k ones, but INIT_WRITE + DSYNC doesn't depend on block
size.
Modern IDE drive? -:))

Probably we should change the code to use O_DSYNC if defined, even without
changing XLogWrite to write more than 1 block at once (if requested)?

As for O_SYNC:

bash-2.02# gcc -Wall -O -DINIT_WRITE tfsync.c
bash-2.02# time a.out

real 0m54.786s
user 0m0.010s
sys 0m10.820s
bash-2.02# gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
bash-2.02# time a.out

real 0m52.406s
user 0m0.020s
sys 0m0.650s

Not a big win. Does Solaris have a more optimized search for dirty blocks
than Tom's HP and Andreas' platform?

Vadim

#4 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#3)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out

real 1m15.11s
user 0m0.04s
sys 0m32.76s

Note the large amount of system time here, and the fact that the extra
time in INIT_WRITE is all system time. I have previously
observed that fsync() on HPUX 10.20 appears to iterate through every
kernel disk buffer belonging to the file, presumably checking their
dirtybits one by one. The INIT_WRITE form loses because each fsync in
the second loop has to iterate through a full 16Mb worth of buffers,
whereas without INIT_WRITE there will only be as many buffers as the
amount of file we've filled so far. (On this platform, it'd probably
be a win to use log segments smaller than 16Mb...) It's interesting
that there's no visible I/O cost here for the extra write pass ---
the extra I/O must be completely overlapped with the extra system time.

Tom, could you run this test for different block sizes?
Up to 32*8k?
Just curious when you get something close to

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out

real 0m21.40s
user 0m0.02s
sys 0m0.60s

Vadim

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Tom, could you run this test for different block sizes?
Up to 32*8k?

You mean changing the amount written per write(), while holding the
total file size constant, right?

Yes. Currently XLogWrite writes 8k blocks one by one. From what I've seen
on Solaris we can use O_DSYNC there without changing XLogWrite to
write() more than 1 block (if > 1 block is available for writing).
But on other platforms write(BLOCKS_TO_WRITE * 8k) + fsync() will probably
be faster than BLOCKS_TO_WRITE * write(8k) (for a file opened with O_DSYNC)
if BLOCKS_TO_WRITE > 1.
I just wonder at what BLOCKS_TO_WRITE we'll see the same times for both
approaches.

Okay, I changed the program to
char zbuffer[8192 * BLOCKS];
(all else the same)

and on HPUX 10.20 I get

$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
$ time a.out

real 1m18.48s
user 0m0.04s
sys 0m34.69s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=4 tfsync.c
$ time a.out

real 0m35.10s
user 0m0.01s
sys 0m9.08s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=8 tfsync.c
$ time a.out

real 0m29.75s
user 0m0.01s
sys 0m5.23s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=32 tfsync.c
$ time a.out

real 0m22.77s
user 0m0.01s
sys 0m1.80s
$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=64 tfsync.c
$ time a.out

real 0m22.08s
user 0m0.01s
sys 0m1.25s

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=1 tfsync.c
$ time a.out

real 0m20.64s
user 0m0.02s
sys 0m0.67s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=4 tfsync.c
$ time a.out

real 0m20.72s
user 0m0.01s
sys 0m0.57s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=32 tfsync.c
$ time a.out

real 0m20.59s
user 0m0.01s
sys 0m0.61s
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=64 tfsync.c
$ time a.out

real 0m20.86s
user 0m0.01s
sys 0m0.69s

So I also see that there is no benefit to writing more than one block at
a time with ODSYNC. And even at half a meg per write, DSYNC is slower
than ODSYNC with 8K per write! Note the fairly high system-time
consumption for DSYNC, too. I think this is not so much a matter of a
really good ODSYNC implementation, as a really bad DSYNC one ...

regards, tom lane
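
The two approaches being timed above can be sketched as follows. tfsync.c itself is not reproduced in this thread, so this is an illustration only: `write_log`, `BLCKSZ`, and `LOG_SYNC_FLAG` are hypothetical names, and the fallback to O_SYNC where O_DSYNC is not defined is an assumption. With `use_open_sync` set to 0 the function does write() followed by an explicit fsync(); with 1 it opens the file with the sync flag and relies on each write() being synchronous.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192            /* assumed: the 8k block size discussed above */

#ifdef O_DSYNC
#define LOG_SYNC_FLAG O_DSYNC
#else
#define LOG_SYNC_FLAG O_SYNC   /* assumed fallback where O_DSYNC is missing */
#endif

/* Write total_blocks blocks of zeros, `blocks` blocks per write() call.
 * total_blocks is assumed to be a multiple of blocks.
 * use_open_sync = 0: plain open(), fsync() after every write().
 * use_open_sync = 1: open() with LOG_SYNC_FLAG, no explicit fsync().
 * Returns 0 on success, -1 on error. */
int write_log(const char *path, int blocks, int total_blocks, int use_open_sync)
{
    size_t len;
    char *buf;
    int flags = O_WRONLY | O_CREAT;
    int fd, i, rc = 0;

    if (blocks <= 0 || total_blocks <= 0)
        return -1;
    len = (size_t) blocks * BLCKSZ;
    buf = calloc(1, len);
    if (buf == NULL)
        return -1;
    if (use_open_sync)
        flags |= LOG_SYNC_FLAG;
    fd = open(path, flags, 0600);
    if (fd < 0)
    {
        free(buf);
        return -1;
    }
    for (i = 0; i < total_blocks && rc == 0; i += blocks)
    {
        if (write(fd, buf, len) != (ssize_t) len)
            rc = -1;
        else if (!use_open_sync && fsync(fd) != 0)
            rc = -1;
    }
    if (close(fd) != 0)
        rc = -1;
    free(buf);
    return rc;
}
```

Timing `write_log(path, 1, 2048, 1)` against `write_log(path, 64, 2048, 0)` over a 16Mb file would correspond roughly to the ODSYNC with BLOCKS=1 and fsync with BLOCKS=64 runs; actual numbers depend entirely on the platform, as the results in this thread show.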

#6 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#4)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
                            ^^^^^^^^^^^
You should use -DUSE_OSYNC to test O_SYNC.
So you've tested N * write() + fsync(), exactly what I've asked -:)

So I also see that there is no benefit to writing more than
one block at a time with ODSYNC. And even at half a meg per write,
DSYNC is slower than ODSYNC with 8K per write! Note the fairly high
system-time consumption for DSYNC, too. I think this is not so much
a matter of a really good ODSYNC implementation, as a really bad DSYNC
one ...

So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync. Though, we haven't tested write() +
fdatasync() yet...

Vadim

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

More numbers, these from a Powerbook G3 laptop running Linux 2.2:

[tgl@g3 tmp]$ uname -a
Linux g3 2.2.18-4hpmac #1 Thu Dec 21 15:16:15 MST 2000 ppc unknown

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m32.418s
user 0m0.020s
sys 0m14.020s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=4 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m10.894s
user 0m0.000s
sys 0m4.030s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=8 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m7.211s
user 0m0.000s
sys 0m2.200s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=32 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m4.441s
user 0m0.020s
sys 0m0.870s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=64 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m4.488s
user 0m0.000s
sys 0m0.640s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=1 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m3.725s
user 0m0.000s
sys 0m0.310s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=4 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m3.785s
user 0m0.000s
sys 0m0.290s

[tgl@g3 tmp]$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC -DBLOCKS=64 tfsync.c
[tgl@g3 tmp]$ time ./a.out

real 0m3.753s
user 0m0.010s
sys 0m0.300s

Starting to look like we should just use ODSYNC where available, and
forget about dumping more per write ...

regards, tom lane

#8 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#6)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

Starting to look like we should just use ODSYNC where available, and
forget about dumping more per write ...

I'll run these tests on RedHat 7.0 tomorrow.

Vadim

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#6)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

$ gcc -Wall -O -DINIT_WRITE -DUSE_DSYNC -DBLOCKS=1 tfsync.c
                            ^^^^^^^^^^^
You should use -DUSE_OSYNC to test O_SYNC.

Ooops ... let's hear it for cut-and-paste, and for sharp-eyed readers!

Just for completeness, here are the results for O_SYNC:

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=1 tfsync.c
$ time a.out

real 0m43.44s
user 0m0.02s
sys 0m0.74s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=4 tfsync.c
$ time a.out

real 0m26.38s
user 0m0.01s
sys 0m0.59s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=8 tfsync.c
$ time a.out

real 0m23.86s
user 0m0.01s
sys 0m0.59s

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DBLOCKS=64 tfsync.c
$ time a.out

real 0m22.93s
user 0m0.01s
sys 0m0.66s

Better than fsync(), but still not up to O_DSYNC.

So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync. Though, we haven't tested write() +
fdatasync() yet...

Good point, we should check fdatasync() too --- although I have no
machines where it's different from fsync().

regards, tom lane

#10 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#8)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

So it seems we can use O_DSYNC without losing log write performance
compared with write() + fsync. Though, we haven't tested write() +
fdatasync() yet...

Good point, we should check fdatasync() too --- although I have no
machines where it's different from fsync().

I've tested it on Solaris - not better than O_DSYNC (expected, taking
into account that O_DSYNC results don't depend on block counts).

Ok, I've made changes in xlog.c and run tests: 50 clients inserted
(int4, text[1-256]) into 50 tables,
-B 16384, -wal_buffers 256, -wal_files 0.

FSYNC: 257tps
O_DSYNC: 333tps

Just(?) 30% faster, -:(
But I haven't been able to place the log on a separate disk yet...

Vadim
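
As a quick check of the "30%" figure, the relative gain can be computed from the two throughput numbers above. The helper name `percent_gain` is made up for this illustration.

```c
/* Relative throughput gain, in percent, of `after` over `before`. */
double percent_gain(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the reported figures, `percent_gain(257.0, 333.0)` is (333 - 257) / 257 * 100, or about 29.6%.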

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#10)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Ok, I've made changes in xlog.c and run tests:

Could you send me your diffs?

regards, tom lane

#12 Denis Perchine
dyp@perchine.com
In reply to: Tom Lane (#7)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

On Saturday 10 March 2001 08:41, Tom Lane wrote:

More numbers, these from a Powerbook G3 laptop running Linux 2.2:

Eeegghhh. Sorry... But where did you get O_DSYNC on Linux?????
Maybe here?

bits/fcntl.h: # define O_DSYNC O_SYNC

There isn't any O_DSYNC in the kernel... Even in the latest 2.4.x.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Denis Perchine (#12)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

Denis Perchine <dyp@perchine.com> writes:

On Saturday 10 March 2001 08:41, Tom Lane wrote:

More numbers, these from a Powerbook G3 laptop running Linux 2.2:

Eeegghhh. Sorry... But where did you get O_DSYNC on Linux?????

bits/fcntl.h: # define O_DSYNC O_SYNC

Hm, must be. Okay, so those two sets of numbers should be taken as
fsync() and O_SYNC respectively. Still the conclusion seems pretty
clear: the open() options are way more efficient than calling fsync()
separately.

regards, tom lane
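
The pitfall found here, a header that defines O_DSYNC as a mere alias for O_SYNC, can be detected at compile time. The sketch below is one way to do that; the function names `odsync_status` and `log_open_sync_flag` are hypothetical and not taken from the PostgreSQL sources.

```c
#include <fcntl.h>

/* Returns 1 if O_DSYNC is defined and distinct from O_SYNC,
 * 0 if it is only an alias for O_SYNC (as in the bits/fcntl.h line above),
 * and -1 if the platform does not define O_DSYNC at all. */
int odsync_status(void)
{
#if !defined(O_DSYNC)
    return -1;
#elif defined(O_SYNC) && (O_DSYNC == O_SYNC)
    return 0;
#else
    return 1;
#endif
}

/* Pick the open() flag to use for synchronous log writes:
 * O_DSYNC where it exists, otherwise O_SYNC, otherwise 0
 * (meaning the caller must fall back to an explicit fsync()). */
int log_open_sync_flag(void)
{
#if defined(O_DSYNC)
    return O_DSYNC;
#elif defined(O_SYNC)
    return O_SYNC;
#else
    return 0;
#endif
}
```

A benchmark could print `odsync_status()` next to its timings, so that O_DSYNC results from a platform where the flag is only an alias are not mistaken for a genuine data-only sync.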

#14 Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Mikheev, Vadim (#10)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

Ok, I've made changes in xlog.c and run tests:

Could you send me your diffs?

Sorry, Monday only.

Vadim

#15 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Mikheev, Vadim (#10)

Ok, I've made changes in xlog.c and run tests: 50 clients inserted
(int4, text[1-256]) into 50 tables,
-B 16384, -wal_buffers 256, -wal_files 0.

FSYNC: 257tps
O_DSYNC: 333tps

Just(?) 30% faster, -:(

First of all, if you ask me, that is one hell of an improvement :-)
It shows that the WAL write was actually the bottleneck in this particular case.
The bottleneck may now have shifted to some other resource.

It would probably also be good to actually write more than one
page with one call instead of the current "for (;XLByteLT...)" loop
in XLogWrite. The reasoning is, first, that for each call to write() the OS
takes your timeslice away, allowing other backends to work
and thus reposition the disk head (for selects),
and second, these measurements with tfsync.c:

zeu@a82101002:~> xlc -O2 tfsync.c -DINIT_WRITE -DUSE_ODSYNC -DBUFFERS=1 -o tfsync
zeu@a82101002:~> time tfsync
real 0m26.174s
user 0m0.040s
sys 0m2.920s
zeu@a82101002:~> xlc -O2 tfsync.c -DINIT_WRITE -DUSE_ODSYNC -DBUFFERS=8 -o tfsync
zeu@a82101002:~> time tfsync
real 0m8.950s
user 0m0.010s
sys 0m2.020s

Andreas

PS: to Tom, on AIX O_SYNC and O_DSYNC do not make a difference with tfsync.c;
both are comparable to your O_DSYNC measurements. Maybe this is because of the
JFS journal, where only one write to the journal is necessary for all fs work (inode...).

#16 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Zeugswetter Andreas SB (#15)
RE: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

FSYNC: 257tps
O_DSYNC: 333tps

Just(?) 30% faster, -:(

First of all, if you ask me, that is one hell of an improvement :-)

Of course -:) But tfsync tests were more promising -:)
Probably we should update XLogWrite to write() more than 1 block,
but Tom should apply his patches first (btw, did you implement
"log file size" condition for checkpoints, Tom?).

Vadim

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#16)
Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Probably we should update XLogWrite to write() more than 1 block,
but Tom should apply his patches first (btw, did you implement
"log file size" condition for checkpoints, Tom?).

Yes I did. There's a variable now to specify a checkpoint every N
log segments --- I figured that was good enough resolution, and it
allowed the test to be made only when we're rolling over to a new
segment, so it's not in a time-critical path.

If you're happy with what I did so far, I'll go ahead and commit.

regards, tom lane
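
The "checkpoint every N log segments" rule described above could be sketched roughly like this. The patch itself is not shown in the thread, so the struct, its field names, and `segment_rollover` are all hypothetical. The point of the design is that the counter is only touched when the log rolls over to a new segment, which keeps the test out of the time-critical write path.

```c
/* Hypothetical state; names are illustrative, not taken from xlog.c. */
typedef struct
{
    int checkpoint_segments;    /* request a checkpoint every N segments */
    int segs_since_checkpoint;  /* segments filled since the last request */
} LogRolloverState;

/* Call only when switching to a new log segment.
 * Returns 1 if a checkpoint should be requested now, else 0. */
int segment_rollover(LogRolloverState *st)
{
    st->segs_since_checkpoint++;
    if (st->segs_since_checkpoint >= st->checkpoint_segments)
    {
        st->segs_since_checkpoint = 0;
        return 1;
    }
    return 0;
}
```

With `checkpoint_segments` set to 3, the first two rollovers return 0 and the third returns 1, after which the count starts over.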