AW: AW: AW: WAL does not recover gracefully from out-of-disk-space
A short test shows that opening the file with O_SYNC, and thus avoiding fsync(),
would cut the effective time needed to sync-write the xlog by more than half.
Of course we would need to buffer >= 1 xlog page before write (or commit)
to gain the full advantage.

prewrite 0 + write and fsync: 60.4 sec
sparse file + write with O_SYNC: 37.5 sec
no prewrite + write with O_SYNC: 36.8 sec
prewrite 0 + write with O_SYNC: 24.0 sec

This seems odd. As near as I can tell, O_SYNC is simply a command to do
fsync implicitly during each write call. It cannot save any I/O unless
I'm missing something significant. Where is the performance difference
coming from?
Yes, odd, but certainly very reproducible here.
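For concreteness, the two write patterns being compared look roughly like
this. A minimal sketch, not code from the thread; the function names and
the page size are invented for illustration:

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192   /* one xlog page; size is illustrative */

/* Variant 1: plain write followed by an explicit fsync per commit. */
void write_then_fsync(int fd, const char *page)
{
    write(fd, page, BLCKSZ);
    fsync(fd);          /* flushes data and metadata for the whole file */
}

/* Variant 2: file opened with O_SYNC, so each write returns only after
   the blocks it wrote (plus any needed metadata) reach disk. */
int open_xlog_osync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}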
The reason I'm inclined to question this is that what we want is not an
fsync per write but an fsync per transaction, and we can't easily buffer
all of a transaction's XLOG writes...
Yes, that is something to consider, but it would probably be sufficient to buffer
1-3 optimal IO blocks (32-256k here).
I assumed that with a few busy clients each fsync would cover close to
one full xlog page, but that is probably too little.
Andreas
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
This seems odd. As near as I can tell, O_SYNC is simply a command to do
fsync implicitly during each write call. It cannot save any I/O unless
I'm missing something significant. Where is the performance difference
coming from?
Yes, odd, but certainly very reproducible here.
I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
(defined to do the equivalent of fdatasync()), and got truly fascinating
results. Apparently, on this platform these flags change the kernel's
buffering behavior! Observe:
$ gcc -Wall -O tfsync.c
$ time a.out
real 1m0.32s
user 0m0.02s
sys 0m16.16s
$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out
real 1m15.11s
user 0m0.04s
sys 0m32.76s
Note the large amount of system time here, and the fact that the extra
time in INIT_WRITE is all system time. I have previously observed that
fsync() on HPUX 10.20 appears to iterate through every kernel disk
buffer belonging to the file, presumably checking their dirtybits one by
one. The INIT_WRITE form loses because each fsync in the second loop
has to iterate through a full 16MB worth of buffers, whereas without
INIT_WRITE there will only be as many buffers as the amount of file
we've filled so far. (On this platform, it'd probably be a win to use
log segments smaller than 16MB...) It's interesting that there's no
visible I/O cost here for the extra write pass --- the extra I/O must be
completely overlapped with the extra system time.
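(For reference, the INIT_WRITE prefill presumably has this shape; a guess
based on the discussion, not the actual attachment, with illustrative
constants:)

#include <string.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 2048            /* 2048 * 8k = 16MB, one xlog segment */

/* Prefill the whole segment with zeros before the timed rewrite pass.
   On HPUX 10.20 this leaves 16MB worth of kernel buffers attached to
   the file, and every subsequent fsync() must scan all of them. */
void init_write(int fd)
{
    char zbuf[BLCKSZ];
    int  i;

    memset(zbuf, 0, sizeof zbuf);
    for (i = 0; i < NBLOCKS; i++)
        write(fd, zbuf, sizeof zbuf);
    lseek(fd, 0, SEEK_SET);     /* rewind for the timed loop */
}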
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
$ time a.out
real 0m45.04s
user 0m0.02s
sys 0m0.83s
We just bought back almost all the system time. The only possible
explanation is that this way either doesn't keep the buffers from prior
blocks, or does not scan them for dirtybits. I note that the open(2)
man page is phrased so that O_SYNC is actually defined not to fsync the
whole file, but only the part you just wrote --- I wonder if it's
actually implemented that way?
$ gcc -Wall -O -DINIT_WRITE -DUSE_SPARSE tfsync.c
$ time a.out
real 1m2.96s
user 0m0.02s
sys 0m27.11s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DUSE_SPARSE tfsync.c
$ time a.out
real 1m1.34s
user 0m0.01s
sys 0m0.59s
Sparse initialization wins a little in the non-O_SYNC case, but loses
when compared with O_SYNC on. Not sure why; perhaps it alters the
amount of I/O that has to be done for indirect blocks?
$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out
real 0m21.40s
user 0m0.02s
sys 0m0.60s
And the pièce de résistance: O_DSYNC *actually works* here, even though
I previously found that the fdatasync() call is stubbed to fsync() in
libc! This old HP box is built like a tank and has a similar lack of
attention to noise level ;-) so I can very easily tell by ear that I am
not getting back-and-forth seeks in this last case, even if the timing
didn't prove it to be true.
$ gcc -Wall -O -DUSE_ODSYNC tfsync.c
$ time a.out
real 1m1.56s
user 0m0.02s
sys 0m0.67s
Without INIT_WRITE, we are back to essentially the performance of fsync
even though we use DSYNC. This is expected since the inode must be
written to change the EOF value. Interestingly, the system time is
small, whereas in my first example it was large; but the elapsed time
is the same. Evidently the system time is nearly all overlapped with
I/O in the first example.
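In code form, the winning combination would look something like this; a
sketch following the reasoning above, not code from the attachment, with
an invented function name and illustrative sizes:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ  8192
#define SEGSIZE (16 * 1024 * 1024)     /* illustrative segment size */

/* Create the segment at full size and fsync it once, so that later
   O_DSYNC writes never move EOF and thus need no inode update. */
int open_prefilled_dsync(const char *path)
{
    char zbuf[BLCKSZ];
    long off;
    int  fd = open(path, O_WRONLY | O_CREAT, 0600);

    memset(zbuf, 0, sizeof zbuf);
    for (off = 0; off < SEGSIZE; off += BLCKSZ)
        write(fd, zbuf, sizeof zbuf);
    fsync(fd);                  /* pay the metadata cost once, up front */
    close(fd);

    /* Reopen with O_DSYNC: in-place rewrites now sync data only. */
    return open(path, O_WRONLY | O_DSYNC);
}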
At least on this platform, it would be definitely worthwhile to use
O_DSYNC even if that meant fsync per write rather than per transaction.
Can anyone else reproduce these results?
I attach my modified version of Andreas' program. Note I do not believe
his assertion that close() implies fsync() --- on the machines I've
used, it demonstrably does not sync. You'll also note that I made the
lseek optional in the second loop. This appears to make no real
difference, so I didn't include timings with the lseek enabled.
regards, tom lane
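(The attachment itself does not survive in this archive. Purely as a
reconstruction from the compile lines quoted above, a tfsync-style test
would have roughly this shape; the details are guesses, not Tom's actual
code, and DO_LSEEK is an invented name for the optional-lseek switch:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  2048                     /* 2048 * 8k = 16MB segment */
#define TESTFILE "tfsync.out"

int main(void)
{
    char buf[BLCKSZ];
    int  flags = O_RDWR | O_CREAT | O_TRUNC;
    int  fd, i;

#ifdef USE_OSYNC
    flags |= O_SYNC;
#endif
#ifdef USE_ODSYNC
    flags |= O_DSYNC;
#endif

    fd = open(TESTFILE, flags, 0600);
    if (fd < 0) { perror(TESTFILE); exit(1); }

    memset(buf, 0, sizeof buf);

#ifdef INIT_WRITE
#ifdef USE_SPARSE
    /* sparse init: write only the last byte, leaving a 16MB hole */
    lseek(fd, (off_t) NBLOCKS * BLCKSZ - 1, SEEK_SET);
    write(fd, buf, 1);
#else
    /* prewrite the whole segment with zeros */
    for (i = 0; i < NBLOCKS; i++)
        write(fd, buf, sizeof buf);
#endif
    fsync(fd);
    lseek(fd, 0, SEEK_SET);
#endif

    /* timed loop: force each block to disk, as a commit would */
    for (i = 0; i < NBLOCKS; i++)
    {
#ifdef DO_LSEEK
        lseek(fd, (off_t) i * BLCKSZ, SEEK_SET);    /* optional reseek */
#endif
        write(fd, buf, sizeof buf);
#if !defined(USE_OSYNC) && !defined(USE_ODSYNC)
        fsync(fd);
#endif
    }

    close(fd);
    return 0;
}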
Of course we would need to buffer >= 1 xlog page before write (or commit)
to gain the full advantage.

prewrite 0 + write and fsync: 60.4 sec
sparse file + write with O_SYNC: 37.5 sec
no prewrite + write with O_SYNC: 36.8 sec
prewrite 0 + write with O_SYNC: 24.0 sec
The reason I'm inclined to question this is that what we want is not an
fsync per write but an fsync per transaction, and we can't easily buffer
all of a transaction's XLOG writes...

Yes, that is something to consider, but it would probably be sufficient to buffer
1-3 optimal IO blocks (32-256k here).
I assumed that with a few busy clients each fsync would cover close to
one full xlog page, but that is probably too little.
I get best performance with either:
prewrite + 16k write with O_SYNC: 15.5 sec
prewrite + 32k write with O_SYNC: 11.5 sec
no prewrite + 256k write with O_SYNC: 5.4 sec
But 256k per transaction would probably be very unrealistic, so the
best overall performance would probably be achieved with
a 32k (or tunable) xlog buffer, O_SYNC, and prewrite.
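A sketch of that scheme, with invented names and records assumed smaller
than the buffer: accumulate records in a 32k user-space buffer and issue
one O_SYNC write per flush, so many records share a single physical write.

#include <string.h>
#include <unistd.h>

#define XLOG_BUFSIZE (32 * 1024)   /* the tunable buffer size suggested above */

static char   xlog_buf[XLOG_BUFSIZE];
static size_t xlog_used = 0;

/* Append a record to the in-memory buffer, flushing first if it would
   overflow.  fd is assumed to be open with O_SYNC, so each write() is
   durable by itself. */
void xlog_append(int fd, const void *rec, size_t len)
{
    if (xlog_used + len > sizeof xlog_buf)
    {
        write(fd, xlog_buf, xlog_used);
        xlog_used = 0;
    }
    memcpy(xlog_buf + xlog_used, rec, len);
    xlog_used += len;
}

/* At commit, force out whatever is currently buffered. */
void xlog_flush(int fd)
{
    if (xlog_used > 0)
    {
        write(fd, xlog_buf, xlog_used);
        xlog_used = 0;
    }
}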
Maybe a good thing for 7.1.1 :-)
Andreas
The reason I'm inclined to question this is that what we want
is not an fsync per write but an fsync per transaction, and we can't
easily buffer all of a transaction's XLOG writes...
WAL keeps records in WAL buffers (the wal_buffers parameter may be used to
increase the number of buffers), so we can make the write()s buffered.
Seems that my Solaris has fdatasync, so I'll test different approaches...
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
Seems that my Solaris has fdatasync, so I'll test different approaches...
A Sun guy told me that Solaris does this just the same way that HPUX
does it: fsync() scans all kernel buffers for the file, but O_SYNC
doesn't, because it knows it only needs to sync the blocks covered
by the write(). He didn't say about fdatasync/O_DSYNC but I bet the
same difference exists for those two.
The Linux 2.4 kernel allegedly is set up so that fsync() is smart enough
to only look at dirty buffers, not all the buffers of the file. So
the performance tradeoffs would be different there. But on HPUX and
probably Solaris, O_DSYNC is likely to be a big win, unless we can find
a way to stop the kernel from buffering so much of the WAL files.
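For illustration, the flag could be chosen at compile time along these
lines; a sketch, where XLOG_SYNC_FLAG is an invented name and not anything
in the source tree:

#include <fcntl.h>

/* Prefer O_DSYNC (data-only sync) where the platform provides it,
   falling back to O_SYNC, and to explicit fsync() as a last resort. */
#if defined(O_DSYNC)
#define XLOG_SYNC_FLAG O_DSYNC
#elif defined(O_SYNC)
#define XLOG_SYNC_FLAG O_SYNC
#else
#define XLOG_SYNC_FLAG 0        /* caller must fsync() after each write */
#endif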
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
We just bought back almost all the system time. The only possible
explanation is that this way either doesn't keep the buffers from prior
blocks, or does not scan them for dirtybits. I note that the open(2)
man page is phrased so that O_SYNC is actually defined not to fsync the
whole file, but only the part you just wrote --- I wonder if it's
actually implemented that way?
Sure, why not? That's how it is implemented in the Linux kernel. If
you do a write with O_SYNC set, the write simply flushes out the
buffers it just modified. If you call fsync, the kernel has to walk
through all the buffers looking for ones associated with the file in
question.
Ian