simplify register_dirty_segment()
The basic idea is to change register_dirty_segment() to
register_opened_segment().
That is, we don't care whether a segment is dirty or not: if someone opened
it, then we will fsync it at checkpoint time. Currently
register_dirty_segment() is called in mdextend(), mdwrite() and
mdtruncate(), which is costly since ForwardFsyncRequest() has to grab the
BgWriterCommLock lock exclusively each time, and mdwrite() is quite frequent.
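A minimal sketch of what I mean, using the existing ForwardFsyncRequest()
interface (register_opened_segment() doesn't exist yet; calling it from
mdopen()/_mdfd_openseg() instead of from the three register_dirty_segment()
sites would be the whole change):

    /*
     * Sketch only: register a segment for checkpoint-time fsync once,
     * when it is opened, instead of on every mdextend()/mdwrite()/
     * mdtruncate().
     */
    static void
    register_opened_segment(SMgrRelation reln, MdfdVec *seg)
    {
        if (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
        {
            /* out of shared memory -- see "Corner case" below */
        }
    }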
Benefits:
+ reduce BgWriterCommLock lock contention;
+ simplify the code - we just need to call register_opened_segment() when we
open the segment;
+ reduce the BgWriterShmem->requests[] size;
Costs:
+ we have to fsync() a file even if we made no modifications to it. The cost
is just an open/close of the file, so I think this is acceptable;
Corner case:
+ what if we run out of shared memory for ForwardFsyncRequest()? The
original code just fsync()s the file itself; now we can't do that.
Instead, we will issue a checkpoint request to the bgwriter and wait for it
(letting the bgwriter absorb the pending requests), then try
ForwardFsyncRequest() again (sketched below).
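Sketched roughly (RequestCheckpoint() is real, though I've simplified its
signature here; the retry loop is the new part):

    /*
     * Sketch: when the bgwriter's request queue is full, ask for a
     * checkpoint -- which absorbs the queue -- and retry, instead of
     * fsync()ing the file ourselves as register_dirty_segment() does now.
     */
    while (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
        RequestCheckpoint(true);    /* wait for the bgwriter to finish */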
Comments?
Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> That is, we don't care whether a segment is dirty or not: if someone
> opened it, then we will fsync it at checkpoint time.
On platforms that I'm familiar with, an fsync call causes the kernel
to spend a significant amount of time groveling through its buffers
to see if any are dirty. We shouldn't incur that cost to buy marginal
speedups at the application level. (In other words, "it's only an
open/close" is wrong.)
Also, it's not clear to me how this idea works at all, if a backend holds
a relation open across more than one checkpoint. What will re-register
the segment for the next cycle?
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes
> On platforms that I'm familiar with, an fsync call causes the kernel
> to spend a significant amount of time groveling through its buffers
> to see if any are dirty. We shouldn't incur that cost to buy marginal
> speedups at the application level. (In other words, "it's only an
> open/close" is wrong.)
I did some tests on SunOS, Linux and Windows. Basically, I create 100 files
and close them, then reopen them, write (dirty) / read (clean) 8192*100
bytes in each, and fsync() them. I measured the fsync() time.
SunOS 5.8 + NFS + SCSI
Fsync dirty files: duration: 2404.573 ms
Fsync clean files: duration: 598.037 ms
Linux 2.4 + Ext3 + IDE
Fsync dirty files: duration: 6951.793 ms
Fsync clean files: duration: 18.132 ms
Windows 2000 + NTFS + IDE
Fsync dirty files: duration: 3005.000 ms
Fsync clean files: duration: 1101.000 ms
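For reference, the shape of the per-file test loop (error handling omitted;
timing with gettimeofday(), and on Windows read _commit() where I write
fsync()):

    #include <unistd.h>
    #include <sys/time.h>

    #define NFILES  100
    #define BLCKSZ  8192
    #define NBLOCKS 100

    /*
     * Dirty (write) or merely read NBLOCKS blocks in each of the NFILES
     * already-open files, then time only the fsync() calls.
     */
    static long
    time_fsyncs(int fds[NFILES], int dirty)
    {
        static char buf[BLCKSZ];
        struct timeval start, stop;
        int         i, j;

        for (i = 0; i < NFILES; i++)
            for (j = 0; j < NBLOCKS; j++)
            {
                if (dirty)
                    write(fds[i], buf, BLCKSZ);
                else
                    read(fds[i], buf, BLCKSZ);
            }

        gettimeofday(&start, NULL);
        for (i = 0; i < NFILES; i++)
            fsync(fds[i]);
        gettimeofday(&stop, NULL);

        return (stop.tv_sec - start.tv_sec) * 1000000L +
               (stop.tv_usec - start.tv_usec);      /* microseconds */
    }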
I can't figure out why it takes so long on Windows and SunOS for clean
files - a possible reason is that they have to fsync some inode information,
like the last access time, even for clean files. Linux is quite smart in
this sense.
> Also, it's not clear to me how this idea works at all, if a backend holds
> a relation open across more than one checkpoint. What will re-register
> the segment for the next cycle?
You are right. A possible (but not clean) solution is like this: the
bgwriter maintains a refcount for each file. When the file is opened,
refcount++; when it is closed, refcount--. When the refcount reaches zero,
the bgwriter can safely remove the file from its PendingOpsTable after a
checkpoint.
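Roughly (the refcount field and the open/close bookkeeping are hypothetical;
PendingOperationEntry is the bgwriter's existing pendingOpsTable entry):

    /* Hypothetical: add a refcount to the bgwriter's pending-ops entry. */
    typedef struct
    {
        RelFileNode rnode;      /* file identity (hash key) */
        BlockNumber segno;
        int         refcount;   /* opens not yet matched by a close */
    } PendingOperationEntry;

    /* After fsync()ing at checkpoint time, drop only unreferenced entries. */
    if (entry->refcount == 0)
        hash_search(pendingOpsTable, (void *) entry, HASH_REMOVE, NULL);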
Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> I can't figure out why it takes so long on Windows and SunOS for clean
> files -
I told you why: they don't maintain bookkeeping information that allows
them to quickly identify dirty buffers belonging to a particular file.
Linux does ... but I'm not sure that makes it "smarter", since that
bookkeeping has a distributed cost, and the cost might or might not
be repaid in any particular system workload. It would be a reasonable
bet for a kernel designer to assume that fsync() is generally going to
have to wait for some I/O and so a bit of CPU overhead isn't really
going to matter.
> You are right. A possible (but not clean) solution is like this: the
> bgwriter maintains a refcount for each file. When the file is opened,
> refcount++; when it is closed, refcount--. When the refcount reaches
> zero, the bgwriter can safely remove the file from its PendingOpsTable
> after a checkpoint.
Adjusting such a global refcount would require global locks, which is
just what you were hoping to avoid :-(
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes
> It would be a reasonable
> bet for a kernel designer to assume that fsync() is generally going to
> have to wait for some I/O and so a bit of CPU overhead isn't really
> going to matter.
Reasonable.
> Adjusting such a global refcount would require global locks, which is
> just what you were hoping to avoid :-(
I don't want to avoid the global lock, but to alleviate contention on it
:-( I think the frequency of open()/close() will be much less than that of
write(), and the same goes for the shmem space required. On further
thought, though, I agree that this is unnecessary as far as
BgWriterCommLock is concerned - since BufMgrLock, which is used much more
intensively, doesn't bother us too much currently, this lock is just
nothing to worry about.
Regards,
Qingqing