Background writer process
The attached diff is another attempt at distributing the write IO.
It is a separate background process, much like the checkpointer. Its
purpose is to keep the number of dirty blocks in the buffer cache at a
reasonable level, and to try to ensure that the buffers returned by the
replacement strategy are always clean. The current shot does it this way:
- get a list of all dirty blocks in strategy replacement order
- flush n percent of that list or a maximum of m buffers
(whatever is smaller)
- issue a sync()
- sleep for x milliseconds
If there is nothing to do, it will sleep for 10 seconds before checking
again. It acquires the checkpoint lock during the flush, so it will
yield to a real checkpoint.
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm for how much and how
often to flush can be improved significantly. But after all, this does
not change the real checkpointing at all, and the general framework of
having a separate process is what we probably want.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Attachments:
bgwriter.v1.diff (text/plain; +301 −81)
On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Why is the sync() needed at all? My understanding was that it
was only needed in case of a checkpoint.
Kurt
Kurt Roeckx wrote:
On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Why is the sync() needed at all? My understanding was that it
was only needed in case of a checkpoint.
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
Kurt Roeckx wrote:
On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Why is the sync() needed at all? My understanding was that it
was only needed in case of a checkpoint.
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.
Jan
Jan Wieck wrote:
Bruce Momjian wrote:
Kurt Roeckx wrote:
On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Why is the sync() needed at all? My understanding was that it
was only needed in case of a checkpoint.
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.
That's not completely true. Some kernels have trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.
On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
Jan Wieck wrote:
Bruce Momjian wrote:
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.
That's not completely true. Some kernels with trickle sync, meaning
they sync a little bit regularly rather than all at once so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.
So basically on some kernels you want them to flush their dirty
buffers faster.
I have a feeling we should make it depend on the system how we ask it
not to keep the data in memory too long, and that sync(), fsync() or
O_SYNC could be a fallback in case it's needed and there are no better
ways of doing it.
Maybe something like posix_fadvise() might be useful too on systems
that have it?
Kurt
Kurt Roeckx wrote:
On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
Jan Wieck wrote:
Bruce Momjian wrote:
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.
That's not completely true. Some kernels with trickle sync, meaning
they sync a little bit regularly rather than all at once so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.
So basicly on some kernels you want them to flush their dirty
buffers faster.
I have a feeling we should more make it depend on the system how
we ask them not to keep it in memory too long and that maybe the
sync(), fsync() or O_SYNC could be a fallback in case it's needed
and there are no better ways of doing it.
Maybe something as posix_fadvise() might be useful too on systems
that have it?
That is all right, and as said, how often, how much, and how forcefully
we do the IO can all be configurable and as flexible as people see fit.
But whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or
posix_fadvise(), somewhere you have to do the write(). And that write
has to be coordinated with the buffer cache replacement strategy, so that
you write those buffers that are likely to be replaced soon, and don't
write those that the strategy intends to keep longer anyway. Except
at a checkpoint; then you have to write whatever is dirty.
The patch I posted does this write() in coordination with the strategy
in a separate background process, so that the regular backends don't
have to write under normal circumstances (there are some places in DDL
statements that call BufferSync(); those are exceptions IMHO). Can we
agree on this general outline? Or do we have any better proposals?
Jan
Kurt Roeckx wrote:
On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
Jan Wieck wrote:
Bruce Momjian wrote:
He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.
write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.
That's not completely true. Some kernels with trickle sync, meaning
they sync a little bit regularly rather than all at once so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.
So basicly on some kernels you want them to flush their dirty
buffers faster.
I have a feeling we should more make it depend on the system how
we ask them not to keep it in memory too long and that maybe the
sync(), fsync() or O_SYNC could be a fallback in case it's needed
and there are no better ways of doing it.
I think the final plan is to have a GUC variable that controls how the
kernel is _encouraged_ to write dirty buffers to disk.
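Such a knob might eventually look like an ordinary postgresql.conf setting; the name and the values below are purely illustrative, not taken from any patch:

```
# How dirty buffers are encouraged from the kernel out to disk
# (illustrative name and values: sync, fsync, open_sync)
bgwriter_sync_method = fsync
```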
Jan Wieck wrote:
That is all right and as said, how often, how much and how forced we do
the IO can all be configurable and as flexible as people see fit. But
whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or
posix_fadvise(), somewhere you have to do the write(). And that write
has to be coordinated with the buffer cache replacement strategy so that
you write those buffers that are likely to be replaced soon, and don't
write those that the strategy thinks keeping for longer anyway. Except
at a checkpoint, then you have to write whatever is dirty.
The patch I posted does this write() in coordination with the strategy
in a separate background process, so that the regular backends don't
have to write under normal circumstances (there are some places in DDL
statements that call BufferSync(), that's exceptions IMHO). Can we agree
on this general outline? Or do we have any better proposals?
Agreed. Background write() is a win on all OS's. It is just the
kernel-to-disk part we will have to make configurable, I think.
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Is having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
IMO, with fsync, we tell the kernel that it can write this buffer. It may
or may not write it immediately, unless it is a hard sync.
Since postgresql can afford lazy writes for data files, I think this
could work.
Just a thought..
Shridhar
Shridhar Daithankar wrote:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Having fsync for regular data files and sync for WAL segments a comfortable
compromise? Or is this going to use fsync for all of them?
IMO, with fsync, we tell the kernel that it can write this buffer. It may
or may not write it immediately, unless it is a hard sync.
I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.
Since postgresql can afford lazy writes for data files, I think this could
work.
The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.
Jan
Shridhar Daithankar wrote:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.
IMO, with fsync, we tell kernel that you can write this buffer. It may or may
not write it immediately, unless it is hard sync.
Since postgresql can afford lazy writes for data files, I think this could
work.
fsync() doesn't return until the data is on the disk. It doesn't
schedule the write then return, as far as I know. sync() does schedule
the writes, I think, which can be bad, but we delay a little to wait for
it to complete.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Shridhar Daithankar wrote:
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.
sync() for WAL is a complete nonstarter, because it gives you no
guarantees at all about whether the write has occurred. I don't really
care what you say about speed; this is a correctness point.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Shridhar Daithankar wrote:
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.
sync() for WAL is a complete nonstarter, because it gives you no
guarantees at all about whether the write has occurred. I don't really
care what you say about speed; this is a correctness point.
Sorry, I meant sync() is needed for recycling WAL (checkpoint), not for
WAL writes. I assume that's what Shridhar meant, but now I am not sure.
On Friday 14 November 2003 22:10, Jan Wieck wrote:
Shridhar Daithankar wrote:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
IMO, with fsync, we tell the kernel that it can write this buffer. It may
or may not write it immediately, unless it is a hard sync.
I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.
Oops.. that's bad.
Since postgresql can afford lazy writes for data files, I think this
could work.
The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.
I just made another posting on patches for a thread crossing win32-devel.
Essentially I said:
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC. (Not sure if the current
code does it. The hackery in xlog.c is not exactly trivial.)
2. Open data files normally and fsync them only in the background writer
process.
Now the BGWriter process will flush everything at the time of checkpointing.
It does not need to flush WAL because of O_SYNC (ideally, but an additional
fsync won't hurt). So it just flushes all the file descriptors touched since
the last checkpoint, which should not be much of a load because it is
flushing those files intermittently anyway.
It could also work nicely if only the background writer fsyncs the data
files. Backends can either wait or proceed to other business by the time
the disk is flushed. Backends need to wait while committing, and it should
be a rather small delay syncing to disk in the current process as opposed
to in a background process.
In case of commit, the BGWriter could get away with the files touched in
the transaction plus WAL, as opposed to all files touched since the last
checkpoint plus WAL in the case of a checkpoint. I don't know how difficult
that would be.
What is different in the current BGWriter implementation? Use of sync()?
Shridhar
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if
Without grouping WAL writes, that does not fly. If, however, such grouping
is implemented, it should deliver optimal performance. I don't think
flushing WAL to the OS early (before a tx commits) is necessary, since
writing 8k or 256k to disk with one call takes nearly the same time. The
WAL write would need to be done as soon as either 256k fills or a txn
commits.
Andreas
Zeugswetter Andreas SB SD wrote:
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if
Without grouping WAL writes, that does not fly. If, however, such grouping
is implemented, it should deliver optimal performance. I don't think
flushing WAL to the OS early (before a tx commits) is necessary, since
writing 8k or 256k to disk with one call takes nearly the same time. The
WAL write would need to be done as soon as either 256k fills or a txn
commits.
That means no special treatment for WAL files? If it works, great. There
would be a single class of files to take care of w.r.t. the sync issue.
Even simpler.
Shridhar
Shridhar Daithankar wrote:
On Friday 14 November 2003 22:10, Jan Wieck wrote:
Shridhar Daithankar wrote:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
IMO, with fsync, we tell the kernel that it can write this buffer. It may
or may not write it immediately, unless it is a hard sync.
I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.
Oops.. that's bad.
Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.
Since postgresql can afford lazy writes for data files, I think this
could work.
The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.
I just made another posting on patches for a thread crossing win32-devel.
Essentially I said
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (Not sure if current code does
it. The hackery in xlog.c is not exactly trivial.)
We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.
2. Open data files normally and fsync them only in background writer process.
Now BGWriter process will flush everything at the time of checkpointing. It
does not need to flush WAL because of O_SYNC (ideally but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyways.
It could also work nicely if only background writer fsync the data files.
Backends can either wait or proceed to other business by the time disk is
flushed. Backends needs to wait for certain while committing and it should be
rather small delay of syncing to disk in current process as opposed to in
background process.
In case of commit, BGWriter could get away with files touched in transaction
+WAL as opposed to all files touched since last checkpoint+WAL in case of
checkpoint. I don't know how difficult that would be.
What is different in current BGwriter implementation? Use of sync()?
Well, basically we are still discussing how to do this. Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.
The open items are whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends. We might need to use multiple processes or threads to do
this. We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it. It certainly would
be nice if it could, and Tom thinks it can, so we are still exploring
this.
If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer; that will be
slower.
Another open issue is, _if_ the background writer can't keep up with the
normal backends, do we allow normal backends to write dirty buffers, and
do they use fsync(), or can we record the file in a shared area and have
the background writer do the fsync? This is the issue of whether one
process can fsync all dirty buffers for the file or just the buffers it
wrote.
I think these are the basics of the current discussion.
Bruce Momjian wrote:
Shridhar Daithankar wrote:
On Friday 14 November 2003 22:10, Jan Wieck wrote:
Shridhar Daithankar wrote:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
Having fsync for regular data files and sync for WAL segments a
comfortable compromise? Or is this going to use fsync for all of them?
IMO, with fsync, we tell the kernel that it can write this buffer. It may
or may not write it immediately, unless it is a hard sync.
I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.
Oops.. that's bad.
Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.
Since postgresql can afford lazy writes for data files, I think this
could work.
The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.
I just made another posting on patches for a thread crossing win32-devel.
Essentially I said
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (Not sure if current code does
it. The hackery in xlog.c is not exactly trivial.)
We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.
2. Open data files normally and fsync them only in background writer process.
Now BGWriter process will flush everything at the time of checkpointing. It
does not need to flush WAL because of O_SYNC (ideally but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyways.
It could also work nicely if only background writer fsync the data files.
Backends can either wait or proceed to other business by the time disk is
flushed. Backends needs to wait for certain while committing and it should be
rather small delay of syncing to disk in current process as opposed to in
background process.
In case of commit, BGWriter could get away with files touched in transaction
+WAL as opposed to all files touched since last checkpoint+WAL in case of
checkpoint. I don't know how difficult that would be.
What is different in current BGwriter implementation? Use of sync()?
Well, basically we are still discussing how to do this. Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.
The open items are whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends. We might need to use multiple processes or threads to do
this. We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it. It certainly would
be nice if it could, and Tom thinks it can, so we are still exploring
this.
Given that fsync is blocking, the background writer has to scale up in
terms of processes/threads with the disk-flushing load.
I would vote for threads, for the simple reason that in BGWriter threads
are needed only to flush the files: get an fd, fsync it, and get the next
one. There is no need to make the entire process thread-safe.
Furthermore, BGWriter has to detect the disk limit. If adding threads does
not improve fsyncing speed, it should stop adding them and wait. There is
nothing to do when the disk is saturated.
If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer; that will be
slower.
Certainly. However, a file opened with O_SYNC would not require a separate
fsync. I suggested it only for WAL. But with WAL block grouping as
suggested in another post, fsync for all files might be a good idea.
Just a thought.
Shridhar
If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer;
that will be slower.
You can forget O_SYNC for datafiles for now. There would simply be too
much to do currently to allow decent performance, like scatter/gather IO, ...
Imho the reasonable target should be to write from all backends but sync
(fsync) from the background writer only. (Tune the OS if it actually waits
until the pg-invoked sync, which is 5 minutes per default.)
Andreas