Sync vs. fsync during checkpoint

Started by Bruce Momjian almost 22 years ago. 33 messages.
#1 Bruce Momjian
pgman@candle.pha.pa.us

As some know, win32 doesn't have sync, and some are concerned that sync
isn't reliable enough during checkpoint anyway.

The trick is to somehow record all files modified since the last
checkpoint, and open/fsync/close each one. My idea is to stat() each
file in each directory and compare the modify time to determine if the
file has been modified since the last checkpoint. I can't think of an
easier way to efficiently collect all modified files. In this case, we
let the file system keep track of it for us.
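The stat()-based scan could be sketched like this (an illustrative hedge, not PostgreSQL code; `modified_since` and the one-second slop for coarse mtime resolution are assumptions, not anything in the tree):

```python
import os

def modified_since(data_dir, last_checkpoint_time, slop=1.0):
    """Collect files whose mtime suggests they were modified since the
    last checkpoint.  The slop allows for coarse stat() resolution."""
    modified = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_mtime >= last_checkpoint_time - slop:
                modified.append(path)
    return modified
```

Each returned path would then be opened, fsynced, and closed.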

However, on XP, I just tested if files that are kept open have their
modification times modified, and it seems they don't. If I do:

while :
do
    echo test
    sleep 5
done > x

I see the file size grow every 5 seconds, but I don't see the
modification time change. Can someone confirm this?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Bruce Momjian <pgman@candle.pha.pa.us> writes:

The trick is to somehow record all files modified since the last
checkpoint, and open/fsync/close each one. My idea is to stat() each
file in each directory and compare the modify time to determine if the
file has been modified since the last checkpoint.

This seems a complete non-starter, as stat() generally has at best
one-second resolution on mod times, even if you assume that the kernel
keeps mod time fully up-to-date at all times. In any case, it's
difficult to believe that stat'ing everything in a database directory
will be faster than keeping track of it for ourselves.

regards, tom lane

#3 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

The trick is to somehow record all files modified since the last
checkpoint, and open/fsync/close each one. My idea is to stat() each
file in each directory and compare the modify time to determine if the
file has been modified since the last checkpoint.

This seems a complete non-starter, as stat() generally has at best
one-second resolution on mod times, even if you assume that the kernel
keeps mod time fully up-to-date at all times. In any case, it's
difficult to believe that stat'ing everything in a database directory
will be faster than keeping track of it for ourselves.

Yes, we would have to have a slop factor and fsync anything more than
one second before the last checkpoint. Any ideas on how to record the
modified files without generating tons of output or locking contention?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Any ideas on how to record the
modified files without generating tons of output or locking contention?

What I've suggested before is that the bgwriter process can keep track
of all files that it's written to since the last checkpoint, and fsync
them during checkpoint (this would likely require giving the checkpoint
task to the bgwriter instead of launching a separate process for it,
but that doesn't seem unreasonable). Obviously this requires only local
storage in the bgwriter process, and hence no contention.

That leaves us still needing to account for files that are written
directly by a backend process and not by the bgwriter. However, I claim
that if the bgwriter is worth the cycles it's expending, cases in which
a backend has to write out a page for itself will be infrequent enough
that we don't need to optimize them. Therefore it would be enough to
have backends immediately sync any write they have to do. (They might
as well use O_SYNC.) Note that backends need not sync writes to temp
files or temp tables, only genuine shared tables.
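A minimal sketch of the bookkeeping Tom describes, with illustrative names (`BgWriter` and `pending_sync` are assumptions, not PostgreSQL identifiers): the writer remembers every file it has written since the last checkpoint, purely in local memory, and open/fsync/closes exactly those files at checkpoint time.

```python
import os

class BgWriter:
    """Sketch: track files dirtied since the last checkpoint and
    fsync only those when the checkpoint arrives."""
    def __init__(self):
        self.pending_sync = set()   # process-local: no contention

    def write_page(self, path, offset, page):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.pwrite(fd, page, offset)
        finally:
            os.close(fd)
        self.pending_sync.add(path)  # remember for checkpoint

    def checkpoint(self):
        for path in self.pending_sync:
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)         # open/fsync/close each modified file
            finally:
                os.close(fd)
        self.pending_sync.clear()
```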

If it turns out that it's not quite *that* infrequent, a compromise
position would be to keep a small list of files-needing-fsync in shared
memory. Backends that have to evict pages from shared buffers add those
files to the list; the bgwriter periodically removes entries from the
list and fsyncs the files. Only if there is no room in the list does a
backend have to fsync for itself. If the list is touched often enough
that it becomes a source of contention, then the whole bgwriter concept
is completely broken :-(

Now this last plan does assume that an fsync applied by process X will
write pages that were dirtied by process Y through a different file
descriptor for the same file. There's been some concern raised in the
past about whether we can assume that. If not, though, the simpler
backends-must-sync-their-own-writes plan will still work.
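The assumption in question can at least be exercised (though in-process code cannot prove durability): write through one descriptor, fsync through an independently opened descriptor for the same file. POSIX specifies that fsync() flushes the file's dirty data, not just data written through that particular descriptor.

```python
import os, tempfile

# Write through one descriptor, fsync through another for the same file,
# i.e. the pattern the checkpoint scheme relies on.
fd_writer, path = tempfile.mkstemp()
os.write(fd_writer, b"dirtied by process Y")

fd_checkpointer = os.open(path, os.O_RDWR)  # independent descriptor
os.fsync(fd_checkpointer)  # flushes the file's data, not the fd's
os.close(fd_checkpointer)
os.close(fd_writer)
```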

regards, tom lane

#5 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: Sync vs. fsync during checkpoint

Tom Lane wrote:

What I've suggested before is that the bgwriter process can keep track
of all files that it's written to since the last checkpoint, and fsync
them during checkpoint (this would likely require giving the checkpoint
task to the bgwriter instead of launching a separate process for it,
but that doesn't seem unreasonable). Obviously this requires only local
storage in the bgwriter process, and hence no contention.

That leaves us still needing to account for files that are written
directly by a backend process and not by the bgwriter. However, I claim
that if the bgwriter is worth the cycles it's expending, cases in which
a backend has to write out a page for itself will be infrequent enough
that we don't need to optimize them. Therefore it would be enough to
have backends immediately sync any write they have to do. (They might
as well use O_SYNC.) Note that backends need not sync writes to temp
files or temp tables, only genuine shared tables.

If it turns out that it's not quite *that* infrequent, a compromise
position would be to keep a small list of files-needing-fsync in shared
memory. Backends that have to evict pages from shared buffers add those
files to the list; the bgwriter periodically removes entries from the
list and fsyncs the files. Only if there is no room in the list does a
backend have to fsync for itself. If the list is touched often enough
that it becomes a source of contention, then the whole bgwriter concept
is completely broken :-(

Now this last plan does assume that an fsync applied by process X will
write pages that were dirtied by process Y through a different file
descriptor for the same file. There's been some concern raised in the
past about whether we can assume that. If not, though, the simpler
backends-must-sync-their-own-writes plan will still work.

I am concerned that the bgwriter will not be able to keep up with the
I/O generated by even a single backend restoring a database, let alone a
busy system. To me, the write() performed by the bgwriter, because it
is I/O, will typically be the bottleneck on any system that is I/O bound
(especially as the kernel buffers fill) and will not be able to keep up
with active backends now freed from writes.

The idea to fall back when the bgwriter cannot keep up is to have the
backends sync the data, which seems like it would just slow down an
I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but
could not come up with a clean way of having the backends communicate to
the bgwriter about their own non-sync writes. We had the ideas of using
shared memory or a socket, but these seemed like choke-points.

Here is my new idea. (I will keep throwing out ideas until I hit on a
good one.) The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint. My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time. If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner. I
don't think the bgwriter is going to be able to keep up with I/O bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes. (We might need a separate
bgwriter thread to do this or a separate process.)

As I remember, our new queue system has a list of buffers that are most
likely to be replaced, so the bgwriter can scan those first and make
sure they have their booleans set.

There is an issue that these booleans are set without locking, so there
might need to be a double-check of them by backends, first before the
write, then after just before they replace the buffer. The bgwriter
would clear the bits before the checkpoint starts.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#6 Greg Stark
gsstark@mit.edu
In reply to: Bruce Momjian (#1)
Re: Sync vs. fsync during checkpoint

Bruce Momjian <pgman@candle.pha.pa.us> writes:

As some know, win32 doesn't have sync, and some are concerned that sync
isn't reliable enough during checkpoint anyway.

The trick is to somehow record all files modified since the last
checkpoint, and open/fsync/close each one.

Note that some people believe that if you do this it doesn't guarantee that
any data written to other file descriptors referring to the same files would
also get synced.

I am not one of those people however. Both Solaris and NetBSD kernel hackers
have told me those OS's would work in such a scheme and furthermore that they
cannot imagine any sane VFS that would fail.

I definitely think it's better than calling sync(2), which doesn't
guarantee the blocks are written by any particular time at all.

--
greg

#7 Kevin Brown
kevin@sysexperts.com
In reply to: Bruce Momjian (#5)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Bruce Momjian wrote:

Here is my new idea. (I will keep throwing out ideas until I hit on a
good one.) The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint. My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time. If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner. I
don't think the bgwriter is going to be able to keep up with I/O bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes. (We might need a separate
bgwriter thread to do this or a separate process.)

That seems a bit excessive.

It seems to me that contention is only a problem if you keep a
centralized list of files that have been written by all the backends.
So don't do that.

Instead, have each backend maintain its own separate list in shared
memory. The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write
lock on the list, copies it into its own memory space, truncates the
list, and then releases the read lock. It then deletes the entries
out of its own list that have entries in the backend list it just read.
It then fsync()s the files that are left, under the assumption that the
backends will fsync() any file they write to directly.
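Kevin's per-backend protocol might look roughly like this, using a thread lock as a stand-in for a shared-memory lock (all names here are hypothetical): grab the lock on each backend's list, copy it, truncate it, release, then fsync only what the backends have not already synced themselves.

```python
import threading

class BackendSyncList:
    """Sketch of one backend's shared list (illustrative, not real)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.files = []        # files this backend wrote and fsynced itself

def collect_for_checkpoint(bgwriter_pending, backend_lists):
    """Return the files the bgwriter still needs to fsync: its own
    pending set minus everything the backends already synced."""
    already_synced = set()
    for blist in backend_lists:
        with blist.lock:                        # grab the write lock ...
            already_synced.update(blist.files)  # ... copy the list ...
            blist.files.clear()                 # ... truncate, then release
    return bgwriter_pending - already_synced
```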

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list. I'd conjecture that
backends that touch many of the same files are not likely to be touching a
large number of files per checkpoint, and those systems that touch a large
number of files probably do so through a lot of independent backends.

One other thing: I don't know exactly how checkpoints are orchestrated
between individual backends, but it seems clear to me that you want to do
a sync() *first*, then the fsync()s. The reason is that sync() allows
the OS to order the writes across all the files in the most efficient
manner possible, whereas fsync() only takes care of the blocks belonging
to the file in question. This won't be an option under Windows, but
on Unix systems it should make a difference. On Linux it should make
quite a difference, since its sync() won't return until the buffers
have been flushed -- and then the following fsync()s will return almost
instantaneously since their data has already been written (so there
won't be any dirty blocks in those files). I suppose it's possible that
on some OSes fsync()s could interfere with a running sync(), but for
those OSes we can just drop back to doing only fsync()s.
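The proposed ordering, sketched with the corresponding POSIX calls (Unix-only; Python's os.sync() maps to sync(2)):

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"checkpoint me")

os.sync()    # let the kernel order writes across *all* files first
os.fsync(fd) # then the per-file fsync should return almost at once
os.close(fd)
```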

As usual, I could be completely full of it. Take this for what it's
worth. :-)

--
Kevin Brown kevin@sysexperts.com

#8 Kevin Brown
kevin@sysexperts.com
In reply to: Kevin Brown (#7)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Some Moron at sysexperts.com wrote:

At checkpoint time, for each backend list, the bgwriter grabs a write
lock on the list, copies it into its own memory space, truncates the
list, and then releases the read lock.

Sigh. I meant to say that it then releases the *write* lock.

--
Kevin Brown kevin@sysexperts.com

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#7)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Kevin Brown <kevin@sysexperts.com> writes:

Instead, have each backend maintain its own separate list in shared
memory. The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list.

I fear that is just wishful thinking. Consider the system catalogs as a
counterexample of files that are likely to be touched/modified by many
different backends.

The bigger problem though with this is that it makes the problem of
list overflow much worse. The hard part about shared memory management
is not so much that the available space is small, as that the available
space is fixed --- we can't easily change it after postmaster start.
The more finely you slice your workspace, the more likely it becomes
that one particular part will run out of space. So the inefficient case
where a backend isn't able to insert something into the appropriate list
will become considerably more of a factor.

but it seems clear to me that you want to do
a sync() *first*, then the fsync()s.

Hmm, that's an interesting thought. On a machine that's doing a lot of
stuff besides running the database, a global sync would be
counterproductive --- but we could easily make it configurable as to
whether to issue the sync() or not. It wouldn't affect correctness.

regards, tom lane

#10 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#9)
Re: Sync vs. fsync during checkpoint

I am concerned that the bgwriter will not be able to keep up with the
I/O generated by even a single backend restoring a database, let alone a
busy system. To me, the write() performed by the bgwriter, because it
is I/O, will typically be the bottleneck on any system that is I/O bound
(especially as the kernel buffers fill) and will not be able to keep up
with active backends now freed from writes.

The idea to fall back when the bgwriter cannot keep up is to have the
backends sync the data, which seems like it would just slow down an
I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but
could not come up with a clean way of having the backends communicate to
the bgwriter about their own non-sync writes. We had the ideas of using
shared memory or a socket, but these seemed like choke-points.

Here is my new idea. (I will keep throwing out ideas until I hit on a
good one.) The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint. My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time. If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner. I
don't think the bgwriter is going to be able to keep up with I/O bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes. (We might need a separate
bgwriter thread to do this or a separate process.)

As I remember, our new queue system has a list of buffers that are most
likely to be replaced, so the bgwriter can scan those first and make
sure they have their booleans set.

There is an issue that these booleans are set without locking, so there
might need to be a double-check of them by backends, first before the
write, then after just before they replace the buffer. The bgwriter
would clear the bits before the checkpoint starts.

Now that no one is ill from my fsync buffer boolean idea, let me give
some implementation details. :-)

First, we need to add a bit to each shared buffer descriptor (sbufdesc)
that indicates whether the background writer (bgwriter) has recorded the
file associated with the buffer as needing fsync. This bit will be set
only by the background writer, usually during its normal buffer scan
looking for buffers to write. The background writer doesn't write all
dirty buffers on each buffer pass, but it could record the buffers that
need fsync on each pass, allowing backends to write those buffers if
buffer space becomes limited. (Not sure, but perhaps setting the buffer
bit could be done with only a shared lock on the buffer, because no one
else sets the bit.)

(One idea would be to move the fsync bit into its own byte in shared
memory so it is more centralized and no locking is required to set the
bit. Also, should we have one byte per shared buffer to indicate dirty
buffers so the bgwriter can find them more efficiently?)

The bit can be cleared if either the background writer writes the page,
or a backend writes the page.

Right now, the checkpoint process writes out all dirty buffers. We might
need to change this so the background writer does this because only it
can record files needing fsync. During checkpoint, the background
writer should write out all buffers. It will not be recording any new
fsync bits during this scan because it is writing every dirty buffer.
(If it did, it could set an fsync bit for a buffer written only
during or after the fsync it performs later.)

Once it is done, it should move the hash of files needing fsync to a
backup pointer and create a new empty list and do a scan so backends can
do writes. A subprocess should do fsync of all files, either using
fork() and having the child read the saved pointer hash, or for
EXEC_BACKEND, write a temp file that the child can read.
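One way to sketch that hand-off (illustrative names only; a real implementation would be C inside the bgwriter): swap the pending set aside, then fork a child to fsync the saved set while the parent resumes. Note that fsync() on a read-only descriptor is permitted.

```python
import os, tempfile

pending_fsync = set()

# Dirty one illustrative relation file so there is something to sync.
d = tempfile.mkdtemp()
path = os.path.join(d, "16384")
with open(path, "wb") as f:
    f.write(b"dirty page")
pending_fsync.add(path)

# Move the hash to a "backup pointer" and start a fresh empty one.
to_sync, pending_fsync = pending_fsync, set()

pid = os.fork()
if pid == 0:                        # child: fsync the saved set
    for p in to_sync:
        fd = os.open(p, os.O_RDONLY)
        os.fsync(fd)
        os.close(fd)
    os._exit(0)
_, status = os.waitpid(pid, 0)      # parent: wait (or keep scanning)
```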

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11 Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#9)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Tom Lane wrote:

Kevin Brown <kevin@sysexperts.com> writes:

Instead, have each backend maintain its own separate list in shared
memory. The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list.

I fear that is just wishful thinking. Consider the system catalogs as a
counterexample of files that are likely to be touched/modified by many
different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot
of backends, just that the number of such files is likely to be relatively
small -- a few tens of files, perhaps. But that admittedly can add up
fast. But see below.

The bigger problem though with this is that it makes the problem of
list overflow much worse. The hard part about shared memory management
is not so much that the available space is small, as that the available
space is fixed --- we can't easily change it after postmaster start.
The more finely you slice your workspace, the more likely it becomes
that one particular part will run out of space. So the inefficient case
where a backend isn't able to insert something into the appropriate list
will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem. If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.
How quickly depends on the individual kernel's implementation of the
dirty buffer list as it relates to file descriptors.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place. When anything
writes to a file without O_SYNC or a trailing fsync(), it sets the flag
for the relfilenode of what it's writing. Then at checkpoint time, the
bgwriter can scan the list and fsync() everything that has been flagged.

The relfilenode list should be relatively small in size: at most 16
bytes per item (and that on a 64-bit machine). A database that has 4096
file objects would have a 64K list at most. Not bad.
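The size estimate above is easy to check:

```python
# Worst-case relfilenode list sizing from the paragraph above:
# 16 bytes per entry (64-bit machine), 4096 file objects per database.
BYTES_PER_ENTRY = 16
FILE_OBJECTS = 4096
list_size = BYTES_PER_ENTRY * FILE_OBJECTS
assert list_size == 64 * 1024   # 64K at most
```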

Because each database backend can only see the class objects associated
with the database it's connected to or the global objects (if there's a
way to see all objects I'd like to know about it, but pg_class only
shows objects in the current database or objects which are visible to
all databases), the relfilenode list might have to be broken up into one
list per database, with perhaps a separate list for global objects.

The interesting question in that situation is how to handle object
creation and removal, which should be a relatively rare occurrence
(fortunately), so it supposedly doesn't have to be all that efficient.

--
Kevin Brown kevin@sysexperts.com

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#11)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Kevin Brown <kevin@sysexperts.com> writes:

Tom Lane wrote:

The more finely you slice your workspace, the more likely it becomes
that one particular part will run out of space. So the inefficient case
where a backend isn't able to insert something into the appropriate list
will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem. If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.

Yes, it's a big deal. You're arguing as though the bgwriter is the
thing that needs to be fast, when actually what we care about is the
backends being fast. If the bgwriter isn't doing the vast bulk of the
writing (and especially the fsync waits) then we are wasting our time
having one at all. So we need a scheme that makes it as unlikely as
possible that backends will have to do their own fsyncs. Small
per-backend fsync lists aren't the way to do that.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.

You're forgetting that we have a fixed-size workspace to do this in ...
and no way to know at postmaster start how many relations there are in
any of our databases, let alone predict how many there might be later on.

regards, tom lane

#13 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#12)
Re: [HACKERS] Sync vs. fsync during checkpoint

I don't think the bgwriter is going to be able to keep up with I/O bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes.

As long as the bgwriter does not do sync writes (which it does not,
since that would need a whole lot of work to be performant) it calls
write which returns more or less at once.
So the bottleneck can only be the fsync. Of those you would want at
least one in flight per pg disk, in parallel.

But I think it should really be left to the OS when it actually does the IO
for the writes from the bgwriter in between checkpoints.
So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

Andreas

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#13)
Re: [HACKERS] Sync vs. fsync during checkpoint

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint
time arrives. I am not sure if that hope has basis in fact or if it's
just wishful thinking. Most likely, if it does have basis in fact it's
because there is a standard syncer daemon forcing a sync() every thirty
seconds.

That means that instead of an I/O storm every checkpoint interval,
we get a smaller I/O storm every 30 seconds. Not sure this is a big
improvement. Jan already found out that issuing very frequent sync()s
isn't a win.

People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is
faulty. Performance of what --- the bgwriter? Nonsense, the *point*
of the bgwriter is to do the slow tasks. The only argument that has
any merit is that O_SYNC or immediate fsync will prevent us from having
multiple writes outstanding and thus reduce the efficiency of disk
write scheduling. This is a valid point but there is a limit to how
many writes we need to have in flight to keep things flowing smoothly.

What I'm thinking now is that the bgwriter should issue frequent fsyncs
for its writes --- not immediate, but a lot more often than once per
checkpoint. Perhaps take one recently-written unsynced file to fsync
every time it is about to sleep. You could imagine various rules for
deciding which one to sync; perhaps the one with the most writes issued
against it since last sync. When we have tablespaces it'd make sense to
try to distribute the syncs across tablespaces, on the assumption that
the tablespaces are probably on different drives.
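One of the possible rules Tom mentions, fsync the file with the most writes issued since its last sync, could be sketched like this (hypothetical names; a real version would live in the bgwriter's C code):

```python
import os

write_counts = {}   # path -> writes issued since that file's last fsync

def note_write(path):
    write_counts[path] = write_counts.get(path, 0) + 1

def sync_one_before_sleep():
    """Before sleeping, fsync the single file with the most unsynced
    writes, then reset its counter.  Returns the path synced, if any."""
    if not write_counts:
        return None
    path = max(write_counts, key=write_counts.get)
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    del write_counts[path]
    return path
```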

regards, tom lane

#15 Shridhar Daithankar
shridhar@frodo.hserus.net
In reply to: Tom Lane (#14)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

On Thursday 05 February 2004 20:24, Tom Lane wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint
time arrives. I am not sure if that hope has basis in fact or if it's
just wishful thinking. Most likely, if it does have basis in fact it's
because there is a standard syncer daemon forcing a sync() every thirty
seconds.

There are other benefits of writing pages earlier even though they might not
get synced immediately.

It would tell kernel that this is latest copy of updated buffer. Kernel VFS
should make that copy visible to every other backend as well. The buffer
manager will fetch the updated copy from VFS cache next time. All without
going to disk actually..(Within the 30 seconds window of course..)

People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is
faulty. Performance of what --- the bgwriter? Nonsense, the *point*
of the bgwriter is to do the slow tasks. The only argument that has
any merit is that O_SYNC or immediate fsync will prevent us from having
multiple writes outstanding and thus reduce the efficiency of disk
write scheduling. This is a valid point but there is a limit to how
many writes we need to have in flight to keep things flowing smoothly.

Is it a valid assumption for platforms-that-postgresql-supports that a write
call would make changes visible across processes?

What I'm thinking now is that the bgwriter should issue frequent fsyncs
for its writes --- not immediate, but a lot more often than once per

frequent fsyncs or frequent fsyncs per file descriptor written? I thought it
was the latter.

Just a thought.

Shridhar

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shridhar Daithankar (#15)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Shridhar Daithankar <shridhar@frodo.hserus.net> writes:

There are other benefits of writing pages earlier even though they might not
get synced immediately.

Such as?

It would tell kernel that this is latest copy of updated buffer. Kernel VFS
should make that copy visible to every other backend as well. The buffer
manager will fetch the updated copy from VFS cache next time. All without
going to disk actually..(Within the 30 seconds window of course..)

This seems quite irrelevant given the way we handle shared buffers.

frequent fsyncs or frequent fsyncs per file descriptor written? I thought it
was the latter.

You can only fsync one FD at a time (too bad ... if there were a
multi-file-fsync API it'd solve the overspecified-write-ordering issue).

regards, tom lane

#17 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#16)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is
faulty. Performance of what --- the bgwriter? Nonsense, the *point*

Imho that depends on the workload. For a normal OLTP workload this is
certainly correct. I do not think it is correct for mass loading,
or an otherwise IO bound db.

of the bgwriter is to do the slow tasks. The only argument that has
any merit is that O_SYNC or immediate fsync will prevent us from having
multiple writes outstanding and thus reduce the efficiency of disk
write scheduling. This is a valid point but there is a limit to how
many writes we need to have in flight to keep things flowing smoothly.

But that is imho the main point. The difference for modern disks
is 1 MB/s for random 8k vs. 20 MB/s for random 256k.

Don't get me wrong, I think sync writing would achieve maximum performance,
but you have to try to write physically adjacent 256k chunks, and you need a vague
idea which blocks to write in parallel. And since that is not so easy, I think
we could leave it to the OS.

And as an aside I think 20-30 minute checkpoint intervals would be sufficient
with a bgwriter.

Andreas

PS: don't most syncers have 60s intervals, not 30?

#18Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#14)
Re: [HACKERS] Sync vs. fsync during checkpoint

Tom Lane wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint
time arrives. I am not sure if that hope has basis in fact or if it's
just wishful thinking. Most likely, if it does have basis in fact it's
because there is a standard syncer daemon forcing a sync() every thirty
seconds.

Looking at the response time charts I did for showing how vacuum delay
is doing, it seems at least on Linux there is hope that that is the
case. Those charts have just a regular 5 minute checkpoint with enough
checkpoint segments for that, and no other sync effort done at all.

The system has a hard time handling a larger scaled test DB, so it is
definitely well saturated with IO. The charts are here:

http://developer.postgresql.org/~wieck/vacuum_cost/

That means that instead of an I/O storm every checkpoint interval,
we get a smaller I/O storm every 30 seconds. Not sure this is a big
improvement. Jan already found out that issuing very frequent sync()s
isn't a win.

In none of those charts can I see any checkpoint-caused IO storm any
more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
at checkpoints. If someone is interested in those as well I will put
them up.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#19Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#12)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Tom Lane wrote:

Kevin Brown <kevin@sysexperts.com> writes:

Well, running out of space in the list isn't that much of a problem. If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.

Yes, it's a big deal. You're arguing as though the bgwriter is the
thing that needs to be fast, when actually what we care about is the
backends being fast. If the bgwriter isn't doing the vast bulk of the
writing (and especially the fsync waits) then we are wasting our time
having one at all. So we need a scheme that makes it as unlikely as
possible that backends will have to do their own fsyncs. Small
per-backend fsync lists aren't the way to do that.

Ah, okay. Pardon me, I was writing on low sleep at the time.

If we want to make the backends as fast as possible then they should
defer synchronous writes to someplace else. But that someplace else
could easily be a process forked by the backend in question whose sole
purpose is to go through the list of files generated by its parent backend
and fsync() them. The backend can then go about its business and upon
receipt of the SIGCHLD notify anyone that needs to be notified that the
fsync()s have completed. This approach on any reasonable OS will have
minimal overhead because of copy-on-write page handling in the kernel
and the fact that the child process isn't going to exec() or write to
a bunch of memory. The advantage is that each backend can maintain its
own list in per-process memory instead of using shared memory. The
disadvantage is that a given file could have multiple simultaneous (or
close to simultaneous) fsync()s issued against it. As noted previously,
that might not be such a big deal.

You could still build a list in shared memory of the files that backends
are accessing but it would then be a cache of sorts because it would
be fixed in size. As soon as you run out of space in the shared list,
you'll have to expire some entries. An expired entry simply means
that multiple fsync()s might be issued for the file being referred to.
But I suspect that such a list would have far too much contention,
and that it would be more efficient to simply risk issuing multiple
fsync()s against the same file by multiple backend children.

Another advantage to the child-of-backend-fsync()s approach is that it
would cause simultaneous fsync()s to happen, and on more advanced OSes
the OS itself should be able to coalesce the work to be done into a more
efficient pattern of writes to the disk. That won't be possible if
fsync()s are serialized by PG. It's not as good as a syscall that would
allow you to fsync() a bunch of file descriptors simultaneously, but it
might be close.

I have no idea whether or not this approach would work in Windows.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.

You're forgetting that we have a fixed-size workspace to do this in ...
and no way to know at postmaster start how many relations there are in
any of our databases, let alone predict how many there might be later on.

Unfortunately, this is going to apply to most any approach. The number
of blocks being dealt with is not fixed, because even though the cache
itself is fixed in size, the number of block writes it represents (and
thus the number of files involved) is not. The list of files itself is
not fixed in size, either.

However, this *does* suggest another possible approach: you set up a
fixed size list and fsync() the batch when it fills up.

It sounds like we need to define the particular behavior we want first.
We're optimizing for some combination of throughput and responsiveness,
and those aren't necessarily the same thing. I suppose this means that
the solution chosen has to have enough knobs to allow the DBA to pick
where on the throughput/responsiveness curve he wants to be.

--
Kevin Brown kevin@sysexperts.com

#20Kevin Brown
kevin@sysexperts.com
In reply to: Kevin Brown (#19)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

I wrote:

But that someplace else
could easily be a process forked by the backend in question whose sole
purpose is to go through the list of files generated by its parent backend
and fsync() them. The backend can then go about its business and upon
receipt of the SIGCHLD notify anyone that needs to be notified that the
fsync()s have completed.

Duh, what am I thinking? Of course, the right answer is to have the
child notify anyone that needs notification that fsync()s are done. No
need for involvement of the parent (i.e., the backend in question)
unless the architecture of PG requires it somehow.

--
Kevin Brown kevin@sysexperts.com

#21Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Kevin Brown (#20)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Kevin Brown wrote:

I have no idea whether or not this approach would work in Windows.

The win32 API has ReadFileScatter/WriteFileGather, which were developed
to handle these types of problems. These two functions were added for
the sole purpose of making SQL server run faster. They are always
asynchronous and are very efficient. Perhaps the win32 port could just
deal with the synchronization with an eye for future optimizations down
the line?

Merlin

#22Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jan Wieck (#18)
Re: [HACKERS] Sync vs. fsync during checkpoint

Jan Wieck wrote:

Tom Lane wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint
time arrives. I am not sure if that hope has basis in fact or if it's
just wishful thinking. Most likely, if it does have basis in fact it's
because there is a standard syncer daemon forcing a sync() every thirty
seconds.

Looking at the response time charts I did for showing how vacuum delay
is doing, it seems at least on Linux there is hope that that is the
case. Those charts have just a regular 5 minute checkpoint with enough
checkpoint segments for that, and no other sync effort done at all.

The system has a hard time handling a larger scaled test DB, so it is
definitely well saturated with IO. The charts are here:

http://developer.postgresql.org/~wieck/vacuum_cost/

That means that instead of an I/O storm every checkpoint interval,
we get a smaller I/O storm every 30 seconds. Not sure this is a big
improvement. Jan already found out that issuing very frequent sync()s
isn't a win.

In none of those charts can I see any checkpoint-caused IO storm any
more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
at checkpoints. If someone is interested in those as well I will put
them up.

So, Jan, are you basically saying that the background writer has solved
the checkpoint I/O flood problem, and we just need to deal with changing
sync to multiple fsync's at checkpoint?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#23Kevin Brown
kevin@sysexperts.com
In reply to: Merlin Moncure (#21)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Merlin Moncure wrote:

Kevin Brown wrote:

I have no idea whether or not this approach would work in Windows.

The win32 API has ReadFileScatter/WriteFileGather, which were developed
to handle these types of problems. These two functions were added for
the sole purpose of making SQL server run faster. They are always
asynchronous and are very efficient. Perhaps the win32 port could just
deal with the synchronization with an eye for future optimizations down
the line?

The problem with the approach I described on win32 is that fast fork()s
are required for it to not significantly impact the running backends.
That is, fork() has to return quickly when called. I don't know whether
the implementation of fork() under win32 would be fast enough for
this purpose. It might be -- I don't have any experience with fork()
on win32 platforms so I can't say.

--
Kevin Brown kevin@sysexperts.com

#24Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#22)
Re: [HACKERS] Sync vs. fsync during checkpoint

Bruce Momjian wrote:

Jan Wieck wrote:

Tom Lane wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

So Imho the target should be to have not much IO open for the checkpoint,
so the fsync is fast enough, even if serial.

The best we can do is push out dirty pages with write() via the bgwriter
and hope that the kernel will see fit to write them before checkpoint
time arrives. I am not sure if that hope has basis in fact or if it's
just wishful thinking. Most likely, if it does have basis in fact it's
because there is a standard syncer daemon forcing a sync() every thirty
seconds.

Looking at the response time charts I did for showing how vacuum delay
is doing, it seems at least on Linux there is hope that that is the
case. Those charts have just a regular 5 minute checkpoint with enough
checkpoint segments for that, and no other sync effort done at all.

The system has a hard time handling a larger scaled test DB, so it is
definitely well saturated with IO. The charts are here:

http://developer.postgresql.org/~wieck/vacuum_cost/

That means that instead of an I/O storm every checkpoint interval,
we get a smaller I/O storm every 30 seconds. Not sure this is a big
improvement. Jan already found out that issuing very frequent sync()s
isn't a win.

In none of those charts can I see any checkpoint-caused IO storm any
more. Charts I'm currently doing for 7.4.1 show extremely clear spikes
at checkpoints. If someone is interested in those as well I will put
them up.

So, Jan, are you basically saying that the background writer has solved
the checkpoint I/O flood problem, and we just need to deal with changing
sync to multiple fsync's at checkpoint?

ISTM that the background writer at least has the ability to lower the
impact of a checkpoint significantly enough that one might not care
about it any more. "Has the ability" means it needs to be adjusted to
the actual DB usage. The charts I produced were not done with the
default settings, but rather after making the bgwriter a bit more
aggressive against dirty pages.

The whole sync() vs. fsync() discussion is in my opinion nonsense at
this point. Without the ability to limit the amount of files to a
reasonable number, by employing tablespaces in the form of larger
container files, the risk of forcing excessive head movement is simply
too high.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#24)
Re: [HACKERS] Sync vs. fsync during checkpoint

Jan Wieck <JanWieck@Yahoo.com> writes:

The whole sync() vs. fsync() discussion is in my opinion nonsense at
this point.

The sync vs fsync discussion is not about performance, it is about
correctness. You can't simply dismiss the fact that we don't know
whether a checkpoint is really complete when we write the checkpoint
record.

I liked the idea put forward by (I think) Kevin Brown, that we issue
sync to start the I/O and then a bunch of fsyncs to wait for it to
finish. If sync behaves per spec ("all the I/O is scheduled upon
return") then the fsyncs will not affect I/O ordering in the least.
But they will ensure that we don't proceed until the I/O is all done.

Also there is the Windows-port problem of not having sync available.
Doing the fsyncs only will provide an adequate, if possibly
lower-performing, solution there.

regards, tom lane

#26Greg Stark
gsstark@mit.edu
In reply to: Jan Wieck (#24)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Jan Wieck <JanWieck@Yahoo.com> writes:

The whole sync() vs. fsync() discussion is in my opinion nonsense at this
point. Without the ability to limit the amount of files to a reasonable number,
by employing tablespaces in the form of larger container files, the risk of
forcing excessive head movement is simply too high.

I don't think there was any suggestion of conflating tablespaces with
implementing a filesystem in postgres.

Tablespaces are just a database entity that database stored objects like
tables and indexes are associated to. They group database stored objects and
control the storage method and location.

The existing storage mechanism, namely a directory with a file for each
database object, is perfectly adequate and doesn't have to be replaced to
implement tablespaces. All that's needed is that the location of the directory
be associated with the "tablespace" of the object rather than be a global
constant.

Implementing an Oracle-style filesystem is just one more temptation to
reimplement OS services in the database. Personally I think it's an awful
idea. But even if postgres did it as an option, it wouldn't necessarily have
anything to do with tablespaces.

--
greg

#27Jan Wieck
JanWieck@Yahoo.com
In reply to: Greg Stark (#26)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Greg Stark wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

The whole sync() vs. fsync() discussion is in my opinion nonsense at this
point. Without the ability to limit the amount of files to a reasonable number,
by employing tablespaces in the form of larger container files, the risk of
forcing excessive head movement is simply too high.

I don't think there was any suggestion of conflating tablespaces with
implementing a filesystem in postgres.

Tablespaces are just a database entity that database stored objects like
tables and indexes are associated to. They group database stored objects and
control the storage method and location.

The existing storage mechanism, namely a directory with a file for each
database object, is perfectly adequate and doesn't have to be replaced to
implement tablespaces. All that's needed is that the location of the directory
be associated with the "tablespace" of the object rather than be a global
constant.

Implementing an Oracle-style filesystem is just one more temptation to
reimplement OS services in the database. Personally I think it's an awful
idea. But even if postgres did it as an option, it wouldn't necessarily have
anything to do with tablespaces.

Doing this is not just a question of what you call it. In a system with, let's
say, 500 active backends on a database with, let's say, 1000 things that are
represented as a file, you'll need half a million virtual file descriptors.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#27)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Jan Wieck <JanWieck@Yahoo.com> writes:

Doing this is not just what you call it. In a system with let's say 500
active backends on a database with let's say 1000 things that are
represented as a file, you'll need half a million virtual file descriptors.

[shrug] We've been dealing with virtual file descriptors for years.
I've seen no indication that they create any performance bottlenecks.

regards, tom lane

#29Florian Weimer
fw@deneb.enyo.de
In reply to: Tom Lane (#16)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Tom Lane wrote:

You can only fsync one FD at a time (too bad ... if there were a
multi-file-fsync API it'd solve the overspecified-write-ordering issue).

What about aio_fsync()?

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Weimer (#29)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Florian Weimer <fw@deneb.enyo.de> writes:

Tom Lane wrote:

You can only fsync one FD at a time (too bad ... if there were a
multi-file-fsync API it'd solve the overspecified-write-ordering issue).

What about aio_fsync()?

(1) it's unportable; (2) it's not clear that it's any improvement over
fsync(). The Single Unix Spec says aio_fsync "returns when the
synchronisation request has been initiated or queued to the file or
device". Depending on how the implementation works, this may mean that
all the dirty blocks have been scheduled for I/O and will be written
ahead of subsequently scheduled blocks --- if so, the results are not
really different from fsync()'ing the files in the same order.

The best idea I've heard so far is the one about sync() followed by
a bunch of fsync()s. That seems to be correct, efficient, and dependent
only on very-long-established Unix semantics.

regards, tom lane

#31Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#30)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

Tom Lane wrote:

The best idea I've heard so far is the one about sync() followed by
a bunch of fsync()s. That seems to be correct, efficient, and dependent
only on very-long-established Unix semantics.

Agreed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#32Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Kevin Brown (#11)
Re: [pgsql-hackers-win32] Sync vs. fsync during

"Kevin" == Kevin Brown <kevin@sysexperts.com> writes:

The bigger problem though with this is that it makes the
problem of list overflow much worse. The hard part about
shared memory management is not so much that the available
space is small, as that the available space is fixed --- we
can't easily change it after postmaster start. The more finely

Again, I can suggest the shared memory MemoryContext we use in
TelegraphCQ that is based on the OSSP libmm memory manager. We use it
to grow and shrink shared memory at will.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#33Noname
pgsql@mohawksoft.com
In reply to: Jan Wieck (#27)
Re: [pgsql-hackers-win32] Sync vs. fsync during

Greg Stark wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

The whole sync() vs. fsync() discussion is in my opinion nonsense at this
point. Without the ability to limit the amount of files to a reasonable
number, by employing tablespaces in the form of larger container files,
the risk of forcing excessive head movement is simply too high.

I don't think there was any suggestion of conflating tablespaces with
implementing a filesystem in postgres.

Tablespaces are just a database entity that database stored objects like
tables and indexes are associated to. They group database stored objects
and control the storage method and location.

The existing storage mechanism, namely a directory with a file for each
database object, is perfectly adequate and doesn't have to be replaced to
implement tablespaces. All that's needed is that the location of the
directory be associated with the "tablespace" of the object rather than
be a global constant.

Implementing an Oracle-style filesystem is just one more temptation to
reimplement OS services in the database. Personally I think it's an awful
idea. But even if postgres did it as an option, it wouldn't necessarily
have anything to do with tablespaces.

Doing this is not just what you call it. In a system with let's say 500
active backends on a database with let's say 1000 things that are
represented as a file, you'll need half a million virtual file
descriptors.

I'm sort of a purist: I think that operating systems should be operating
systems and applications should be applications. Whenever you try to do
application-like things in an OS, it is a mistake. Whenever you try to do
OS-like things in an application, it, also, is a mistake.

Say a database has close to a thousand active files and you have 100
concurrent users. Why do you think that this could be handled better in
an application? Are you saying that PostgreSQL could do a better job at
managing 1/2 million shared file descriptors than the OS?

Your example, IMHO, points out why you *shouldn't* try to have a dedicated
file system.