Potential Large Performance Gain in WAL synching
I've been looking at the TODO lists and caching issues and think there may
be a way to greatly improve the performance of the WAL.
I've made the following assumptions based on my reading in the manual and
the WAL archives since about November 2000:
1) WAL is currently fsync'd before commit succeeds. This is done to ensure
that the D in ACID is satisfied.
2) The wait on fsync is the biggest time cost for inserts or updates.
3) fsync itself probably increases contention for file i/o on the same file
since some OS file system cache structures must be locked as part of fsync.
Depending on the file system this could be a significant choke on total i/o
throughput.
The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded.
I'm not familiar with the exact WAL implementation in PostgreSQL, but I am
familiar with others, including ARIES II. However, it seems that it comes
down to making sure that the write to the WAL log has been positively
written to disk.
So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
the log writing using aio_write and block until the last log aio request
has completed, using aio_waitcomplete. The call to aio_waitcomplete won't
return until the log record has been written to the disk. Opening with
O_DSYNC ensures that when i/o completes the write has been written to the
disk, and aio_write with O_APPEND opened files ensures that writes append in
the order they are received, hence when the aio_write for the last log entry
for a transaction completes, the transaction can be sure that its log
records are in durable storage (IDE problems aside).
It seems to me that this would:
1) Preserve the required D semantics.
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
3) Obviate the need for commit_delay, since there is no blocking and the
file system and the disk controller can put multiple writes to the log
together as the drive is waiting for the end of the log file to come under
one of the heads.
Here are the relevant TODO's:
Delay fsync() when other backends are about to commit too [fsync]
Determine optimal commit_delay value
Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
Allow multiple blocks to be written to WAL with one write()
Am I missing something?
Curtis Faith
Principal
Galt Capital, LLP
------------------------------------------------------------------
Galt Capital http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549 voice: 340.776.0144
Charlotte Amalie, St. Thomas fax: 340.776.0244
United States Virgin Islands 00801 cell: 340.643.5368
"Curtis Faith" <curtis@galtair.com> writes:
So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes?
We already offer an O_DSYNC option. It's not obvious to me what
aio_write brings to the table (aside from loss of portability).
You still have to wait for the final write to complete, no?
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
I'm missing something. There is no useful work that a transaction can
do between writing its commit record and reporting completion, is there?
It has to wait for that record to hit disk.
regards, tom lane
tom lane replies:
"Curtis Faith" <curtis@galtair.com> writes:
So, why don't we use files opened with O_DSYNC | O_APPEND for
the WAL log
and then use aio_write for all log writes?
We already offer an O_DSYNC option. It's not obvious to me what
aio_write brings to the table (aside from loss of portability).
You still have to wait for the final write to complete, no?
Well, for starters, by the time the write which includes the commit
log entry is issued, much of the log for the transaction will already
be on disk, or in a controller on its way.
I don't see any O_NONBLOCK or O_NDELAY references in the sources
so it looks like the log writes are blocking. If I read correctly,
XLogInsert calls XLogWrite which calls write which blocks. If these
assumptions are correct, there should be some significant gain here but I
won't know how much until I try to change it. This issue only affects the
speed of a given back-end's transaction processing capability.
The REAL issue, and the one that will greatly affect total system
throughput, is that of contention on the file locks. Since fsync needs to
obtain a write lock on the file descriptor, as do the write calls which
originate from XLogWrite as the writes are written to the disk, other
back-ends will block while another transaction is committing if the
log cache fills to the point where their XLogInsert results in an
XLogWrite call to flush the log cache. I'd guess this means that one
won't gain much by adding other back-end processes past three or four
if there are a lot of inserts or updates.
The method I propose does not result in any blocking because of writes
other than the final commit's write and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and blocking waiting for their final
commit write to complete).
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
I'm missing something. There is no useful work that a transaction can
do between writing its commit record and reporting completion, is there?
It has to wait for that record to hit disk.
The key here is that a thread that has not committed and therefore is
not blocking can do work while "other threads" (should have said back-ends
or processes) are waiting on their commit writes.
- Curtis
P.S. If I am right in my assumptions about the way the current system
works, I'll bet the change would speed up inserts in Shridhar's huge
database test by at least a factor of two or three, perhaps even an
order of magnitude. :-)
Curtis Faith wrote:
The method I propose does not result in any blocking because of writes
other than the final commit's write and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and blocking waiting for their final
commit write to complete).
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
I'm missing something. There is no useful work that a transaction can
do between writing its commit record and reporting completion, is there?
It has to wait for that record to hit disk.
The key here is that a thread that has not committed and therefore is
not blocking can do work while "other threads" (should have said back-ends
or processes) are waiting on their commit writes.
I may be missing something here, but other backends don't block while
one writes to WAL. Remember, we are process based, not thread based,
so the write() call only blocks the one session. If you had threads,
and you did a write() call that blocked other threads, I can see where
your idea would be good, and where async i/o becomes an advantage.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
"Curtis Faith" <curtis@galtair.com> writes:
The REAL issue, and the one that will greatly affect total system
throughput, is that of contention on the file locks. Since fsync needs to
obtain a write lock on the file descriptor, as do the write calls which
originate from XLogWrite as the writes are written to the disk, other
back-ends will block while another transaction is committing if the
log cache fills to the point where their XLogInsert results in an
XLogWrite call to flush the log cache.
But that's exactly *why* we have a log cache: to ensure we can buffer a
reasonable amount of log data between XLogFlush calls. If the above
scenario is really causing a problem, doesn't that just mean you need
to increase wal_buffers?
regards, tom lane
Bruce Momjian wrote:
I may be missing something here, but other backends don't block while
one writes to WAL.
I don't think they'll block until they get to the fsync or XLogWrite
call while another transaction is fsync'ing.
I'm no Unix filesystem expert but I don't see how the OS can
handle multiple writes and fsyncs to the same file descriptors without
blocking other processes from writing at the same time. It may be that
there are some clever data structures they use but I've not seen huge
praise for most of the file systems. A well written file system could
minimize this contention but I'll bet it's there with most of the ones
that PostgreSQL most commonly runs on.
I'll have to write a test and see if there really is a problem.
- Curtis
I wrote:
The REAL issue, and the one that will greatly affect total system
throughput, is that of contention on the file locks. Since fsync needs to
obtain a write lock on the file descriptor, as do the write calls which
originate from XLogWrite as the writes are written to the disk, other
back-ends will block while another transaction is committing if the
log cache fills to the point where their XLogInsert results in an
XLogWrite call to flush the log cache.
tom lane wrote:
But that's exactly *why* we have a log cache: to ensure we can buffer a
reasonable amount of log data between XLogFlush calls. If the above
scenario is really causing a problem, doesn't that just mean you need
to increase wal_buffers?
Well, in cases where there are a lot of small transactions, the contention
will come not from XLogWrite calls caused by the cache getting full but from
XLogWrite calls at transaction commit, which will happen very frequently.
I think this will have a detrimental effect on very high update frequency
performance.
So while larger WAL caches will help in the case of flushes caused by the
cache filling up, I don't think they will make any difference for the
potentially more common case of transaction commits.
- Curtis
Curtis Faith wrote:
Bruce Momjian wrote:
I may be missing something here, but other backends don't block while
one writes to WAL.
I don't think they'll block until they get to the fsync or XLogWrite
call while another transaction is fsync'ing.
I'm no Unix filesystem expert but I don't see how the OS can
handle multiple writes and fsyncs to the same file descriptors without
blocking other processes from writing at the same time. It may be that
there are some clever data structures they use but I've not seen huge
praise for most of the file systems. A well written file system could
minimize this contention but I'll bet it's there with most of the ones
that PostgreSQL most commonly runs on.
I'll have to write a test and see if there really is a problem.
Yes, I can see some contention, but what does aio solve?
I wrote:
I'm no Unix filesystem expert but I don't see how the OS can
handle multiple writes and fsyncs to the same file descriptors without
blocking other processes from writing at the same time. It may be that
there are some clever data structures they use but I've not seen huge
praise for most of the file systems. A well written file system could
minimize this contention but I'll bet it's there with most of the ones
that PostgreSQL most commonly runs on.I'll have to write a test and see if there really is a problem.
Bruce Momjian wrote:
Yes, I can see some contention, but what does aio solve?
Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes.
The disk i/o scheduler can therefore issue the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.
I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important because they should be able to
handle a much higher level of concurrent i/o.
Now whether or not the common file systems handle this well, I can't say.
Take a look at some comments on how Oracle uses asynchronous I/O
http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm
It seems that OS support for this will likely increase and that this
issue will become more and more important as users contemplate SMP systems
or if threading is added to certain PostgreSQL subsystems.
It might be easier for me to implement the change I propose and then
see what kind of difference it makes.
I wanted to run the idea past this group first. We can all postulate
whether or not it will work but we won't know unless we try it. My real
issue is one of what happens in the event that it does work.
I've had very good luck implementing this sort of thing for other systems
but I don't yet know the range of i/o requests that PostgreSQL makes.
Assuming we can demonstrate no detrimental effects on system reliability
and that the change is implemented in such a way that it can be turned
on or off easily, will a 50% or better increase in speed for updates
justify the sort of change I am proposing? 20%? 10%?
- Curtis
Curtis Faith wrote:
Yes, I can see some contention, but what does aio solve?
Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes.
The disk i/o scheduler can therefore issue the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.
I am again confused. When we do write(), we don't have to lock
anything, do we? (Multiple processes can write() to the same file just
fine.) We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed. Does aio more easily allow the
kernel to order those writes? Is that the issue? Well, certainly the
kernel already orders the writes. Just because we write() doesn't mean
it goes to disk. Only fsync() or the kernel do that.
I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important because they should be able to
handle a much higher level of concurrent i/o.
Now whether or not the common file systems handle this well, I can't say.
Take a look at some comments on how Oracle uses asynchronous I/O
http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm
Yes, but Oracle is threaded, right, so, yes, they clearly could win with
it. I read the second URL and it said we could issue separate writes
and have them be done in an optimal order. However, we use the file
system, not raw devices, so don't we already have that in the kernel
with fsync()?
It seems that OS support for this will likely increase and that this
issue will become more and more important as uses contemplate SMP systems
or if threading is added to certain PostgreSQL subsystems.
Probably. Having seen the Informix 5/7 debacle, I don't want to fall
into the trap where we add stuff that just makes things faster on
SMP/threaded systems when it makes our code _slower_ on single CPU
systems, which is exactly what Informix did in Informix 7, and we know
how that ended (lost customers, bought by IBM). I don't think that's
going to happen to us, but I thought I would mention it.
Assuming we can demonstrate no detrimental effects on system reliability
and that the change is implemented in such a way that it can be turned
on or off easily, will a 50% or better increase in speed for updates
justify the sort of change I am proposing? 20%? 10%?
Yea, let's see what boost we get, and the size of the patch, and we can
review it. It is certainly worth researching.
Bruce Momjian wrote:
I am again confused. When we do write(), we don't have to lock
anything, do we? (Multiple processes can write() to the same file just
fine.) We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed. Does aio more easily allow the
kernel to order those writes? Is that the issue? Well, certainly the
kernel already orders the writes. Just because we write() doesn't mean
it goes to disk. Only fsync() or the kernel do that.
"We" don't have to lock anything, but most file systems can't process
fsync's
simultaneous with other writes, so those writes block because the file
system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually
longer so it holds the locks longer over more pages and structures.
That is the real issue. The contention caused by fsync'ing very frequently
which blocks other writers and readers.
For the buffer manager, the blocking of readers is probably even more
problematic when the cache is a small percentage (say < 10% to 15%) of
the total database size because most leaf node accesses will result in
a read. Each of these reads will have to wait on the fsync as well. Again,
a very well written file system probably can minimize this but I've not
seen any.
Further comment on:
<We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.
Writing out a bunch of buffers at the end, after having consumed a lot
of CPU cycles, and then waiting is not as efficient as writing them out
while those CPU cycles are being used. We are currently wasting the
time it takes for a given process to write.
The thinking probably has been that this is no big deal because other
processes, say B, C and D can use the CPU cycles while process A blocks.
This is true UNLESS the other processes are blocking on reads or
writes caused by process A doing the final writes and fsync.
Yes, but Oracle is threaded, right, so, yes, they clearly could win with
it. I read the second URL and it said we could issue separate writes
and have them be done in an optimal order. However, we use the file
system, not raw devices, so don't we already have that in the kernel
with fsync()?
Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after a
fsync starts it will have to wait on that fsync.
Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.
Probably. Having seen the Informix 5/7 debacle, I don't want to fall
into the trap where we add stuff that just makes things faster on
SMP/threaded systems when it makes our code _slower_ on single CPU
systems, which is exactly what Informix did in Informix 7, and we know
how that ended (lost customers, bought by IBM). I don't think that's
going to happen to us, but I thought I would mention it.
Yes, I hate "improvements" that make things worse for most people. Any
changes I'd contemplate would be simply another configuration driven
optimization that could be turned off very easily.
- Curtis
"Curtis Faith" <curtis@galtair.com> writes:
... most file systems can't process fsyncs
simultaneously with other writes, so those writes block because the file
system grabs its own internal locks.
Oh? That would be a serious problem, but I've never heard that asserted
before. Please provide some evidence.
On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files? Then there's no need to call
fsync() at all, except during checkpoints (which actually issue sync()
not fsync(), anyway).
Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after a
fsync starts it will have to wait on that fsync.
AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
on safety grounds (we want to be sure we have a consistent WAL up to the
end of what we've written). Even if we can allow some reordering when a
single transaction puts out a large volume of WAL data, I fail to see
where any large gain is going to come from. We're going to be issuing
those writes sequentially and that ought to match the disk layout about
as well as can be hoped anyway.
Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.
We do not fsync buffer pages; in fact a transaction commit doesn't write
buffer pages at all. I think the above is just a misunderstanding of
what's really happening. We have synchronous WAL writing, agreed, but
we want that AFAICS. Data block writes are asynchronous (between
checkpoints, anyway).
There is one thing in the current WAL code that I don't like: if the WAL
buffers fill up then everybody who would like to make WAL entries is
forced to wait while some space is freed, which means a write, which is
synchronous if you are using O_DSYNC. It would be nice to have a
background process whose only task is to issue write()s as soon as WAL
pages are filled, thus reducing the probability that foreground
processes have to wait for WAL writes (when they're not committing that
is). But this could be done portably with one more postmaster child
process; I see no real need to dabble in aio_write.
regards, tom lane
... most file systems can't process fsyncs
simultaneously with other writes, so those writes block because the file
system grabs its own internal locks.
Oh? That would be a serious problem, but I've never heard that asserted
before. Please provide some evidence.
On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files?
To make this competitive, the WAL writes would need to be improved to
do more than one block (up to 256k or 512k per write) with one write call
(if that much is to be written for this tx to be able to commit).
This should actually not be too difficult since the WAL buffer is already
contiguous memory.
If that is done, then I bet O_DSYNC will beat any other config we currently
have.
With this, a separate disk for WAL, and large transactions, you should be
able to see your disks hit the max IO figures they are capable of :-)
Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
To make this competitive, the WAL writes would need to be improved to
do more than one block (up to 256k or 512k per write) with one write call
(if that much is to be written for this tx to be able to commit).
This should actually not be too difficult since the WAL buffer is already
contiguous memory.
Hmmm ... if you were willing to dedicate a half meg or meg of shared
memory for WAL buffers, that's doable. I was originally thinking of
having the (still hypothetical) background process wake up every time a
WAL page was completed and available to write. But it could be set up
so that there is some "slop", and it only wakes up when the number of
writable pages exceeds N, for some N that's still well less than the
number of buffers. Then it could write up to N sequential pages in a
single write().
However, this would only be a win if you had few and large transactions.
Any COMMIT will force a write of whatever we have so far, so the idea of
writing hundreds of K per WAL write can only work if it's hundreds of K
between commit records. Is that a common scenario? I doubt it.
If you try to set it up that way, then it's more likely that what will
happen is the background process seldom awakens at all, and each
committer effectively becomes responsible for writing all the WAL
traffic since the last commit. Wouldn't that lose compared to someone
else having written the previous WAL pages in background?
We could certainly build the code to support this, though, and then
experiment with different values of N. If it turns out N==1 is best
after all, I don't think we'd have wasted much code.
regards, tom lane
I wrote:
... most file systems can't process fsyncs
simultaneously with other writes, so those writes block because the file
system grabs its own internal locks.
tom lane replies:
Oh? That would be a serious problem, but I've never heard that asserted
before. Please provide some evidence.
Well I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write
a test to prove this one way or another. I'll also try and look into
the linux/BSD source for the common file systems used for PostgreSQL.
On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files? Then there's no need to call
fsync() at all, except during checkpoints (which actually issue sync()
not fsync(), anyway).
No, they're not exactly the same thing. Consider:
Process A File System
--------- -----------
Writes index buffer .idling...
Writes entry to log cache .
Writes another index buffer .
Writes another log entry .
Writes tuple buffer .
Writes another log entry .
Index scan .
Large table sort .
Writes tuple buffer .
Writes another log entry .
Writes .
Writes another index buffer .
Writes another log entry .
Writes another index buffer .
Writes another log entry .
Index scan .
Large table sort .
Commit .
File Write Log Entry .
.idling... Write to cache
File Write Log Entry .idling...
.idling... Write to cache
File Write Log Entry .idling...
.idling... Write to cache
File Write Log Entry .idling...
.idling... Write to cache
Write Commit Log Entry .idling...
.idling... Write to cache
Call fsync .idling...
.idling... Write all buffers to device.
.DONE.
In this case, Process A is waiting for all the buffers to write
at the end of the transaction.
With asynchronous I/O this becomes:
Process A File System
--------- -----------
Writes index buffer .idling...
Writes entry to log cache Queue up write - move head to cylinder
Writes another index buffer Write log entry to media
Writes another log entry Immediate write to cylinder since head is
still there.
Writes tuple buffer .
Writes another log entry Queue up write - move head to cylinder
Index scan .busy with scan...
Large table sort Write log entry to media
Writes tuple buffer .
Writes another log entry Queue up write - move head to cylinder
Writes .
Writes another index buffer Write log entry to media
Writes another log entry Queue up write - move head to cylinder
Writes another index buffer .
Writes another log entry Write log entry to media
Index scan .
Large table sort Write log entry to media
Commit .
Write Commit Log Entry Immediate write to cylinder since head is
still there.
.DONE.
Effectively the real work of writing the cache is done while the CPU
for the process is busy doing index scans, sorts, etc. With the WAL
log on another device and SCSI I/O the log writing should almost always be
done except for the final commit write.
Whether by threads or multiple processes, there is the same
contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after a
fsync starts it will have to wait on that fsync.
AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
on safety grounds (we want to be sure we have a consistent WAL up to the
end of what we've written). Even if we can allow some reordering when a
single transaction puts out a large volume of WAL data, I fail to see
where any large gain is going to come from. We're going to be issuing
those writes sequentially and that ought to match the disk layout about
as well as can be hoped anyway.
My comment was applying to reads and writes of other processes not the
WAL log. In my original email, recall I mentioned using the O_APPEND
open flag which will ensure that all log entries are done sequentially.
Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.
We do not fsync buffer pages; in fact a transaction commit doesn't write
buffer pages at all. I think the above is just a misunderstanding of
what's really happening. We have synchronous WAL writing, agreed, but
we want that AFAICS. Data block writes are asynchronous (between
checkpoints, anyway).
Hmm, I keep hearing that buffer block writes are asynchronous but I don't
read that in the code at all. There are simple "write" calls on files
that are not opened with O_NONBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the
storage manager abstraction) so I don't see what I might be missing.
It's true that data blocks are not required to be written before the
transaction commits, so they are in some sense asynchronous to the
transactions. However, they can still block the process later on: when a
request for a new buffer lands on a dirty block, the process is forced to
write that block out of the cache first.
It looks to me like BufferAlloc will simply result in a call to
BufferReplace > smgrblindwrt > write for md storage manager objects.
This means that a process will block while the write of dirty cache
buffers takes place.
I'm happy to be wrong on this but I don't see any hard evidence
of asynch file calls anywhere in the code. Unless I am missing something
this is a huuuuge problem.
There is one thing in the current WAL code that I don't like: if the WAL
buffers fill up then everybody who would like to make WAL entries is
forced to wait while some space is freed, which means a write, which is
synchronous if you are using O_DSYNC. It would be nice to have a
background process whose only task is to issue write()s as soon as WAL
pages are filled, thus reducing the probability that foreground
processes have to wait for WAL writes (when they're not committing that
is). But this could be done portably with one more postmaster child
process; I see no real need to dabble in aio_write.
Hmm, well, another process writing the log would accomplish the same thing,
but isn't that what the file system already is? ISTM that aio_write is quite
a bit easier and higher-performance, especially on those OSes which have
kernel AIO (KAIO) support.
- Curtis
"Curtis Faith" <curtis@galtair.com> writes:
It looks to me like BufferAlloc will simply result in a call to
BufferReplace > smgrblindwrt > write for md storage manager objects.
This means that a process will block while the write of dirty cache
buffers takes place.
I think Tom was suggesting that when a buffer is written out, the
write() call only pushes the data down into the filesystem's buffer --
which is free to then write the actual blocks to disk whenever it
chooses to. In other words, the write() returns, the backend process
can continue with what it was doing, and at some later time the blocks
that we flushed from the Postgres buffer will actually be written to
disk. So in some sense of the word, that I/O is asynchronous.
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?
Here's the evidence:
Code from: /usr/src/sys/kern/vfs_syscalls.c
int
fsync(p, uap)
    struct proc *p;
    struct fsync_args /* {
        syscallarg(int) fd;
    } */ *uap;
{
    register struct vnode *vp;
    struct file *fp;
    vm_object_t obj;
    int error;

    if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
        return (error);
    vp = (struct vnode *)fp->f_data;
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    if (VOP_GETVOBJECT(vp, &obj) == 0)
        vm_object_page_clean(obj, 0, 0, 0);
    if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
        vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
        bioops.io_fsync)
        error = (*bioops.io_fsync)(vp);
    VOP_UNLOCK(vp, 0, p);
    return (error);
}
Notice the calls to:
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);
surrounding the call to VOP_FSYNC.
From the man pages for VOP_UNLOCK:
HEADER STUFF .....
int
VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

int
VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

int
VOP_ISLOCKED(struct vnode *vp, struct proc *p);

int
vn_lock(struct vnode *vp, int flags, struct proc *p);

DESCRIPTION
     These calls are used to serialize access to the filesystem, such as
     to prevent two writes to the same file from happening at the same
     time.  The arguments are:

     vp     the vnode being locked or unlocked

     flags  One of the lock request types:

            LK_SHARED       Shared lock
            LK_EXCLUSIVE    Exclusive lock
            LK_UPGRADE      Shared-to-exclusive upgrade
            LK_EXCLUPGRADE  First shared-to-exclusive upgrade
            LK_DOWNGRADE    Exclusive-to-shared downgrade
            LK_RELEASE      Release any type of lock
            LK_DRAIN        Wait for all lock activity to end

            The lock type may be or'ed with these lock flags:

            LK_NOWAIT       Do not sleep to wait for lock
            LK_SLEEPFAIL    Sleep, then return failure
            LK_CANRECURSE   Allow recursive exclusive lock
            LK_REENABLE     Lock is to be reenabled after drain
            LK_NOPAUSE      No spinloop

            The lock type may be or'ed with these control flags:

            LK_INTERLOCK    Specify when the caller already has a simple
                            lock (VOP_LOCK will unlock the simple lock
                            after getting the lock)
            LK_RETRY        Retry until locked
            LK_NOOBJ        Don't create object

     p      process context to use for the locks

Kernel code should use vn_lock() to lock a vnode rather than calling
VOP_LOCK() directly.
Hmmm ... if you were willing to dedicate half a meg or a meg of shared
memory for WAL buffers, that's doable.
Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here.
However, this would only be a win if you had few and large transactions.
Any COMMIT will force a write of whatever we have so far, so the idea of
writing hundreds of K per WAL write can only work if it's hundreds of K
between commit records. Is that a common scenario? I doubt it.
It should help most for data loading, or mass updating, yes.
Andreas
On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
"Curtis Faith" <curtis@galtair.com> writes:
It looks to me like BufferAlloc will simply result in a call to
BufferReplace > smgrblindwrt > write for md storage manager objects.
This means that a process will block while the write of dirty cache
buffers takes place.

I think Tom was suggesting that when a buffer is written out, the
write() call only pushes the data down into the filesystem's buffer --
which is free to then write the actual blocks to disk whenever it
chooses to. In other words, the write() returns, the backend process
can continue with what it was doing, and at some later time the blocks
that we flushed from the Postgres buffer will actually be written to
disk. So in some sense of the word, that I/O is asynchronous.
Isn't that true only as long as there is buffer space available? When
there isn't buffer space available, it seems the window for blocking comes
into play? So I guess you could say it is optimally asynchronous and
worst-case synchronous. I think the worst case is the situation he's
trying to address.
At least that's how I interpret it.
Greg
Curtis Faith writes:
I'm no Unix filesystem expert but I don't see how the OS can handle
multiple writes and fsyncs to the same file descriptors without
blocking other processes from writing at the same time.
Why not? Other than the necessary synchronisation for attributes such
as file size and modification times, multiple processes can readily
write to different areas of the same file at the "same" time.
fsync() may not return until after the buffers it schedules are
written, but it doesn't have to block subsequent writes to different
buffers in the file either. (Note too Tom Lane's responses about
when fsync() is used and not used.)
I'll have to write a test and see if there really is a problem.
Please do. I expect you'll find things aren't as bad as you fear.
In another posting, you write:
Hmm, I keep hearing that buffer block writes are asynchronous but I don't
read that in the code at all. There are simple write() calls on files
that are not opened with O_NONBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the
storage manager abstraction), so I don't see what I might be missing.
There is a confusion of terminology here: the write() is synchronous
from the point of the application only in that the data is copied into
kernel buffers (or pages remapped, or whatever) before the system call
returns. For files opened with O_DSYNC the write() would wait for the
data to be written to disk. Thus O_DSYNC is "synchronous" I/O, but
there is no equivalently easy name for the regular "flush to disk
after write() returns" that the Unix kernel has done ~forever.
The asynchronous I/O that you mention ("aio") is a third thing,
different from both regular write() and write() with O_DSYNC. I
understand that with aio the data is not even transferred to the
kernel before the aio_write() call returns, but I've never programmed
with aio and am not 100% sure how it works.
Regards,
Giles