CommitDelay performance improvement
Looking at the XLOG stuff, I notice that we already have a field
(logRec) in the per-backend PROC structures that shows whether a
transaction is currently in progress with at least one change made
(ie at least one XLOG entry written).
It would be very easy to extend the existing code so that the commit
delay is not done unless there is at least one other backend with
nonzero logRec --- or, more generally, at least N other backends with
nonzero logRec. We cannot tell if any of them are actually nearing
their commits, but this seems better than just blindly waiting. Larger
values of N would presumably improve the odds that at least one of them
is nearing its commit.
A further refinement, still quite cheap to implement since the info is
in the PROC struct, would be to not count backends that are blocked
waiting for locks. These guys are less likely to be ready to commit
in the next few milliseconds than the guys who are actively running;
indeed they cannot commit until someone else has committed/aborted to
release the lock they need.
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
regards, tom lane
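A minimal C sketch of the sibling-counting heuristic proposed above, for illustration only. BackendSlot and its fields are hypothetical stand-ins for the per-backend PROC structure (has_xlog_entry models a nonzero logRec), and the function names are invented here; none of them come from the PostgreSQL source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-backend PROC entry; only the facts
 * the heuristic needs are modelled. */
typedef struct BackendSlot
{
    bool in_use;           /* slot belongs to a live backend */
    bool has_xlog_entry;   /* models "nonzero logRec" */
    bool waiting_on_lock;  /* backend is blocked waiting for a lock */
} BackendSlot;

/* Count the OTHER backends that have written at least one XLOG entry
 * in their current transaction and are not blocked on a lock. */
static int
count_active_siblings(const BackendSlot *slots, size_t nslots, size_t self)
{
    int count = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        if (i == self || !slots[i].in_use)
            continue;
        if (slots[i].has_xlog_entry && !slots[i].waiting_on_lock)
            count++;
    }
    return count;
}

/* Do the commit delay only when at least min_siblings such backends
 * exist. With min_siblings = 0 this always returns true. */
static bool
should_delay_commit(const BackendSlot *slots, size_t nslots,
                    size_t self, int min_siblings)
{
    return count_active_siblings(slots, nslots, self) >= min_siblings;
}
```

Under this sketch a threshold of 0 means "always delay", and larger thresholds make the delay progressively rarer.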
Looking at the XLOG stuff, I notice that we already have a field
(logRec) in the per-backend PROC structures that shows whether a
transaction is currently in progress with at least one change made
(ie at least one XLOG entry written).
It would be very easy to extend the existing code so that the commit
delay is not done unless there is at least one other backend with
nonzero logRec --- or, more generally, at least N other backends with
nonzero logRec. We cannot tell if any of them are actually nearing
their commits, but this seems better than just blindly waiting. Larger
values of N would presumably improve the odds that at least one of them
is nearing its commit.
Why not just set a flag in there when someone nears commit and clear
when they are about to commit?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Why not just set a flag in there when someone nears commit and clear
when they are about to commit?
Define "nearing commit", in such a way that you can specify where you
plan to set that flag.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Why not just set a flag in there when someone nears commit and clear
when they are about to commit?
Define "nearing commit", in such a way that you can specify where you
plan to set that flag.
Is there significant time between entry of CommitTransaction() and the
fsync()? Maybe not.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is there significant time between entry of CommitTransaction() and the
fsync()? Maybe not.
I doubt it. No I/O anymore, anyway, unless the commit record happens to
overrun an xlog block boundary.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is there significant time between entry of CommitTransaction() and the
fsync()? Maybe not.
I doubt it. No I/O anymore, anyway, unless the commit record happens to
overrun an xlog block boundary.
That's what I was afraid of. Since we don't write the dirty blocks to
the kernel anymore, we don't really have much happening before someone
says they are about to commit. In the old days, we were write()'ing
those buffers, and we had some delay and kernel calls in there.
Guess that idea is dead.
--
Bruce Momjian
On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
A further refinement, still quite cheap to implement since the info is
in the PROC struct, would be to not count backends that are blocked
waiting for locks. These guys are less likely to be ready to commit
in the next few milliseconds than the guys who are actively running;
indeed they cannot commit until someone else has committed/aborted to
release the lock they need.
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.
Nathan Myers
ncm@zembu.com
On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
A further refinement, still quite cheap to implement since the info is
in the PROC struct, would be to not count backends that are blocked
waiting for locks. These guys are less likely to be ready to commit
in the next few milliseconds than the guys who are actively running;
indeed they cannot commit until someone else has committed/aborted to
release the lock they need.
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.
I wonder if Tom should implement it, but leave it at zero until people
can report that a non-zero helps. We already have the parameter, we can
just make it smarter and let people test it.
--
Bruce Momjian
ncm@zembu.com (Nathan Myers) writes:
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.
If we had a good idea what the default level should be, I'd be willing
to go without a knob. I'm thinking of a default of about 5 (ie, at
least 5 other active backends to trigger a commit delay) ... but I'm not
so confident of that that I think it needn't be tunable. It's really
dependent on your average and peak transaction lengths, and that's
going to vary across installations, so unless we want to try to make it
self-adjusting, a knob seems like a good idea.
A self-adjusting delay might well be a great idea, BTW, but I'm trying
to be conservative about how much complexity we should add right now.
regards, tom lane
ncm@zembu.com (Nathan Myers) writes:
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.
If we had a good idea what the default level should be, I'd be willing
to go without a knob. I'm thinking of a default of about 5 (ie, at
least 5 other active backends to trigger a commit delay) ... but I'm not
so confident of that that I think it needn't be tunable. It's really
dependent on your average and peak transaction lengths, and that's
going to vary across installations, so unless we want to try to make it
self-adjusting, a knob seems like a good idea.
A self-adjusting delay might well be a great idea, BTW, but I'm trying
to be conservative about how much complexity we should add right now.
OH, so you are saying N backends should have dirtied buffers before
doing the delay? Hmm, that seems almost untunable to me.
Let's suppose we decide to sleep. When we wake up, can we know that
someone else has fsync'ed for us? And if they have, should we be more
likely to fsync() in the future?
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
A self-adjusting delay might well be a great idea, BTW, but I'm trying
to be conservative about how much complexity we should add right now.
OH, so you are saying N backends should have dirtied buffers before
doing the delay? Hmm, that seems almost untunable to me.
Let's suppose we decide to sleep. When we wake up, can we know that
someone else has fsync'ed for us?
XLogFlush will find that it has nothing to do, so yes we can.
And if they have, should we be more
likely to fsync() in the future?
You mean less likely. My thought for a self-adjusting delay was to
ratchet the delay up a little every time it succeeds in avoiding an
fsync, and down a little every time it fails to do so. No change when
we don't delay at all (because of no other active backends). But
testing this and making sure it behaves reasonably seems like more work
than we should try to accomplish before 7.1.
regards, tom lane
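A rough C sketch of the ratcheting idea described above, under stated assumptions. DelayTuner, DelayOutcome, tune_delay and every constant are hypothetical, and the lower and upper bounds are additions of this sketch (so the value can neither run away nor stick at zero); the message itself specifies only "up a little", "down a little", and "no change when we don't delay".

```c
/* Hypothetical tuning state; all values in microseconds. */
typedef struct DelayTuner
{
    int delay_us;  /* current commit delay */
    int step_us;   /* size of one ratchet step */
    int min_us;    /* assumed floor, so the delay never reaches zero */
    int max_us;    /* assumed ceiling on the delay */
} DelayTuner;

typedef enum
{
    DELAY_SKIPPED,      /* no other active backends, so we did not delay */
    DELAY_SAVED_FSYNC,  /* after sleeping, the flush had already been done */
    DELAY_WASTED        /* we slept and still had to fsync ourselves */
} DelayOutcome;

/* Ratchet the delay up after a success, down after a failure,
 * and leave it alone when no delay was attempted. */
static void
tune_delay(DelayTuner *t, DelayOutcome outcome)
{
    switch (outcome)
    {
        case DELAY_SAVED_FSYNC:
            t->delay_us += t->step_us;
            if (t->delay_us > t->max_us)
                t->delay_us = t->max_us;
            break;
        case DELAY_WASTED:
            t->delay_us -= t->step_us;
            if (t->delay_us < t->min_us)
                t->delay_us = t->min_us;
            break;
        case DELAY_SKIPPED:
            break;
    }
}
```

Whether a simple additive step like this behaves reasonably under real load is exactly the open question the message raises; the sketch only pins down the bookkeeping.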
And if they have, should we be more
likely to fsync() in the future?
I meant more likely to sleep().
You mean less likely. My thought for a self-adjusting delay was to
ratchet the delay up a little every time it succeeds in avoiding an
fsync, and down a little every time it fails to do so. No change when
we don't delay at all (because of no other active backends). But
testing this and making sure it behaves reasonably seems like more work
than we should try to accomplish before 7.1.
It could be tough. Imagine the delay increasing to 3 seconds? Seems
there has to be an upper bound on the sleep. The more you delay, the
more likely you will be to find someone to fsync you. Are we waking
processes up after we have fsync()'ed them? If so, we can keep
increasing the delay.
--
Bruce Momjian
On Fri, Feb 23, 2001 at 05:18:19PM -0500, Tom Lane wrote:
ncm@zembu.com (Nathan Myers) writes:
Comments? What should the threshold N be ... or do we need to make
that a tunable parameter?
Once you make it tuneable, you're stuck with it. You can always add
a knob later, after somebody discovers a real need.
If we had a good idea what the default level should be, I'd be willing
to go without a knob. I'm thinking of a default of about 5 (ie, at
least 5 other active backends to trigger a commit delay) ... but I'm not
so confident of that that I think it needn't be tunable. It's really
dependent on your average and peak transaction lengths, and that's
going to vary across installations, so unless we want to try to make it
self-adjusting, a knob seems like a good idea.
A self-adjusting delay might well be a great idea, BTW, but I'm trying
to be conservative about how much complexity we should add right now.
When thinking about tuning N, I like to consider what are the interesting
possible values for N:
0: Ignore any other potential committers.
1: The minimum possible responsiveness to other committers.
5: Tom's guess for what might be a good choice.
10: Harry's guess.
~0: Always delay.
I would rather release with N=1 than with 0, because it actually responds
to conditions. What N might best be, >1, probably varies on a lot of
hard-to-guess parameters.
It seems to me that comparing various choices (and other, more interesting,
algorithms) to the N=1 case would be more productive than comparing them
to the N=0 case, so releasing at N=1 would yield better statistics for
actually tuning in 7.2.
Nathan Myers
ncm@zembu.com
Bruce Momjian <pgman@candle.pha.pa.us> writes:
It could be tough. Imagine the delay increasing to 3 seconds? Seems
there has to be an upper bound on the sleep. The more you delay, the
more likely you will be to find someone to fsync you.
Good point, and an excellent illustration of the fact that
self-adjusting algorithms aren't that easy to get right the first
time ;-)
Are we waking processes up after we have fsync()'ed them?
Not at the moment. That would be another good mechanism to investigate
for 7.2; but right now there's no infrastructure that would allow a
backend to discover which other ones were sleeping for fsync.
regards, tom lane
When thinking about tuning N, I like to consider what are the interesting
possible values for N:
0: Ignore any other potential committers.
1: The minimum possible responsiveness to other committers.
5: Tom's guess for what might be a good choice.
10: Harry's guess.
~0: Always delay.
I would rather release with N=1 than with 0, because it actually responds
to conditions. What N might best be, >1, probably varies on a lot of
hard-to-guess parameters.
It seems to me that comparing various choices (and other, more interesting,
algorithms) to the N=1 case would be more productive than comparing them
to the N=0 case, so releasing at N=1 would yield better statistics for
actually tuning in 7.2.
We don't release code because it has better tuning opportunities for
later releases. What we can do is give people parameters where the
default is safe, and they can play and report to us.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
It could be tough. Imagine the delay increasing to 3 seconds? Seems
there has to be an upper bound on the sleep. The more you delay, the
more likely you will be to find someone to fsync you.
Good point, and an excellent illustration of the fact that
self-adjusting algorithms aren't that easy to get right the first
time ;-)
I see. I am concerned that anything done to 7.1 at this point may cause
problems with performance under certain circumstances. Let's see what
the new code shows our testers.
Are we waking processes up after we have fsync()'ed them?
Not at the moment. That would be another good mechanism to investigate
for 7.2; but right now there's no infrastructure that would allow a
backend to discover which other ones were sleeping for fsync.
Can we put the backends to sleep waiting for a lock, and have them wake
up later?
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Can we put the backends to sleep waiting for a lock, and have them wake
up later?
Locks don't have timeouts. There is no existing mechanism that will
serve this purpose; we'll have to create a new one.
regards, tom lane
On Fri, Feb 23, 2001 at 06:37:06PM -0500, Bruce Momjian wrote:
When thinking about tuning N, I like to consider what are the interesting
possible values for N:
0: Ignore any other potential committers.
1: The minimum possible responsiveness to other committers.
5: Tom's guess for what might be a good choice.
10: Harry's guess.
~0: Always delay.
I would rather release with N=1 than with 0, because it actually
responds to conditions. What N might best be, >1, probably varies on
a lot of hard-to-guess parameters.
It seems to me that comparing various choices (and other, more
interesting, algorithms) to the N=1 case would be more productive
than comparing them to the N=0 case, so releasing at N=1 would yield
better statistics for actually tuning in 7.2.
We don't release code because it has better tuning opportunities for
later releases. What we can do is give people parameters where the
default is safe, and they can play and report to us.
Perhaps I misunderstood. I had perceived N=1 as a conservative choice
that was nevertheless preferable to N=0.
Nathan Myers
ncm@zembu.com
It seems to me that comparing various choices (and other, more
interesting, algorithms) to the N=1 case would be more productive
than comparing them to the N=0 case, so releasing at N=1 would yield
better statistics for actually tuning in 7.2.
We don't release code because it has better tuning opportunities for
later releases. What we can do is give people parameters where the
default is safe, and they can play and report to us.
Perhaps I misunderstood. I had perceived N=1 as a conservative choice
that was nevertheless preferable to N=0.
I think zero delay is the conservative choice at this point, unless we
hear otherwise from testers.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Can we put the backends to sleep waiting for a lock, and have them wake
up later?
Locks don't have timeouts. There is no existing mechanism that will
serve this purpose; we'll have to create a new one.
That is what I suspected.
Having thought about it, we currently have a few options:
1) let every backend fsync on its own
2) try to delay backends so they all fsync() at the same time
3) delay fsync until after commit
Items 2 and 3 attempt to bunch up fsyncs. Option 2 has backends waiting
to fsync() on the expectation that some other backend may commit soon.
Option 3 may turn out to be the best solution. No matter how smart we
make the code, we will never know for sure if someone is about to commit
and whether it is worth waiting.
My idea would be to let committing backends return "COMMIT" to the user,
and set a need_fsync flag that is guaranteed to cause an fsync within X
milliseconds. This way, if other backends commit in the next X
milliseconds, they can all use one fsync().
Now, I know many will complain that we are returning commit while not
having the stuff on the platter. But consider, we only lose data from an
OS crash or hardware failure. Do people who commit something, and then
the machine crashes 2 milliseconds after the commit, really expect the
data to be on the disk when they restart? Maybe they do, but it seems
the benefit of grouped fsyncs() is large enough that many will say they
would rather have this option.
This was my point long ago that we could offer sub-second reliability
with no-fsync performance if we just had some process running that wrote
dirty pages and fsynced every 20 milliseconds.
--
Bruce Momjian
At 14:57 23/02/01 -0800, Nathan Myers wrote:
When thinking about tuning N, I like to consider what are the interesting
possible values for N:
It may have been much earlier in the debate, but has anyone checked to see
what the maximum possible gains might be - or is it self-evident to people
who know the code?
Would it be worth considering creating a test case with no flush in
RecordTransactionCommit, and rely on checkpointing to flush? I realize this
is never an option in production, but is it possible to modify the code in
this way? It *should* give an upper limit on the gains that can be made by
flushing at the best possible time.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.B.N. 75 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
At 21:31 23/02/01 -0500, Bruce Momjian wrote:
Now, I know many will complain that we are returning commit while not
having the stuff on the platter.
You're definitely right there.
Maybe they do, but it seems
the benefit of grouped fsyncs() is large enough that many will say they
would rather have this option.
I'd prefer to wait for a lock manager that supports timeouts and contention
notification.
----------------------------------------------------------------
Philip Warner
At 11:32 23/02/01 -0500, Tom Lane wrote:
Looking at the XLOG stuff, I notice that we already have a field
(logRec) in the per-backend PROC structures that shows whether a
transaction is currently in progress with at least one change made
(ie at least one XLOG entry written).
Would it be worth adding a field 'waiting for fsync since xxx', so the
second process can (a) log that it is expecting someone else to FSYNC (for
perf stats, if we want them), and (b) wait for (xxx + delta)ms/us etc?
----------------------------------------------------------------
Philip Warner
At 21:31 23/02/01 -0500, Bruce Momjian wrote:
Now, I know many will complain that we are returning commit while not
having the stuff on the platter.
You're definitely right there.
Maybe they do, but it seems
the benefit of grouped fsyncs() is large enough that many will say they
would rather have this option.
I'd prefer to wait for a lock manager that supports timeouts and contention
notification.
I understand, and I would agree if that were going to fix the problem
completely, but it isn't. It is just going to allow us more flexibility
in guessing who may be about to commit.
--
Bruce Momjian
At 21:31 23/02/01 -0500, Bruce Momjian wrote:
Now, I know many will complain that we are returning commit while not
having the stuff on the platter.
You're definitely right there.
Maybe they do, but it seems
the benefit of grouped fsyncs() is large enough that many will say they
would rather have this option.
I'd prefer to wait for a lock manager that supports timeouts and contention
notification.
There is one more thing. Even though the kernel says the data is on the
platter, it still may not be there. Some OS's may return from fsync
when the data is _queued_ to the disk, rather than actually waiting for
the drive return code to say it completed. Second, some disks report
back that the data is on the disk when it is actually in the disk memory
buffer, not really on the disk.
Basically, I am not sure how much we lose by doing the delay after
returning COMMIT, and I know we gain quite a bit by enabling us to group
fsync calls.
--
Bruce Momjian
On Fri, Feb 23, 2001 at 09:05:20PM -0500, Bruce Momjian wrote:
It seems to me that comparing various choices (and other, more
interesting, algorithms) to the N=1 case would be more productive
than comparing them to the N=0 case, so releasing at N=1 would yield
better statistics for actually tuning in 7.2.
We don't release code because it has better tuning opportunities for
later releases. What we can do is give people parameters where the
default is safe, and they can play and report to us.
Perhaps I misunderstood. I had perceived N=1 as a conservative choice
that was nevertheless preferable to N=0.
I think zero delay is the conservative choice at this point, unless we
hear otherwise from testers.
I see, I had it backwards: N=0 corresponds to "always delay", and
N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
where M is the number of backends, or the number of backends with begun
transactions, or something. N=10 would be conservative (and maybe
pointless) just because it would hardly ever trigger a delay.
Nathan Myers
ncm@zembu.com
At 23:14 23/02/01 -0500, Bruce Momjian wrote:
There is one more thing. Even though the kernel says the data is on the
platter, it still may not be there.
This is true, but it does not mean we should say 'the disk is slightly
unreliable, so we can be too'. Also, IIRC, the last time this was
discussed, someone commented that buying expensive disks and a UPS gets you
reliability (barring a direct lightning strike) - it had something to do
with write-ordering and hardware caches. In any case, I'd hate to see DB
design decisions based closely on hardware capability. At least two of my
customers use high performance ram disks for databases - do these also
suffer from 'flush is not really flush' problems?
Basically, I am not sure how much we lose by doing the delay after
returning COMMIT, and I know we gain quite a bit by enabling us to group
fsync calls.
If included, this should be an option only, and not the default option. In
fact I'd quite like to see such a feature, although I'd not only do a
'flush every X ms', but I'd also do a 'flush every X transactions' - this
way a DBA can say 'I don't mind losing the last 20 TXs in a crash'. Bear in
mind that on a fast system, 20ms is a lot of transactions.
----------------------------------------------------------------
Philip Warner
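The two triggers suggested above ("flush every X ms" and "flush every X transactions") could be combined roughly as in the following C sketch. This is an illustration of the proposed option only, not of any existing PostgreSQL behaviour; FlushPolicy, FlushState and flush_due are hypothetical names, and timestamps are assumed to be plain millisecond counters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical DBA-chosen limits for a deferred-flush option. */
typedef struct FlushPolicy
{
    int64_t max_age_ms;   /* flush once the oldest unflushed commit is this old */
    int     max_pending;  /* ...or once this many commits are unflushed */
} FlushPolicy;

/* Hypothetical bookkeeping for commits reported but not yet flushed. */
typedef struct FlushState
{
    int64_t oldest_unflushed_ms;  /* time of the oldest unflushed commit */
    int     pending;              /* number of unflushed commits */
} FlushState;

/* Return true when either limit has been reached. */
static bool
flush_due(const FlushPolicy *p, const FlushState *s, int64_t now_ms)
{
    if (s->pending == 0)
        return false;   /* nothing to flush */
    if (s->pending >= p->max_pending)
        return true;    /* transaction-count trigger */
    return now_ms - s->oldest_unflushed_ms >= p->max_age_ms;  /* time trigger */
}
```

With both limits set to 20, a burst of 20 commits forces a flush immediately, while a lone commit waits at most 20 ms; the count trigger is what bounds the loss on a fast system.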
At 23:14 23/02/01 -0500, Bruce Momjian wrote:
There is one more thing. Even though the kernel says the data is on the
platter, it still may not be there.
This is true, but it does not mean we should say 'the disk is slightly
unreliable, so we can be too'. Also, IIRC, the last time this was
discussed, someone commented that buying expensive disks and a UPS gets you
reliability (barring a direct lightning strike) - it had something to do
with write-ordering and hardware caches. In any case, I'd hate to see DB
design decisions based closely on hardware capability. At least two of my
customers use high performance ram disks for databases - do these also
suffer from 'flush is not really flush' problems?
Well, I am saying we are being pretty rigid here when we may be on top
of a system that is not, meaning that our rigidity is buying us little.
Basically, I am not sure how much we lose by doing the delay after
returning COMMIT, and I know we gain quite a bit by enabling us to group
fsync calls.
If included, this should be an option only, and not the default option. In
fact I'd quite like to see such a feature, although I'd not only do a
'flush every X ms', but I'd also do a 'flush every X transactions' - this
way a DBA can say 'I don't mind losing the last 20 TXs in a crash'. Bear in
mind that on a fast system, 20ms is a lot of transactions.
Yes, I can see this as a good option for many users. My old complaint
was that we allowed only two very extreme options, fsync() all the time,
or fsync() never and recover from a crash.
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
My idea would be to let committing backends return "COMMIT" to the user,
and set a need_fsync flag that is guaranteed to cause an fsync within X
milliseconds. This way, if other backends commit in the next X
milliseconds, they can all use one fsync().
Guaranteed by what? We have no mechanism available to make an fsync
happen while the backend is waiting for input.
Now, I know many will complain that we are returning commit while not
having the stuff on the platter.
I think that's unacceptable on its face. A remote client may take
action on the basis that COMMIT was returned. If the server then
crashes, the client is unlikely to realize this for some time (certainly
at least one TCP timeout interval). It won't look like a "milliseconds
later" situation to that client. In fact, the client might *never*
realize there was a problem; what if it disconnects after getting the
COMMIT?
If the dbadmin thinks he doesn't need fsync before commit, he'll likely
be running with fsync off anyway. For the ones who do think they need
fsync, I don't believe that we get to rearrange the fsync to occur after
commit.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
My idea would be to let committing backends return "COMMIT" to the user,
and set a need_fsync flag that is guaranteed to cause an fsync within X
milliseconds. This way, if other backends commit in the next X
milliseconds, they can all use one fsync().
Guaranteed by what? We have no mechanism available to make an fsync
happen while the backend is waiting for input.
We would need a separate binary that can look at shared memory and fsync
if someone requested it. Again, nothing for 7.1.X.
Now, I know many will complain that we are returning commit while not
having the stuff on the platter.
I think that's unacceptable on its face. A remote client may take
action on the basis that COMMIT was returned. If the server then
crashes, the client is unlikely to realize this for some time (certainly
at least one TCP timeout interval). It won't look like a "milliseconds
later" situation to that client. In fact, the client might *never*
realize there was a problem; what if it disconnects after getting the
COMMIT?
If the dbadmin thinks he doesn't need fsync before commit, he'll likely
be running with fsync off anyway. For the ones who do think they need
fsync, I don't believe that we get to rearrange the fsync to occur after
commit.
I can see someone wanting some fsync, but not take the hit. My argument
is that having this ability, there would be no need to turn off fsync.
--
Bruce Momjian
Philip Warner <pjw@rhyme.com.au> writes:
It may have been much earlier in the debate, but has anyone checked to see
what the maximum possible gains might be - or is it self-evident to people
who know the code?
fsync off provides an upper bound to the speed achievable from being
smarter about when to fsync... I doubt that fsync-once-per-checkpoint
would be much different.
regards, tom lane
Preliminary results from experimenting with an
N-transactions-must-be-running-to-cause-commit-delay heuristic are
attached. It seems to be a pretty definite win. I'm currently running
a more extensive set of cases on another machine for comparison.
The test case is pgbench, unmodified, but run at scalefactor 10
to reduce write contention on the 'branch' rows. Postmaster
parameters are -N 100 -B 1024 in all cases. The fsync-off (with,
of course, no commit delay either) case is shown for comparison.
"commit siblings" is the number of other backends that must be
running active (unblocked, at least one XLOG entry made) transactions
before we will do a precommit delay.
commit delay=1 is effectively commit delay=10000 (10msec) on this
hardware. Interestingly, it seems that we can push the delay up
to two or three clock ticks without degradation, given positive N.
regards, tom lane
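The remark that commit delay=1 behaves like commit delay=10000 can be modelled with a little arithmetic, shown here as a C sketch. It assumes the sleep is rounded up to a whole number of clock ticks and takes the 10 msec tick from the message above; effective_delay_us is a hypothetical helper, and real kernels may round differently.

```c
/* Round a requested delay (in microseconds) up to a whole number of
 * clock ticks. Assumption: the OS cannot sleep for less than one tick. */
static long
effective_delay_us(long requested_us, long tick_us)
{
    if (requested_us <= 0)
        return 0;  /* no delay requested */
    return ((requested_us + tick_us - 1) / tick_us) * tick_us;
}
```

Under this model, with a 10000 usec tick, any setting from 1 to 10000 sleeps one tick, and "two or three clock ticks" corresponds to settings of 20000 and 30000.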
ncm@zembu.com (Nathan Myers) writes:
I see, I had it backwards: N=0 corresponds to "always delay", and
N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
where M is the number of backends, or the number of backends with begun
transactions, or something. N=10 would be conservative (and maybe
pointless) just because it would hardly ever trigger a delay.
Why is N=1 not interesting? That requires at least one other backend
to be in a transaction before you'll delay. That would seem to be
the minimum useful value --- N=0 (always delay) seems clearly to be
too stupid to be useful.
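To put numbers on the candidate thresholds in the quoted message, here is a small sketch. The function name, the use of the natural logarithm, and rounding down are all assumptions, since the quoted text does not specify them.

```python
import math


def candidate_thresholds(m):
    """Candidate values of N for M backends, rounded down to integers."""
    return {
        "M/2": m // 2,
        "sqrt(M)": math.isqrt(m),
        "log(M)": int(math.log(m)) if m > 0 else 0,  # natural log assumed
    }
```

For M=100 (the -N 100 limit used in these tests) this gives 50, 10 and 4 respectively, all well above N=1.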
regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes:
It may have been much earlier in the debate, but has anyone checked to see
what the maximum possible gains might be - or is it self-evident to people
who know the code?
fsync off provides an upper bound to the speed achievable from being
smarter about when to fsync... I doubt that fsync-once-per-checkpoint
would be much different.
That was my point, people should be doing fsync once per checkpoint
rather than never.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sat, Feb 24, 2001 at 01:07:17AM -0500, Tom Lane wrote:
ncm@zembu.com (Nathan Myers) writes:
I see, I had it backwards: N=0 corresponds to "always delay", and
N=infinity (~0) is "never delay", or what you call zero delay. N=1 is
not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting,
where M is the number of backends, or the number of backends with begun
transactions, or something. N=10 would be conservative (and maybe
pointless) just because it would hardly ever trigger a delay.
Why is N=1 not interesting? That requires at least one other backend
to be in a transaction before you'll delay. That would seem to be
the minimum useful value --- N=0 (always delay) seems clearly to be
too stupid to be useful.
N=1 seems arbitrarily aggressive. It assumes any open transaction will
commit within a few milliseconds; otherwise the delay is wasted. On a
fairly busy system, it seems to me to impose a strict upper limit on
transaction rate for any client, regardless of actual system I/O load.
(N=0 would impose that strict upper limit even for a single client.)
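A rough worked example of that upper limit, assuming every commit by a client pays the full delay and the client issues its transactions strictly one after another. The function name is hypothetical; the 10 msec figure is the effective delay reported earlier in this thread.

```python
def max_commits_per_second(delay_s, work_s=0.0):
    """Upper bound on one client's commit rate when each commit waits delay_s.

    work_s is the time the transaction itself takes, apart from the delay.
    """
    per_txn = delay_s + work_s
    if per_txn <= 0:
        raise ValueError("delay_s + work_s must be positive")
    return 1.0 / per_txn
```

With a 10 msec delay and negligible work, a single client cannot exceed 100 commits per second; if the transaction itself also takes 10 msec, the bound drops to 50.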
Delaying isn't free, because it means that the client can't turn around
and do even a cheap query for a while. In a sense, when you delay you are
charging the committer a tax to try to improve overall throughput. If the
delay lets you reduce I/O churn enough to increase the total bandwidth,
then it was worthwhile; if not, you just cut system performance, and
responsiveness to each client, for nothing.
The above suggests that maybe N should depend on recent disk I/O activity,
so you get a larger N (and thus less likely delay and more certain payoff)
for a more lightly-loaded system. On a system that has maxed its I/O
bandwidth, clients will suffer delays anyhow, so they might as well
suffer controlled delays that result in better total throughput. On a
lightly-loaded system there's no need, or payoff, for such throttling.
Can we measure disk system load by averaging the times taken for fsyncs?
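One way the measurement might look, as a sketch only: an exponentially weighted moving average of observed fsync durations. The class name, the function name and the smoothing factor are hypothetical, and how the average would be mapped to a value of N is left open.

```python
import os
import time


class FsyncLoadEstimator:
    """Exponentially weighted moving average of fsync durations."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha      # weight given to the newest sample
        self.average_s = None   # no samples seen yet

    def record(self, duration_s):
        """Fold one observed duration into the running average."""
        if self.average_s is None:
            self.average_s = duration_s
        else:
            self.average_s += self.alpha * (duration_s - self.average_s)
        return self.average_s


def timed_fsync(fd, estimator):
    """fsync the descriptor and feed the elapsed time to the estimator."""
    start = time.perf_counter()
    os.fsync(fd)
    return estimator.record(time.perf_counter() - start)
```

A moving average rather than a plain mean lets the estimate follow changes in load, at the cost of one more parameter to tune.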
Nathan Myers
ncm@zembu.com
Attached are graphs from more thorough runs of pgbench with a commit
delay that occurs only when at least N other backends are running active
transactions.
My initial try at this proved to be too noisy to tell much. The noise
seems to be coming from WAL checkpoints that occur during a run and
push down the reported TPS value for the particular case that's running.
While we'd need to include WAL checkpoints to make an honest performance
comparison against another RDBMS, I think they are best ignored for the
purpose of figuring out what the commit-delay behavior ought to be.
Accordingly, I modified my test script to minimize the occurrence of
checkpoint activity during runs (see attached script). There are still
some data points that are unexpectedly low compared to their neighbors;
presumably these were affected by checkpoints or other system activity.
It's not entirely clear what set of parameters is best, but it is
absolutely clear that a flat zero-commit-delay policy is NOT best.
The test conditions are postmaster options -N 100 -B 1024, pgbench scale
factor 10, pgbench -t (transactions per client) 100. (Hence the results
for a single client rely on only 100 transactions, and are pretty noisy.
The noise level should decrease as the number of clients increases.)
Comments anyone?
regards, tom lane
Attachments:
hppabench.gif (image/gif)