AW: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
I have to agree with Alfred here: this does not sound like a feature,
it sounds like a horrid hack. You're giving up *all* consistency
guarantees for a performance gain that is really going to be pretty
minimal in the WAL context.
The "buffered log" still guarantees consistency within the database.
The only problem is that the client was told its transaction had
committed, but the whole transaction may be lost if the machine crashes
hard.
This is a point where a lot of people have a false impression of safety.
To actually guarantee that a committed transaction is safe, at least
two separate machines with an independent set of disks on two different
locations with synchronous replication (of at least the log) are needed.
In all other situations there is still too high a risk of losing committed
transactions.
In my experience, telling your programmers and clients that a few of
their latest transactions are not guaranteed to be safe leads to much
higher robustness overall. There is nothing worse than a false
feeling of safety.
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
This does not address the performance aspect with the same impact as
"buffered log", since if you tune that time up, every commit has to wait longer.
There is a tradeoff between the wait time and the time fsync needs.
Thus we are talking about rather short times in the milliseconds here.
Andreas
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
Vadim
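The shared-fsync scheme described in this exchange can be sketched in C
roughly as follows. This is only an illustrative model, not the code in
xact.c or xlog.c: the WalState struct, its fields, and both function
names are hypothetical, log positions are plain integers, and the real
fsync() call is replaced by a counter so that the control flow can be
followed in isolation. The pre-fsync delay is assumed to have already
elapsed when commit_flush_if_needed() is called.

```c
#include <assert.h>

/* Hypothetical model of the log: positions are byte offsets. */
typedef struct WalState {
    unsigned long written;   /* end of data written to the log file */
    unsigned long flushed;   /* end of data known to be fsynced     */
    unsigned int  fsyncs;    /* number of simulated fsync calls     */
} WalState;

/* A backend appends its commit record and learns where it ends. */
static unsigned long wal_append_commit(WalState *wal, unsigned long len)
{
    wal->written += len;
    return wal->written;
}

/*
 * Called by a backend after its pre-fsync delay. If another backend
 * has already flushed past our commit record, there is nothing to do;
 * otherwise flush everything written so far (stand-in for fsync()).
 * Returns 1 if this backend performed the flush, 0 if it was spared.
 */
static int commit_flush_if_needed(WalState *wal, unsigned long my_commit_end)
{
    assert(my_commit_end <= wal->written);
    if (wal->flushed >= my_commit_end)
        return 0;
    wal->flushed = wal->written;
    wal->fsyncs++;
    return 1;
}
```

If three backends each append a commit record during the delay window,
whichever wakes first flushes all three records and the other two skip
their fsync. In this model the commit would be reported to the client
only after commit_flush_if_needed() returns.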
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
But it returns _completed_ to the client before sleeping, right?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
But it returns _completed_ to the client before sleeping, right?
No.
Vadim
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
Should definitely make that tuneable (per installation is imho sufficient),
no use in waiting if the dba knows there is only very little concurrency.
IIRC DB/2 defaults to not using this "commit pooling".
Andreas
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
But it returns _completed_ to the client before sleeping, right?
No.
Eww, so we have this 1/200 second delay for every transaction. Seems
bad to me.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:
Earlier, Vadim was talking about arranging to share fsyncs of the WAL
log file across transactions (after writing your commit record to the
log, sleep a few milliseconds to see if anyone else fsyncs before you
do; if not, issue the fsync yourself). That would offer less-than-
one-fsync-per-transaction performance without giving up any
guarantees.
If people feel a compulsion to have a tunable parameter, let 'em tune
the length of the pre-fsync sleep ...
Already implemented (without ability to tune this parameter -
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.
But it returns _completed_ to the client before sleeping, right?
No.
Eww, so we have this 1/200 second delay for every transaction. Seems
bad to me.
I think as long as it becomes a tunable this isn't a bad idea at
all. Fixing it at 1/200 isn't so great because people not wrapping
large amounts of inserts/updates with transaction blocks will
suffer.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote:
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:
Eww, so we have this 1/200 second delay for every transaction. Seems
bad to me.
I think as long as it becomes a tunable this isn't a bad idea at
all. Fixing it at 1/200 isn't so great because people not wrapping
large amounts of inserts/updates with transaction blocks will
suffer.
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote:
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:
Eww, so we have this 1/200 second delay for every transaction. Seems
bad to me.
I think as long as it becomes a tunable this isn't a bad idea at
all. Fixing it at 1/200 isn't so great because people not wrapping
large amounts of inserts/updates with transaction blocks will
suffer.
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
This sounds like an interesting approach, yes.
- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.
* Don Baccus <dhogaza@pacifier.com> [001116 13:46]:
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control?
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Larry
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
This sounds like an interesting approach, yes.
In OS kernel design, you try to avoid process herding bottlenecks.
Here, we want them herded, and giving up the CPU may be the best way to
do it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Don Baccus <dhogaza@pacifier.com> [001116 13:46]:
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control?
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control?
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.
BUT, do we know for sure that sleep(0) is not optimized in the library
to just return?
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 11:59] wrote:
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).
I just talked to Tom Lane about this. I think a sleep(0) just before
the flush would be the best. It would relinquish the cpu slice if
another process is ready to run. If no other backend is running, it
probably just returns. If there is another one, it gives it a chance to
complete. On return from sleep(0), it can check if it still needs to
flush. This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.
This sounds like an interesting approach, yes.
In OS kernel design, you try to avoid process herding bottlenecks.
Here, we want them herded, and giving up the CPU may be the best way to
do it.
Yes, but if everyone yields you're back where you started, and with
128 or more backends do you really want to cause possibly that many
context switches per fsync?
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
In OS kernel design, you try to avoid process herding bottlenecks.
Here, we want them herded, and giving up the CPU may be the best way to
do it.
Yes, but if everyone yields you're back where you started, and with
128 or more backends do you really want to cause possibly that many
context switches per fsync?
You are going to make a kernel call and yield anyway to fsync, so why
not try yielding first? If someone else does the fsync in the meantime,
we don't need to do it. I am suggesting re-checking the need for fsync
after the return from sleep(0).
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Larry Rosenman <ler@lerctr.org> [001116 12:09] wrote:
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control?
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.
BUT, do we know for sure that sleep(0) is not optimized in the library
to just return?
sleep(3) should conform to POSIX specification, if anyone has the
reference they can check it to see what the effect of sleep(0)
should be.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control?
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.
BUT, do we know for sure that sleep(0) is not optimized in the library
to just return?
We can only do our best here. I think guessing whether other backends
are _about_ to commit is pretty shaky, and sleeping every time is a
waste. This seems the cleanest.
Funny you should mention the optimization. I just checked BSDI and saw:
    u_int
    sleep(secs)
        u_int secs;
    {
        struct timeval nt, ot;
        long diff;
        int rc;

        if (secs == 0)
            return (0);
So maybe we need another _fake_ kernel call, or a select/usleep with a
very small value.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
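The "select/usleep with a very small value" alternative mentioned above
might look something like the following. This is a minimal sketch, not
code from the PostgreSQL tree; short_delay_usec() is a hypothetical
name. It calls select() with no file descriptors and a non-zero
timeout, which is a real system call rather than a library shortcut.
The actual delay is rounded by the kernel's timer resolution, so very
small values may sleep longer than requested.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Sleep for roughly `usec` microseconds using select() with no file
 * descriptors. Returns 0 when the timeout expires and -1 on error
 * (for example, if interrupted by a signal).
 */
static int short_delay_usec(long usec)
{
    struct timeval tv;

    assert(usec >= 0);
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, NULL, NULL, NULL, &tv);
}
```

A caller would invoke short_delay_usec() just before the flush and then
re-check whether the flush is still needed.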
Bruce Momjian writes:
The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...
Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.
In glibc, sleep(0) just does "return 0;", so if the compiler has a good
day the call will disappear completely.
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
BUT, do we know for sure that sleep(0) is not optimized in
the library to just return?
We can only do our best here. I think guessing whether other backends
are _about_ to commit is pretty shaky, and sleeping every time is a
waste. This seems the cleanest.
A long time ago you, Bruce, gave me a gift: a book about transaction
processing (thanks again :-)). Sleeping before the fsync at commit is
described there as a standard technique. And the reason is clear.
Man, the cost of fsync is very high! { write (64 bytes) + fsync() }
takes ~ 1/50 sec. Yes, an additional 1/200 sec or so results in worse
performance when there is only one backend running, but it greatly
increases overall performance for 100 simultaneous backends. I.e., this
delay is a trade-off to gain better scalability.
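The arithmetic behind this trade-off can be worked through with a small
C sketch. It is an idealized model under stated assumptions, not a
measurement: both function names are hypothetical, the no-delay case is
assumed to run one fsync per commit strictly back to back, and the
delayed case is assumed to catch every backend's commit record within
the delay window so that all of them share one fsync.

```c
#include <assert.h>

/*
 * Idealized commits per second when `backends` commits share a single
 * fsync after waiting `delay_usec`. Assumes, for illustration only,
 * that every commit record arrives within the delay window.
 */
static long grouped_commits_per_sec(long backends, long delay_usec,
                                    long fsync_usec)
{
    assert(backends > 0 && delay_usec >= 0 && fsync_usec > 0);
    return backends * 1000000L / (delay_usec + fsync_usec);
}

/* Commits per second when each commit pays for its own fsync in turn. */
static long serial_commits_per_sec(long fsync_usec)
{
    assert(fsync_usec > 0);
    return 1000000L / fsync_usec;
}
```

With the figures quoted here (about 20000 usec per write-plus-fsync and
a 5000 usec delay), a single backend drops from 50 to 40 commits per
second, while 100 backends sharing one flush would reach 4000 per
second in this model, against 50 if their fsyncs ran one after another.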
I agree that it must be configurable, smaller or probably 0 by
default, that it should use the approximate # of simultaneously running
backends for guessing (postmaster could maintain this number in shmem
and backends could just read it without any locking - the exact number
is not required), and that it should be well described as a tuning
parameter in the documentation.
Anyway, I object to sleep(0).
Vadim
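One way the "approximate number of running backends" heuristic could be
expressed is sketched below. Everything here is hypothetical and for
illustration only: effective_commit_delay(), the min_backends
threshold, and the delay_usec tunable are invented names, and the
backend count is assumed to be a possibly stale value read from shared
memory without locking.

```c
#include <assert.h>

/*
 * Decide how long to sleep before the commit fsync.
 * active_backends: approximate count read from shared memory without
 *                  locking (a slightly stale value is acceptable).
 * min_backends:    hypothetical threshold below which waiting is
 *                  unlikely to let anyone share the fsync.
 * delay_usec:      hypothetical installation-wide tunable; 0 disables
 *                  the delay entirely.
 */
static long effective_commit_delay(int active_backends, int min_backends,
                                   long delay_usec)
{
    assert(delay_usec >= 0);
    if (delay_usec == 0 || active_backends < min_backends)
        return 0;
    return delay_usec;
}
```

With a default of 0 the single-backend case pays no penalty, and a DBA
who expects high concurrency can raise the delay.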