Reducing sema usage (was Postmaster dies with many child processes)
I said:
Another thing we ought to look at is changing the use of semaphores so
that Postgres uses a fixed number of semaphores, not a number that
increases as more and more backends are started. Kernels are
traditionally configured with very low limits for the SysV IPC
resources, so having a big appetite for semaphores is a Bad Thing.
I've been looking into this issue today, and it looks possible but messy.
The source of the problem is the lock manager
(src/backend/storage/lmgr/proc.c), which wants to be able to wake up a
specific process that is blocked on a lock. I had first thought that it
would be OK to wake up any one of the processes waiting for a lock, but
after looking at the lock manager that seems a bad idea --- considerable
thought has gone into the queuing order of waiting processes, and we
don't want to give that up. So we need to preserve this ability.
The way it's currently done is that each extant backend has its own
SysV-style semaphore, and when you want to wake up a particular backend
you just V() its semaphore. (BTW, the semaphores get allocated in
chunks of 16, so an out-of-semaphores condition will always occur when
trying to start the 16*N+1'th backend...) This is simple and reliable
but fails if you want to have more backends than the kernel has SysV
semaphores. Unfortunately kernels are usually configured with not
very many semaphores --- 64 or so is typical. Also, running the system
down to nearly zero free semaphores is likely to cause problems for
other subsystems even if Postgres itself doesn't run out.
What seems practical to do instead is this:
* At postmaster startup, allocate a fixed number of semaphores for
use by all child backends. ("Fixed" can really mean "configurable",
of course, but the point is we won't ask for more later.)
* The semaphores aren't dedicated to use by particular backends.
Rather, when a backend needs to block, it finds a currently free
semaphore and grabs it for the duration of its wait. The number
of the semaphore a backend is using to wait with would be recorded
in its PROC struct, and we'd also need an array of per-sema data
to keep track of free and in-use semaphores.
* This works with very little extra overhead until we have more
simultaneously-blocked backends than we have semaphores. When that
happens (which we hope is really seldom), we overload semaphores ---
that is, we use the same sema to block two or more backends. Then
the V() operation by the lock's releaser might wake the wrong backend.
So, we need an extra field in the LOCK struct to identify the intended
wake-ee. When a backend is released in ProcSleep, it has to look at
the lock it is waiting on to see if it is supposed to be wakened
right now. If not, it V()s its shared semaphore a second time (to
release the intended wakee), then P()s the semaphore again to go
back to sleep itself. There probably has to be a delay in here,
to ensure that the intended wakee gets woken and we don't have its
bed-mates indefinitely trading wakeups among the wrong processes.
This is why we don't want this scenario happening often.
I think this could be made to work, but it would be a delicate and
hard-to-test change in what is already pretty subtle code.
A considerably more straightforward approach is just to forget about
incremental allocation of semaphores and grab all we could need at
postmaster startup. ("OK, Mac, you told me to allow up to N backends?
Fine, I'm going to grab N semaphores at startup, and if I can't get them
I won't play.") This would force the DB admin to either reconfigure the
kernel or reduce MaxBackendId to something the kernel can support right
off the bat, rather than allowing the problem to lurk undetected until
too many clients are started simultaneously. (Note there are still
potential gotchas with running out of processes, swap space, or file
table slots, so we wouldn't have really guaranteed that N backends can
be started safely.)
If we make MaxBackendId settable from a postmaster command-line switch
then this second approach is probably not too inconvenient, though it
surely isn't pretty.
Any thoughts about which way to jump? I'm sort of inclined to take
the simpler approach myself...
regards, tom lane
Semaphores are hard enough without overloading them. I say just grab
them on startup. They are cheap. Many databases use semaphores for
every row/page they lock, and boy, that can be a lot of semaphores. We
are only getting a few.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sat, 30 Jan 1999, Tom Lane wrote:
I'm inclined to agree...get rid of the 'hard coded' max, make it a
settable option at run time, and 'reserve the semaphores' on startup...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Bruce Momjian <maillist@candle.pha.pa.us> writes:
A further thought: we could leave the semaphore management as-is,
and instead try to make running out of semaphores a less catastrophic
failure.
If they asked for 64 backends, we better be able to give them to them,
and not crash or fail under a load. 64 semaphores is nothing.
That argument would be pretty convincing if pre-grabbing the semaphores
was sufficient to ensure we could start N backends, but of course it's
not sufficient. The system could also run out of processes or file
descriptors, and I doubt that it's reasonable to grab all of those
instantly at postmaster startup.
The consensus seems clear not to go for the complex solution I described
at first. But I'm still vacillating whether to do pre-reservation of
semaphores or just fix the postmaster to reject a connection cleanly if
no more can be gotten. An advantage of the latter is that it would more
readily support on-the-fly changes of the max backend limit. (Which I
am *not* proposing to support now; I only plan to make it settable at
postmaster startup; but someday we might want to change it on the fly.)
regards, tom lane
Tom Lane wrote:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
If they asked for 64 backends, we better be able to give them to them,
and not crash or fail under a load. 64 semaphores is nothing.
That argument would be pretty convincing if pre-grabbing the semaphores
was sufficient to ensure we could start N backends, but of course it's
not sufficient. The system could also run out of processes or file
descriptors, and I doubt that it's reasonable to grab all of those
instantly at postmaster startup.
The major problem at the moment is not that a new backend fails, but
that it brings down everything else with it. How about having a new
backend set a one-byte flag in shared memory when it has
finished setting itself up? as long as the flag is unset, the
backend is still starting itself up, and a failure will not require
other backends to be brought down.
--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
"Jesus saith unto him, I am the way, the truth, and the
life; no man cometh unto the Father, but by me."
John 14:6
"Oliver Elphick" <olly@lfix.co.uk> writes:
The major problem at the moment is not that a new backend fails, but
that it brings down everything else with it.
Agreed.
How about having a new backend set a one-byte flag in shared memory
when it has finished setting itself up? as long as the flag is unset,
the backend is still starting itself up, and a failure will not
require other backends to be brought down.
Not much win to be had there, I suspect. The main problem is that as
soon as a new backend starts altering shared memory, you have potential
corruption issues to worry about if it goes down. And there's not
really very much the new backend can do before it alters shared memory.
In fact, it can't do much of *anything* until it's made an entry for
itself in the lock manager's PROC array, because it cannot find out
anything interesting without locking shared structures.
Hmm. If that's true, then the failure to get a sema would occur very
early in the new backend's lifetime, before it's had a chance to create
any trouble. Maybe the very easiest solution to the sema issue is to
make the new backend send a failure report to its client and then
exit(0) instead of exit(1), so that the postmaster considers it a clean
exit rather than a crash...
regards, tom lane
Tom Lane wrote:
I said:
Another thing we ought to look at is changing the use of semaphores so
that Postgres uses a fixed number of semaphores, not a number that
increases as more and more backends are started. Kernels are
traditionally configured with very low limits for the SysV IPC
resources, so having a big appetite for semaphores is a Bad Thing.
...
Any thoughts about which way to jump? I'm sort of inclined to take
the simpler approach myself...
Could we use sigpause (or something like this) to block
and some signal to wake up?
Vadim
Hmm. If that's true, then the failure to get a sema would occur very
early in the new backend's lifetime, before it's had a chance to
create any trouble. Maybe the very easiest solution to the sema issue
is to make the new backend send a failure report to its client and
then exit(0) instead of exit(1), so that the postmaster considers it a
clean exit rather than a crash...
Sounds like the cleanest solution too. If it pans out, I like it...
- Tom