Postmaster dies with many child processes (spinlock/semget failed)
Hi,
I sent the following message to the pgsql-general
list on the 24th but haven't received any answers
from PostgreSQL developers, only from other people
who are experiencing the same problems.
I would say the errors I am describing are quite
serious and I was wondering whether there was any
chance of them being addressed in the forthcoming
6.5 release.
The problem is very easy to reproduce - here are
the necessary steps:
1. Install PostgreSQL 6.4.2
2. Install Perl 5.005_02
3. Install Perl modules: DBI 1.06; DBD-Pg 0.90; ApacheDBI-0.81
4. Download Apache 1.3.4
5. Download mod_perl 1.17+ into the same directory
6. Extract the distributions
7. cd mod_perl-1.17
8. perl Makefile.PL EVERYTHING=1 && make && make test && make install
9. Set the following directives in Apache's httpd.conf:
MinSpareServers 100
MaxSpareServers 100
StartServers 100
MaxClients 100
10. PerlRequire /usr/local/apache/conf/startup.pl where
startup.pl contains:
use Apache::Registry ();
use Apache::DBI ();
Apache::DBI->connect_on_init("DBI:Pg:dbname=template1", "", "");
1;
11. Start Apache: apachectl start
Note that this example makes use of no custom
application code and is using the template1
database.
Check Apache's error_log and you will see error
messages and eventually the postmaster will die
with something like:
FATAL: s_lock(28001065) at spin.c:125, stuck spinlock. Aborting.
The magic number seems to be 48. If I start 49
httpd/postgres processes everything falls apart
but if I start 48 everything is fine. I'm
running on FreeBSD 2.2.8 and I've increased
maxusers to 512 - no difference.
I'd appreciate some feedback from the guys who
are making PostgreSQL happen. Can these issues
be addressed? PostgreSQL is a great database but
this is a show stopper for people developing big
Web applications.
If you need any more information don't hesitate
to contact me.
Cheers.
Patrick
--
Sent to pgsql-general list on January 24th 1999:
Hi,
I've been doing some benchmarking with PostgreSQL
under mod_perl and I've been getting some rather
disturbing results. To achieve the maximum benefit
from persistent connections I am using a method
called 'connect_on_init' that comes with a Perl
module called Apache::DBI. Using this method,
each child process establishes a persistent
connection with the database when the Web server
is first started. When using PostgreSQL as the
database, this causes there to be as many
'postgres' processes as there are 'httpd'
processes for a given database.
As part of my benchmarking I've been testing the
number of httpd processes that my server can
support. The machine is a 450 MHz PII/256 MB RAM.
As an exercise I tried to start 100 httpd
processes. Doing this consistently results in the
following PostgreSQL errors and the backend usually
dies:
IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
FATAL: s_lock(28001065) at spin.c:125, stuck spinlock. Aborting.
Note that the 'no space left on device' is
misleading as there is a minimum of 400 MB
available on each file-system on the server.
This is obviously bad news, especially as we are
hoping to develop some fairly large-scale
applications with PostgreSQL. Note that this
happens when connecting to a single database.
We were hoping to connect to several databases
from each httpd process!!
The frustrating thing is we have the resources.
If I only start 30 processes (which seems to be
the approximate limit) there is about 100 MB
of RAM that is not being used.
Are there any configuration values that control
the number of postgres processes? Do you have
any idea why this is happening?
Is anyone else using Apache/mod_perl and PostgreSQL
successfully in a demanding environment?
Any help would be greatly appreciated.
Cheers.
Patrick
--
#===============================#
\ KAN Design & Publishing Ltd /
/ T: +44 (0)1223 511134 \
\ F: +44 (0)1223 571968 /
/ E: mailto:patrick@kan.co.uk \
\ W: http://www.kan.co.uk /
#===============================#
On Thu, 28 Jan 1999, Patrick Verdon wrote:

> IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
> NOTICE: Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.
> FATAL: s_lock(28001065) at spin.c:125, stuck spinlock. Aborting.
>
> Note that the 'no space left on device' is
> misleading as there is a minimum of 400 MB
> available on each file-system on the server.

My first guess is that you don't have enough semaphores enabled in your
kernel...increase that from the default, and I'm *guessing* that you'll
get past your 48...
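(If it helps make sense of the magic number: the failing semget call asks
for sets of num=16 semaphores, and the traditional BSD default is
SEMMNS=60 semaphores system-wide. Only three full sets of 16 fit into 60,
and 3 x 16 = 48 - which would explain exactly why the 49th connection is
the one that dies. I haven't verified the 2.2.8 defaults, so treat this
as a plausible guess.)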
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Patrick Verdon wrote:

> Check Apache's error_log and you will see error
> messages and eventually the postmaster will die
> with something like:
>
> FATAL: s_lock(28001065) at spin.c:125, stuck spinlock. Aborting.
Try to increase S_MAX_BUSY in src/backend/storage/buffer/s_lock.c:
#define S_MAX_BUSY 500 * S_NSPINCYCLE
^^^
try with 10000.
Vadim
I don't think this is a Postgres problem. I got the same
problem you described when upgrading Apache from 1.3.3 to 1.3.4,
and I had to return to 1.3.3.
I will probably try mod_perl 1.18 + Apache 1.3.4.
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Tatsuo, Vadim, Oleg, Scrappy,
Many thanks for the response.
A couple of you weren't convinced that this
is a Postgres problem so let me try to clarify
things a little. Maybe the use of
Apache and mod_perl is confusing the issue:
the point I was trying to make is that if
there are 49+ concurrent postgres processes
on a normal machine (i.e. where kernel
parameters are the defaults, etc.) the
postmaster dies in a nasty way with
potentially damaging results.
Here's a case without Apache/mod_perl that
causes exactly the same behaviour. Simply
enter the following 49 times:
kandinsky:patrick> psql template1 &
Note that I tried to automate this without
success:
perl -e 'for ( 1..49 ) { system("/usr/local/pgsql/bin/psql template1 &"); }'
The 49th attempt to initiate a connection
fails:
Connection to database 'template1' failed.
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally before or while processing the request.
and the error_log says:
InitPostgres
IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
proc_exit(3) [#0]
shmem_exit(3) [#0]
exit(3)
/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: CleanupProc: pid 1521 exited with status 768
/usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process 1518
NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
FATAL: s_lock(dfebe065) at spin.c:125, stuck spinlock. Aborting.
FATAL: s_lock(dfebe065) at spin.c:125, stuck spinlock. Aborting.
Even if there is a hard limit there is no way that
Postgres should die in this spectacular fashion.
I wouldn't have said that it was unreasonable for
some large applications to peak at >48 processes
when using powerful hardware with plenty of RAM.
The other point is that even if one had 1 GB RAM,
Postgres won't scale beyond 48 processes, using
probably less than 100 MB of RAM. Would it be
possible to make the 'MaxBackendId' configurable
for those who have the resources?
I have reproduced this behaviour on both
FreeBSD 2.2.8 and Intel Solaris 2.6 using
version 6.4.x of PostgreSQL.
I'll try to change some of the parameters
suggested and see how far I get but the bottom
line is Postgres shouldn't be dying like this.
Let me know if you need any more info.
Cheers.
Patrick
--
#===============================#
\ KAN Design & Publishing Ltd /
/ T: +44 (0)1223 511134 \
\ F: +44 (0)1223 571968 /
/ E: mailto:patrick@kan.co.uk \
\ W: http://www.kan.co.uk /
#===============================#
Patrick Verdon wrote:

> Even if there is a hard limit there is no way that
> Postgres should die in this spectacular fashion.
>
> [snip]
>
> I have reproduced this behaviour on both
> FreeBSD 2.2.8 and Intel Solaris 2.6 using
> version 6.4.x of PostgreSQL.
>
> I'll try to change some of the parameters
> suggested and see how far I get but the bottom
> line is Postgres shouldn't be dying like this.
We definitely need a chapter on tuning Postgres in some of the manuals.
It should contain not only the parameters that one can change in
PostgreSQL - for either better response or for taking a larger load -
but also the ways one can tune the underlying OS, be it Linux, *BSD,
Solaris or whatever.
Even commercial databases (at least Oracle) tend to rebuild the kernel
during installation (observed with Oracle 7.1 on Solaris).
When I once needed the info about setting shared memory limits on
Solaris, I cried out here and got the example lines (I actually had them
already copied from a machine where Oracle was running).
But the same info, and possibly more (increasing the limits for max
files per process/globally, shared memory config, ... whatever else
is needed), seems to be an essential part of setting up a serious DB
server on any system.
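For illustration, such lines look roughly like this (the parameter names
are the standard Solaris /etc/system tunables; the values are made-up
examples, not recommendations):

    * /etc/system fragment: SysV shared memory and semaphore limits
    * (example values only; size them for your workload, then reboot)
    set shmsys:shminfo_shmmax = 16777216
    set shmsys:shminfo_shmmni = 100
    set semsys:seminfo_semmni = 64
    set semsys:seminfo_semmns = 512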
---------------
Hannu
Patrick Verdon <patrick@kan.co.uk> writes:

> the point I was trying to make is that if there are 49+ concurrent
> postgres processes on a normal machine (i.e. where kernel parameters
> are the defaults, etc.) the postmaster dies in a nasty way with
> potentially damaging results.

Right. It looks to me like your problem is running out of SysV
semaphores:

IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
(read the man page for semget(2):
[ENOSPC] A semaphore identifier is to be created, but the
system-imposed limit on the maximum number of
allowed semaphore identifiers system wide would be
exceeded.
Old bad habit of Unix kernel programmers: re-use closest available error
code, rather than deal with the hassle of inventing a new kernel errno.)
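If you want to see the limit from userland, here is a quick probe (a
throwaway sketch, not PostgreSQL code) that keeps requesting
16-semaphore sets - the same set size as in the failing call above -
until semget refuses:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    #define NSEMS   16              /* set size, as in the failing semget call */
    #define MAXSETS 1024            /* arbitrary safety cap for this probe */

    int
    main(void)
    {
        int     ids[MAXSETS];
        int     n;

        /* Keep asking for 16-semaphore sets until the kernel refuses. */
        for (n = 0; n < MAXSETS; n++)
        {
            ids[n] = semget(IPC_PRIVATE, NSEMS, IPC_CREAT | 0600);
            if (ids[n] < 0)
            {
                printf("semget failed after %d sets (%d semaphores): %s\n",
                       n, n * NSEMS, strerror(errno));
                break;
            }
        }

        /* Clean up so we don't leave orphaned IPC objects around. */
        while (--n >= 0)
            semctl(ids[n], 0, IPC_RMID);

        return 0;
    }

Afterwards ipcs(1) should show nothing left over.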
You can increase the kernel's number-of-semaphores parameter (on my box,
both SEMMNI and SEMMNS need to be changed), but it'll probably take a
kernel rebuild to do it.
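On FreeBSD 2.2.x, for instance, the equivalent would be kernel config
options of roughly this shape (illustrative values only, and the exact
option syntax may vary by release), followed by the usual
config/make/reboot cycle:

    options         SYSVSEM
    options         "SEMMNI=64"             # max semaphore sets system-wide
    options         "SEMMNS=512"            # max semaphores system-wide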
> Even if there is a hard limit there is no way that
> Postgres should die in this spectacular fashion.
Well, running out of resources is something that it's hard to guarantee
recovery from. Postgres is designed on the assumption that it's better
to try to prevent corruption of the database than to try to limp along
after a failure --- so the crash recovery behavior is exactly what you
see, mutual mass suicide of all surviving backends. Restarting all your
clients is a pain in the neck, agreed, but would you rather have
database corruption spreading invisibly?
> The other point is that even if one had 1 GB RAM,
> Postgres won't scale beyond 48 processes, using
> probably less than 100 MB of RAM. Would it be
> possible to make the 'MaxBackendId' configurable
> for those who have the resources?
MaxBackendId is 64 by default, so that's not the limit you're hitting.
It should be easier to configure MaxBackendId --- probably it should be
an option to the configure script. I've put this on my personal to-do
list. (I don't think it's a good idea to have *no* upper limit, even
if it were easy to do in the code --- otherwise an unfriendly person
could run you out of memory by starting more and more clients. If he
stops just short of exhausting swap space, then Postgres is perfectly
happy, but all the rest of your system starts misbehaving ... not cool.)
Another thing we ought to look at is changing the use of semaphores so
that Postgres uses a fixed number of semaphores, not a number that
increases as more and more backends are started. Kernels are
traditionally configured with very low limits for the SysV IPC
resources, so having a big appetite for semaphores is a Bad Thing.
Right now it looks like we use a sema per backend to support spinlocks.
Perhaps we could just use a single sema that all backends block on when
waiting for a spinlock? This might be marginally slower, or it might
not, but hopefully one is not blocking on spinlocks too often anyway.
Or, given that the system seems to contain only a small fixed number of
spinlocks, maybe a sema per spinlock would work best.
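To make that concrete, the blocking primitive in question is just a
plain SysV semaphore operation. A minimal sketch (hypothetical names,
not the actual PostgreSQL spinlock code) of the wait/wake pair that the
backends would share:

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* Hypothetical sketch: one shared semaphore that backends block on
     * while waiting for a contended spinlock, instead of one semaphore
     * per backend. */

    static void
    sema_wait(int semid)
    {
        struct sembuf op = {0, -1, 0};  /* decrement; blocks while zero */

        while (semop(semid, &op, 1) < 0 && errno == EINTR)
            continue;                   /* retry if interrupted by a signal */
    }

    static void
    sema_wakeup(int semid)
    {
        struct sembuf op = {0, 1, 0};   /* increment; lets one waiter run */

        semop(semid, &op, 1);
    }

    int
    main(void)
    {
        /* Demo: one semaphore, explicitly initialized to 0;
         * wakeup followed by wait must not block. */
        union semun { int val; } arg;   /* callers must define this union */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

        arg.val = 0;
        semctl(semid, 0, SETVAL, arg);
        sema_wakeup(semid);
        sema_wait(semid);
        semctl(semid, 0, IPC_RMID);
        return 0;
    }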
regards, tom lane
> MaxBackendId is 64 by default, so that's not the limit you're hitting.
> It should be easier to configure MaxBackendId --- probably it should be
> an option to the configure script. I've put this on my personal to-do
> list. (I don't think it's a good idea to have *no* upper limit, even
Or even better, MaxBackendId could be set at run time, e.g. via a
postmaster option. Also, it would be nice if we could monitor the number
of backends currently running. Maybe we should have a new protocol for
this kind of purpose?
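(Say, a hypothetical "postmaster -N 128" switch, by analogy with the
existing -B buffers option.)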
BTW, as I pointed out before, PostgreSQL will have a serious problem
once it hits MaxBackendId. The patches I proposed for this still seem
to be under discussion. However, I think we should solve the problem
one way or another in the next release.
---
Tatsuo Ishii
On Fri, 29 Jan 1999, Patrick Verdon wrote:

> Tatsuo, Vadim, Oleg, Scrappy,
>
> Many thanks for the response.
>
> A couple of you weren't convinced that this
> is a Postgres problem so let me try to clarify
> things a little. Maybe the use of
> Apache and mod_perl is confusing the issue:
> the point I was trying to make is that if
> there are 49+ concurrent postgres processes
> on a normal machine (i.e. where kernel
> parameters are the defaults, etc.) the
> postmaster dies in a nasty way with
> potentially damaging results.
>
> Here's a case without Apache/mod_perl that
> causes exactly the same behaviour. Simply
> enter the following 49 times:
>
> kandinsky:patrick> psql template1 &
>
> Note that I tried to automate this without
> success:
>
> perl -e 'for ( 1..49 ) { system("/usr/local/pgsql/bin/psql template1 &"); }'
>
> The 49th attempt to initiate a connection
> fails:
>
> Connection to database 'template1' failed.
> pqReadData() -- backend closed the channel unexpectedly.
> This probably means the backend terminated abnormally before or while processing the request.
>
> and the error_log says:
>
> InitPostgres
> IpcSemaphoreCreate: semget failed (No space left on device) key=5432017, num=16, permission=600
this error indicates that you are out of semaphores...you have enough
configured to allow for 48 processes, but not the 49th...
> I have reproduced this behaviour on both
> FreeBSD 2.2.8 and Intel Solaris 2.6 using
> version 6.4.x of PostgreSQL.
Both of them have "default" settings for semaphores...I don't recall what
they are, but the error you are seeing from IpcSemaphoreCreate indicates
that you are exceeding them...
> I'll try to change some of the parameters
> suggested and see how far I get but the bottom
> line is Postgres shouldn't be dying like this.
PostgreSQL cannot allocate past what the operating system has hardcoded
as the max...maybe a more graceful exit is in order, though? Or is that
what you mean?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Tatsuo Ishii <t-ishii@sra.co.jp> writes:

> BTW, as I pointed out before, PostgreSQL will have a serious problem
> once it hits MaxBackendId. The patches I proposed for this still seem
> to be under discussion.
Not sure why that didn't get applied before, but I just put it in,
and verified that you can start exactly MaxBackendId backends
(assuming that you don't hit any kernel resource limits on the way).
BTW, we do recover quite gracefully from hitting MAXUPRC (kernel
limit on processes for one userid) :-). But that's just because the
postmaster's initial fork() fails. A failure any later than that
in backend startup will be treated as a backend crash ...
I agree with Hannu Krosing's remark that we really need some
documentation about kernel parameters that have to be checked when
setting up a non-toy database server. I've personally run into
NFILES limits, for instance, with not all that many backends running.
regards, tom lane