Getting FATAL: terminating connection due to administrator command
Hi
For the past few days we have been getting the above message.
The system is a current CentOS 5.5, x86_64, with PostgreSQL 8.4 as shipped in the packages postgresql84, postgresql84-libs etc. PostGIS is enabled, as shipped by http://www.argeo.org/linux/argeo-el.
The error message appears from time to time. The exact same request, coming from a PHP application, sometimes works and sometimes fails. This happens at different points in our applications, typically, but not only, when large amounts of data are queried, as in geometric queries using PostGIS.
The server is only slightly loaded.
Also in the PostgreSQL logs we get:
FATAL: terminating connection due to administrator command
repeated multiple times.
The server is from Dell; Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, is OK.
We do have a nearly identical server with the identical version of PostgreSQL/PostGIS, last updated one or two months ago and used intensively as our testing and development server, which has never given us this error message.
Where could I start to troubleshoot this problem?
Peter Hopfgartner
On Wed, Sep 15, 2010 at 02:55:39PM +0200, Peter Hopfgartner wrote:
Where could I start to troubleshoot this problem.
First with staff, then with unauthorized access, then with
failover software.
Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Since some days we are getting the above message.
Also in the PostgreSQL logs we get:
FATAL: terminating connection due to administrator command
This is a result of something sending SIGTERM to the backend process.
I have heard reports of "load management" software that SIGTERM's
processes more or less at random whenever it decides the system is
overloaded. If you have any such junkware installed on your server,
try disabling it.
The server is from Dell, Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, are ok.
Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine. This behavior is
definitely not normal for Unix systems, so you need to look for
nonstandard software (and what's more, nonstandard software running with
root privileges, else it couldn't SIGTERM postgres processes).
regards, tom lane
On 15/09/2010 10:07 PM, Tom Lane wrote:
The server is from Dell, Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, are ok.
Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine.
(A bit of a digression, but):
Personally I'd suggest being wary of any software supplied by the entity
that will be responsible for the costs of any warranty work. They won't
be at *all* sad if their software deflects blame and you don't discover
a fault until your server is out of warranty.
I've seen enough HDD vendor utilities report that a disk is just peachy,
thanks, when it's developing and reallocating bad sectors at a rate of
one every few minutes. ("Hey, you didn't need that boot block, I've
allocated you a shiny new one full of zeroes that's just as good.") The
S.M.A.R.T. "health check" tends to say everything's fine, too ... but if
you examine the fine print in the vendor attributes you see very high
reallocated sector counts, ECC error levels, and other signs of a dying
disk. I see this with so-called "enterprise" disks, not just consumer
SATA drives.
HDD vendors are certainly a particularly bad case, but nonetheless -
don't trust vendor diagnostic software in general. If it says the device
is broken I'll believe it because I trust them to make sure it won't
report expensive false positives - but if it says it's OK I'll merely
consider it not proven broken yet. False negatives work in their favour.
Find 3rd party diagnostic tools where possible, and where not possible
don't trust the overall health assessment provided by the vendor tools,
dig into the fine print in the diagnostics and see what the details are
like.
For hard disks, smartctl from smartmontools is a lifesaver. Your issue
doesn't sound HDD related, but it's worth mentioning for the future.
--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 16:07
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Since some days we are getting the above message.
Also in the PostgreSQL logs we get:
FATAL: terminating connection due to administrator command
This is a result of something sending SIGTERM to the backend process.
I have heard reports of "load management" software that SIGTERM's
processes more or less at random whenever it decides the system is
overloaded. If you have any such junkware installed on your server,
try disabling it.
The server is a rather bare-bones server for web mapping, so basically PostgreSQL/PostGIS, Apache, PHP, Tomcat and little else. The Dell software was the only software that did not come from CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell software completely, but the problem is still there.
The server is from Dell, Dell's hardware monitoring, OpenManage, says
that the hardware, in particular memory and disk, are ok.
Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine. This behavior is
definitely not normal for Unix systems, so you need to look for
nonstandard software (and what's more, nonstandard software running with
root privileges, else it couldn't SIGTERM postgres processes).
Other information: the disks are expensive SAS drives in a RAID 1 array, and the memory is ECC.
Security level is disabled.
SELinux is Permissive.
The server acts as a Xen host.
Is it reasonable to restrict the problem to kernel/hardware and/or PostgreSQL/PostGIS itself?
Can I trace where the SIGTERM comes from?
regards, tom lane
Regards,
Peter
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
This is a result of something sending SIGTERM to the backend process.
Can I trace where the SIGTERM comes from?
If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp
regards, tom lane
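As background to the tracing question above: on Linux the kernel records the sender's PID and UID with each signal sent via kill(2), which is the information the SystemTap approach tries to capture from outside. The following is a minimal Python sketch, assuming Linux and Python 3.3 or later, of a process reading that information for a SIGTERM it receives itself. The forked child merely stands in for the unknown sender. This is an illustration only: it works inside a process you control and cannot be attached to a running postgres backend.

```python
import os
import signal

# Block SIGTERM so it stays pending until we collect it explicitly.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})

parent_pid = os.getpid()
child_pid = os.fork()
if child_pid == 0:
    # Child: stands in for the unknown process that sends the SIGTERM.
    os.kill(parent_pid, signal.SIGTERM)
    os._exit(0)

# Wait up to 5 seconds for the signal; returns None on timeout.
info = signal.sigtimedwait({signal.SIGTERM}, 5)
os.waitpid(child_pid, 0)
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})

if info is not None:
    print(f"signal {info.si_signo} sent by pid {info.si_pid}, uid {info.si_uid}")
```

The sender's identity is available here only because the receiving process asks for it; a backend that simply logs the FATAL message does not record it, hence the need for external tracing.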
On Wednesday 15 September 2010, Peter Hopfgartner elucidated thus:
The server is a rather bare bone server for web mapping, so basically
PostgreSQL/PostGIS, Apache, PHP, Tomcat and little other stuff. The
Dell software was the only which did not come from
CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell stuff
completely, but the problem is still there.
Are you running out of memory and getting killed by the OOM killer?
j
--
Joshua Kugler
Part-Time System Admin/Programmer
http://www.eeinternet.com - Fairbanks, AK
PGP Key: http://pgp.mit.edu/ ID 0x73B13B6A
"Joshua J. Kugler" <joshua@eeinternet.com> writes:
On Wednesday 15 September 2010, Peter Hopfgartner elucidated thus:
The server is a rather bare bone server for web mapping, so basically
PostgreSQL/PostGIS, Apache, PHP, Tomcat and little other stuff. The
Dell software was the only which did not come from
CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell stuff
completely, but the problem is still there.
Are you running out of memory and getting killed by the OOM killer?
The OOM killer hits its victims with SIGKILL (kill -9), so we can rule
that out as not matching Peter's symptoms. This is definitely an
unwanted SIGTERM not SIGKILL.
regards, tom lane
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
This is a result of something sending SIGTERM to the backend process.
Can I trace where the SIGTERM comes from?
If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp
regards, tom lane
It seems I need to recompile the kernel in order to get the debug info; see http://sourceware.org/systemtap/wiki/SystemTapOnCentOS.
Tomorrow I will start the probes.
Thanks,
Peter
Hi -
peter.hopfgartner wrote:
As it seems I need to recompile the kernel in order to have the debug
infos, http://sourceware.org/systemtap/wiki/SystemTapOnCentOS.
Tomorrow I will start the probes.
Check first on http://debuginfo.centos.org/; they may have saved the
goods for your particular kernel build. (version/architecture?)
It may be also possible to craft a similar script that doesn't rely on
debuginfo.
- FChE
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
This is a result of something sending SIGTERM to the backend process.
Can I trace where the SIGTERM comes from?
If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp
OK, I was able to build the packages with the debug info and adapted the script to report SIGTERM sent to any process (see attachment). I tested this by sending some SIGTERMs and seeing them reported by systemtap.
However, we are currently unable to reproduce the problem on the server; magically, everything seems to work on this machine. The only changes should be the two debuginfo packages (kernel-debuginfo-common-2.6.18-194.11.3.el5.x86_64.rpm and kernel-xen-debuginfo-2.6.18-194.11.3.el5.x86_64.rpm).
At the same time, the problem appeared twice today on our testing server. This suggests we can drop the hardware failure hypothesis.
I'm running systemtap on the testing server now, too.
regards, tom lane
--------Peter Hopfgartner <peter.hopfgartner@r3-gis.com> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 16:56
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
This is a result of something sending SIGTERM to the backend process.
Can I trace where the SIGTERM comes from?
If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
Now we had the error, but systemtap did not report any SIGTERM. Is it possible to have this error without a SIGTERM being involved? As mentioned in a previous mail, I've modified the script to report SIGTERM sent to any process.
Peter
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?
Hmph. I would have said not, but ...
What PG version is this exactly?
regards, tom lane
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 17:37
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?
Hmph. I would have said not, but ...
What PG version is this exactly?
regards, tom lane
It says:
"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"
and should be identical to the current Red Hat implementation of postgresql84-server as in RHEL 5.5
Peter
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?
Hmph. I would have said not, but ...
What PG version is this exactly?
"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"
Well, I just took another careful look at the 8.4.x source code. The
only occurrence of that error string is in ProcessInterrupts() when it
sees ProcDiePending set, and so far as I can see ProcDiePending is only
set by the die() signal handler, and in postmaster child processes die()
is only used as a SIGTERM handler. So it's pretty hard to come to the
conclusion that it could be anything else.
I don't know systemtap well at all, so I can't suggest any reason why
your script might be missing SIGTERM events, but it seems that it must be.
Perhaps you can find a systemtap expert who can comment on that.
regards, tom lane
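The mechanism described above, a handler that only sets a flag and a later check that acts on it, can be sketched as follows. This is a loose Python analogy, not PostgreSQL's C code: the names `proc_die_pending`, `die` and `process_interrupts` mirror the identifiers mentioned above, while `run_query` and `FatalError` are hypothetical. It assumes a Unix-like system and that the code runs in the main thread.

```python
import os
import signal
import time

proc_die_pending = False      # analogue of the ProcDiePending flag

class FatalError(Exception):
    """Hypothetical stand-in for a FATAL error that ends the session."""

def die(signum, frame):
    """SIGTERM handler: only records that termination was requested."""
    global proc_die_pending
    proc_die_pending = True

def process_interrupts():
    """Called at safe points; acts on a pending termination request."""
    if proc_die_pending:
        raise FatalError("terminating connection due to administrator command")

def run_query(steps):
    """Hypothetical work loop that checks for interrupts between steps."""
    for step in range(steps):
        process_interrupts()
        if step == 2:
            # Simulate some other process sending us SIGTERM mid-query.
            os.kill(os.getpid(), signal.SIGTERM)
        time.sleep(0.01)
    return "query finished"

signal.signal(signal.SIGTERM, die)
try:
    result = run_query(10)
except FatalError as exc:
    result = f"FATAL: {exc}"
finally:
    signal.signal(signal.SIGTERM, signal.SIG_DFL)   # restore default handling
print(result)
```

In this sketch the message can only appear after the handler has run, which matches the reasoning that the error implies a SIGTERM was delivered.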
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 18:49
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?
Hmph. I would have said not, but ...
What PG version is this exactly?
"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"
Well, I just took another careful look at the 8.4.x source code. The
only occurrence of that error string is in ProcessInterrupts() when it
sees ProcDiePending set, and so far as I can see ProcDiePending is only
set by the die() signal handler, and in postmaster child processes die()
is only used as a SIGTERM handler. So it's pretty hard to come to the
conclusion that it could be anything else.
I don't know systemtap well at all, so I can't suggest any reason why
your script might be missing SIGTERM events, but it seems that it must
be.
Perhaps you can find a systemtap expert who can comment on that.
regards, tom lane
Is there any other way I can attack this issue? Attach GDB? If so, how? (I haven't done any serious C programming in this millennium.)
Peter
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
[...]
Now we had the error, but systemtap did not report any SIGTERM. Is
it possible to have this error without a SIGTERM being involved? As
mentioned in a previous mail, I've modified the script to report
SIGTERM sent to any process.
There are some other possibilities. It's possible that the version of
stap you're using is not expanding signal.send to all possible paths
of the kernel dispatching signals to your process.
So one might try a few different things:
------------------------------------------------------------------------
# see what die() is getting to work with
probe process("/usr/bin/postgres").function("die") {
  printf("%s[%d] received %d\n", execname(), pid(), $postgres_signal_arg)
}

# check for another process sending SIGTERM
probe syscall.kill {
  if (sig == 15) {
    printf("%s[%d] sending %s\n", execname(), pid(), argstr)
    print_ubacktrace()
  }
}

# heck, trace the whole statement sequence during the signal handling
probe process("/usr/bin/postgres").statement("die@*:*"),
      process("/usr/bin/postgres").statement("ProcessInterrupts@*:*") {
  printf("%s %s\n", pp(), $$vars)
}
------------------------------------------------------------------------
You can run that in the background. The second probe will give
systemwide SIGTERM activity, so you may need to filter it a bit.
If you know the appropriate postmaster process-id, you could change
the syscall.kill probe:
< if (sig == 15) {
> if (sig == 15 && pid == target_pid()) {
and invoke the script with stap ... -x PID_OF_YOUR_POSTGRES_SERVER
(In this case, "sig" and "pid" come from the syscall arguments, that
is, they represent the intended signal recipient rather than the sender;
see also 'stap -L signal.send'.)
Note that postgres does sometimes send signals to itself, so don't be
surprised to see post* processes show up there.
(A more modern system compiler & systemtap would give you much better
variable-value dumping options.)
- FChE
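On the point about signals reaching a process by more than one kernel path: the same SIGTERM handler runs whether the signal was sent with kill(2) or directed at a thread with pthread_kill(3), which glibc on Linux implements with tgkill(2). A probe or audit rule keyed only on the kill syscall would see the first path but not the second. A minimal Python sketch, assuming a Unix-like system and that the code runs in the main thread; `wait_for` is a hypothetical helper:

```python
import os
import signal
import threading
import time

received = []

def on_term(signum, frame):
    """Record each SIGTERM delivered to this process."""
    received.append(signum)

def wait_for(count, timeout=5.0):
    """Hypothetical helper: wait until `count` signals have been handled."""
    deadline = time.monotonic() + timeout
    while len(received) < count and time.monotonic() < deadline:
        time.sleep(0.01)

signal.signal(signal.SIGTERM, on_term)
try:
    # Path 1: process-directed signal, sent with kill(2).
    os.kill(os.getpid(), signal.SIGTERM)
    wait_for(1)
    # Path 2: thread-directed signal, sent with pthread_kill(3).
    signal.pthread_kill(threading.get_ident(), signal.SIGTERM)
    wait_for(2)
finally:
    signal.signal(signal.SIGTERM, signal.SIG_DFL)   # restore default handling

print(f"handler ran {len(received)} times")
```

From the receiver's side the two deliveries are indistinguishable, so a trace that misses a SIGTERM may simply be watching only one of the sending paths.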
Is there any other way [than SystemTap] I can attack this issue
[figuring out who sent a SIGTERM]? Attach GDB?
gdb won't tell you who sent the signal.
I don't really see any specific reason why sigmon.stp wouldn't show a SIGTERM if
one was sent.
An alternative could be to use auditd with something like this:
# auditctl -a entry,always -S kill -F a1=15
Regards,
Adrien Kunysz
--------"Frank Ch. Eigler" <fche@redhat.com> wrote--------
Subject: Re: Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 22:59
Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:
[...]
Now we had the error, but systemtap did not report any SIGTERM. Is
it possible to have this error without a SIGTERM being involved? As
mentioned in a previous mail, I've modified the script to report
SIGTERM sent to any process.
There are some other possibilities. It's possible that the version of
stap you're using is not expanding signal.send to all possible paths
of the kernel dispatching signals to your process.
So one might try a few different things:
------------------------------------------------------------------------
(...)
# check for another process sending SIGTERM
probe syscall.kill {
  if (sig == 15) {
    printf("%s[%d] sending %s\n", execname(), pid(), argstr)
    print_ubacktrace()
  }
}
I'm trying this one, currently.
(...)
The sad thing at the moment is that the problem has not occurred for days now. I did not change anything in the server setup; the system may just be used slightly less than when the problem occurred, so it is statistically less likely to show up.
(A more modern system compiler & systemtap would give you much better
variable-value dumping options.)
Is there an easy way to install those on a 5.5 system?
- FChE
Regards,
Peter