Getting FATAL: terminating connection due to administrator command

Started by Peter Hopfgartner over 15 years ago, 19 messages, general
#1 Peter Hopfgartner
peter.hopfgartner@r3-gis.com

Hi

For some days now we have been getting the above message.

The system is a current CentOS 5.5, x86_64, with PostgreSQL 8.4 as it comes with the packages postgresql84, postgresql84-libs etc. PostGIS is enabled, as it comes from http://www.argeo.org/linux/argeo-el.

The error message appears from time to time. The exact same request, coming from a PHP application, sometimes works and sometimes fails. This happens at different points in our applications, typically, but not only, when large amounts of data are queried, as in geometric queries using PostGIS.

The server is only slightly loaded.

Also in the PostgreSQL logs we get:

FATAL: terminating connection due to administrator command

repeated multiple times.

The server is from Dell. Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, is OK.

We do have a nearly identical server, with the identical version of PostgreSQL/PostGIS, which was last updated one or two months ago and is used intensively as our testing and development server. It has never given us this error message.

Where could I start to troubleshoot this problem?

Peter Hopfgartner

#2 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Peter Hopfgartner (#1)
Re: Getting FATAL: terminating connection due to administrator command

On Wed, Sep 15, 2010 at 02:55:39PM +0200, Peter Hopfgartner wrote:

Where could I start to troubleshoot this problem?

First with staff, then with unauthorized access, then with
failover software.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Hopfgartner (#1)
Re: Getting FATAL: terminating connection due to administrator command

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

For some days now we have been getting the above message.
Also in the PostgreSQL logs we get:
FATAL: terminating connection due to administrator command

This is a result of something sending SIGTERM to the backend process.

I have heard reports of "load management" software that SIGTERM's
processes more or less at random whenever it decides the system is
overloaded. If you have any such junkware installed on your server,
try disabling it.

The server is from Dell. Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, is OK.

Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine. This behavior is
definitely not normal for Unix systems, so you need to look for
nonstandard software (and what's more, nonstandard software running with
root privileges, else it couldn't SIGTERM postgres processes).

regards, tom lane

#4 Craig Ringer
craig@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: Getting FATAL: terminating connection due to administrator command

On 15/09/2010 10:07 PM, Tom Lane wrote:

The server is from Dell. Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, is OK.

Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine.

(A bit of a digression, but):

Personally I'd suggest being wary of any software supplied by the entity
that will be responsible for the costs of any warranty work. They won't
be at *all* sad if their software deflects blame and you don't discover
a fault until your server is out of warranty.

I've seen enough HDD vendor utilities report that a disk is just peachy,
thanks, when it's developing and reallocating bad sectors at a rate of
one every few minutes. ("Hey, you didn't need that boot block, I've
allocated you a shiny new one full of zeroes that's just as good.") The
S.M.A.R.T. "health check" tends to say everything's fine, too ... but if
you examine the fine print in the vendor attributes you see very high
reallocated sector counts, ECC error levels, and other signs of a dying
disk. I see this with so-called "enterprise" disks, not just consumer
SATA drives.

HDD vendors are certainly a particularly bad case, but nonetheless -
don't trust vendor diagnostic software in general. If it says the device
is broken I'll believe it because I trust them to make sure it won't
report expensive false positives - but if it says it's OK I'll merely
consider it not proven broken yet. False negatives work in their favour.

Find 3rd party diagnostic tools where possible, and where not possible
don't trust the overall health assessment provided by the vendor tools,
dig into the fine print in the diagnostics and see what the details are
like.

For hard disks, smartctl from smartmontools is a lifesaver. Your issue
doesn't sound HDD related, but it's worth mentioning for the future.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

#5 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Craig Ringer (#4)
Re: Getting FATAL: terminating connection due to administrator command

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 16:07

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

For some days now we have been getting the above message.
Also in the PostgreSQL logs we get:
FATAL: terminating connection due to administrator command

This is a result of something sending SIGTERM to the backend process.

I have heard reports of "load management" software that SIGTERM's
processes more or less at random whenever it decides the system is
overloaded. If you have any such junkware installed on your server,
try disabling it.

The server is a rather bare bone server for web mapping, so basically PostgreSQL/PostGIS, Apache, PHP, Tomcat and little other stuff. The Dell software was the only software that did not come from CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell stuff completely, but the problem is still there.

The server is from Dell. Dell's hardware monitoring, OpenManage, says that the hardware, in particular memory and disk, is OK.

Never dealt with OpenManage before, but you should cast a wary eye
upon any Dell-specific software on the machine. This behavior is
definitely not normal for Unix systems, so you need to look for
nonstandard software (and what's more, nonstandard software running with
root privileges, else it couldn't SIGTERM postgres processes).

Other information: the disks are costly SAS drives in a RAID 1 array, and the memory is ECC.
Security level is disabled.
SELinux is Permissive.
The server acts as a Xen host.

Is it reasonable to restrict the problem to kernel/hardware and/or PostgreSQL/PostGIS itself?

Can I trace where the SIGTERM comes from?

regards, tom lane

Regards,

Peter

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Hopfgartner (#5)
Re: Getting FATAL: terminating connection due to administrator command

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

This is a result of something sending SIGTERM to the backend process.

Can I trace where the SIGTERM comes from?

If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp
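
As a side note on the question of where a SIGTERM comes from: on Linux the kernel records the sender's PID and UID with each signal, and a process that collects the signal synchronously can read them. The Python sketch below only illustrates that mechanism. It is not how the PostgreSQL backend handles SIGTERM, the function names are made up for this example, and it assumes Linux and a single-threaded process.

```python
import os
import signal


def block_sigterm():
    """Block SIGTERM so it stays pending until we collect it explicitly."""
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})


def wait_for_sigterm(timeout=5.0):
    """Wait for a pending SIGTERM and report who sent it.

    Returns (signal number, sender PID, sender UID), or None on timeout.
    """
    info = signal.sigtimedwait({signal.SIGTERM}, timeout)
    if info is None:
        return None
    return info.si_signo, info.si_pid, info.si_uid
```

To try it, call block_sigterm(), print os.getpid(), call wait_for_sigterm() with a generous timeout, and send the process a SIGTERM from another shell; the second element of the result is the PID of the sender. An unmodified server has to be observed from outside the process, which is what the systemtap approach is for.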

regards, tom lane

#7 Joshua J. Kugler
joshua@eeinternet.com
In reply to: Peter Hopfgartner (#5)
Re: Getting FATAL: terminating connection due to administrator command

On Wednesday 15 September 2010, Peter Hopfgartner elucidated thus:

The server is a rather bare bone server for web mapping, so basically
PostgreSQL/PostGIS, Apache, PHP, Tomcat and little other stuff. The
Dell software was the only software that did not come from
CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell stuff
completely, but the problem is still there.

Are you running out of memory and getting killed by the OOM killer?

j

--
Joshua Kugler
Part-Time System Admin/Programmer
http://www.eeinternet.com - Fairbanks, AK
PGP Key: http://pgp.mit.edu/ ID 0x73B13B6A

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua J. Kugler (#7)
Re: Getting FATAL: terminating connection due to administrator command

"Joshua J. Kugler" <joshua@eeinternet.com> writes:

On Wednesday 15 September 2010, Peter Hopfgartner elucidated thus:

The server is a rather bare bone server for web mapping, so basically
PostgreSQL/PostGIS, Apache, PHP, Tomcat and little other stuff. The
Dell software was the only software that did not come from
CentOS/EPEL/argeo/in-house RPM packages. I've removed the Dell stuff
completely, but the problem is still there.

Are you running out of memory and getting killed by the OOM killer?

The OOM killer hits its victims with SIGKILL (kill -9), so we can rule
that out as not matching Peter's symptoms. This is definitely an
unwanted SIGTERM not SIGKILL.

regards, tom lane

#9 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Joshua J. Kugler (#7)
Re: Getting FATAL: terminating connection due to administrator command

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

This is a result of something sending SIGTERM to the backend process.

Can I trace where the SIGTERM comes from?

If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp

regards, tom lane

It seems I need to recompile the kernel in order to have the debug info, see http://sourceware.org/systemtap/wiki/SystemTapOnCentOS.
Tomorrow I will start the probes.

Thanks,
Peter

#10 Frank Ch. Eigler
fche@redhat.com
In reply to: Peter Hopfgartner (#9)
Re: Getting FATAL: terminating connection due to administrator command

Hi -

peter.hopfgartner wrote:

http://sourceware.org/systemtap/examples/process/sigmon.stp

It seems I need to recompile the kernel in order to have the debug
info, see http://sourceware.org/systemtap/wiki/SystemTapOnCentOS.
Tomorrow I will start the probes.

Check first on http://debuginfo.centos.org/; they may have saved the
goods for your particular kernel build. (version/architecture?)

It may be also possible to craft a similar script that doesn't rely on
debuginfo.

- FChE

#11 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Frank Ch. Eigler (#10)
Re: Getting FATAL: terminating connection due to administrator command

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

This is a result of something sending SIGTERM to the backend process.

Can I trace where the SIGTERM comes from?

If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp

OK, I was able to create the packages with the debug info and adapted the script to report SIGTERM sent to any process (see attachment). I tested this by sending some SIGTERMs and seeing them reported by systemtap.

Anyway, currently we are not able to reproduce the problem on the server; magically, everything seems to work on this machine. The only changes should be the 2 debug-info packages (kernel-debuginfo-common-2.6.18-194.11.3.el5.x86_64.rpm and kernel-xen-debuginfo-2.6.18-194.11.3.el5.x86_64.rpm).

At the same time, the problem appeared twice today on our testing server. This suggests dropping the hardware failure hypothesis.

I'm running systemtap on the testing server now, too.

regards, tom lane

Attachments:

sigmon1.stp (application/octet-stream)

#12 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Peter Hopfgartner (#11)
Re: Getting FATAL: terminating connection due to administrator command

--------Peter Hopfgartner <peter.hopfgartner@r3-gis.com> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 16:56

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 15.09.2010 17:40

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

This is a result of something sending SIGTERM to the backend process.

Can I trace where the SIGTERM comes from?

If this is a recent Red-Hat-based release, I think that systemtap could
probably be used to determine that. There's a script here that solves
a related problem:
http://sourceware.org/systemtap/examples/process/sigmon.stp

Now we had the error, but systemtap did not report any SIGTERM. Is it possible to have this error without a SIGTERM being involved? As mentioned in a previous mail, I've modified the script to report SIGTERM sent to any process.

Peter

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Hopfgartner (#12)
Re: Getting FATAL: terminating connection due to administrator command

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?

Hmph. I would have said not, but ...

What PG version is this exactly?

regards, tom lane

#14 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Peter Hopfgartner (#12)
Re: Getting FATAL: terminating connection due to administrator command

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 17:37

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?

Hmph. I would have said not, but ...

What PG version is this exactly?

regards, tom lane

It says:

"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"

and should be identical to the current Red Hat implementation of postgresql84-server as in RHEL 5.5.

Peter

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Hopfgartner (#14)
Re: Getting FATAL: terminating connection due to administrator command

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?

Hmph. I would have said not, but ...

What PG version is this exactly?

"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"

Well, I just took another careful look at the 8.4.x source code. The
only occurrence of that error string is in ProcessInterrupts() when it
sees ProcDiePending set, and so far as I can see ProcDiePending is only
set by the die() signal handler, and in postmaster child processes die()
is only used as a SIGTERM handler. So it's pretty hard to come to the
conclusion that it could be anything else.

I don't know systemtap well at all, so I can't suggest any reason why
your script might be missing SIGTERM events, but it seems that it must be.
Perhaps you can find a systemtap expert who can comment on that.

regards, tom lane

#16 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Peter Hopfgartner (#14)
Re: Getting FATAL: terminating connection due to administrator command

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------
Subject: Re: [GENERAL] Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 18:49

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

--------Tom Lane <tgl@sss.pgh.pa.us> wrote--------

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

Now we had the error, but systemtap did not report any SIGTERM. Is it
possible to have this error without a SIGTERM being involved?

Hmph. I would have said not, but ...

What PG version is this exactly?

"PostgreSQL 8.4.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit"

Well, I just took another careful look at the 8.4.x source code. The
only occurrence of that error string is in ProcessInterrupts() when it
sees ProcDiePending set, and so far as I can see ProcDiePending is only
set by the die() signal handler, and in postmaster child processes die()
is only used as a SIGTERM handler. So it's pretty hard to come to the
conclusion that it could be anything else.

I don't know systemtap well at all, so I can't suggest any reason why
your script might be missing SIGTERM events, but it seems that it must be.
Perhaps you can find a systemtap expert who can comment on that.

regards, tom lane

Is there any other way I can attack this issue? Attach GDB? How? (I haven't done any serious C programming in this millennium.)

Peter

#17 Frank Ch. Eigler
fche@redhat.com
In reply to: Peter Hopfgartner (#12)
Re: Getting FATAL: terminating connection due to administrator command

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

[...]

http://sourceware.org/systemtap/examples/process/sigmon.stp

Now we had the error, but systemtap did not report any SIGTERM. Is
it possible to have this error without a SIGTERM being involved? As
mentioned in a previous mail, I've modified the script to report
SIGTERM sent to any process.

There are some other possibilities. It's possible that the version of
stap you're using is not expanding signal.send to all possible paths
of the kernel dispatching signals to your process.

So one might try a few different things:

------------------------------------------------------------------------
# see what die() is getting to work with
probe process("/usr/bin/postgres").function("die") {
  printf("%s[%d] received %d\n", execname(), pid(), $postgres_signal_arg)
}

# check for another process sending SIGTERM
probe syscall.kill {
  if (sig == 15) {
    printf("%s[%d] sending %s\n", execname(), pid(), argstr)
    print_ubacktrace()
  }
}

# heck, trace the whole statement sequence during the signal handling
probe process("/usr/bin/postgres").statement("die@*:*"),
      process("/usr/bin/postgres").statement("ProcessInterrupts@*:*") {
  printf("%s %s\n", pp(), $$vars)
}
------------------------------------------------------------------------

You can run that in the background. The second probe will give
systemwide SIGTERM activity, so you may need to filter it a bit.
If you know the appropriate postmaster process-id, you could change
the syscall.kill probe:

<   if (sig == 15) {
>   if (sig == 15 && pid == target_pid()) {

and invoke the script with stap ... -x PID_OF_YOUR_POSTGRES_SERVER
(In this case, "sig" and "pid" come from the syscall arguments, that
is, they represent the intended signal recipient rather than the sender;
see also 'stap -L signal.send'.)

Note that postgres does sometimes send signals to itself, so don't be
surprised to see post* processes show up there.

(A more modern system compiler & systemtap would give you much better
variable-value dumping options.)
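
On the earlier point that a probe may not cover every path by which a signal is dispatched: the same SIGTERM can reach a process through more than one interface. The Python sketch below is an illustration with invented helper names. It delivers SIGTERM to the current process once with os.kill, which wraps kill(2), and once with signal.pthread_kill, which wraps pthread_kill(3); that the C library implements the latter with a different system call (tgkill on Linux with glibc) is stated here as an assumption. The handler sees both deliveries identically.

```python
import os
import signal
import threading
import time

received = []  # signal numbers seen by the handler, in order


def on_term(signum, frame):
    received.append(signum)


def install():
    """Install the recording handler for SIGTERM."""
    signal.signal(signal.SIGTERM, on_term)


def send_with_kill():
    # Process-directed delivery.
    os.kill(os.getpid(), signal.SIGTERM)


def send_with_pthread_kill():
    # Thread-directed delivery to the calling thread.
    signal.pthread_kill(threading.get_ident(), signal.SIGTERM)


def deliver(send, timeout=2.0):
    """Send one SIGTERM via `send` and return how many the handler recorded."""
    before = len(received)
    send()
    deadline = time.monotonic() + timeout
    while len(received) == before and time.monotonic() < deadline:
        time.sleep(0.01)
    return len(received) - before
```

After install(), both deliver(send_with_kill) and deliver(send_with_pthread_kill) record one SIGTERM each. A trace that watches only the kill system call would, under the assumption above, report the first and stay silent about the second, which is one way a monitoring script can miss a signal that the target nevertheless receives.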

- FChE

#18 Adrien Kunysz
adk@redhat.com
In reply to: Frank Ch. Eigler (#17)
Re: Getting FATAL: terminating connection due to administrator command

Is there any other way [than SystemTap] I can attack this issue
[figuring out who sent a SIGTERM]? Attach GDB?

gdb won't tell you who sent the signal.

I don't really see any specific reason why sigmon.stp wouldn't show a SIGTERM if
one was sent.

An alternative could be to use auditd with something like this:
# auditctl -a entry,always -S kill -F a1=15

Regards,
Adrien Kunysz

#19 Peter Hopfgartner
peter.hopfgartner@r3-gis.com
In reply to: Adrien Kunysz (#18)
Re: Getting FATAL: terminating connection due to administrator command

--------"Frank Ch. Eigler" <fche@redhat.com> wrote--------
Subject: Re: Getting FATAL: terminating connection due to administrator command
Date: 16.09.2010 22:59

Peter Hopfgartner <peter.hopfgartner@r3-gis.com> writes:

[...]

http://sourceware.org/systemtap/examples/process/sigmon.stp

Now we had the error, but systemtap did not report any SIGTERM. Is
it possible to have this error without a SIGTERM being involved? As
mentioned in a previous mail, I've modified the script to report
SIGTERM sent to any process.

There are some other possibilities. It's possible that the version of
stap you're using is not expanding signal.send to all possible paths
of the kernel dispatching signals to your process.

So one might try a few different things:

------------------------------------------------------------------------

(...)

# check for another process sending SIGTERM
probe syscall.kill {
  if (sig == 15) {
    printf("%s[%d] sending %s\n", execname(), pid(), argstr)
    print_ubacktrace()
  }
}

I'm trying this one, currently.

(...)

The sad thing, at the moment, is that the problem has not occurred for days now. I did not change anything in the setup of the server; the only difference is that the system might be used slightly less than when the problem occurred, so it is statistically less likely to occur.

(A more modern system compiler & systemtap would give you much better
variable-value dumping options.)

Is there an easy way to install those on a 5.5 system?

- FChE

Regards,
Peter