PostgreSQL 7.3.4 gets killed by SIGKILL
I have this big table running on an old linux install (kernel 2.2.25).
I've COPYed some tcpip logs into a table created as such:
create table ipstats (time timestamp, src inet, dst inet, npackets int8,
nbytes int8);
Big:
select count(*) from ipstats;
count
----------
99173733
When I run two selects against that table multiple times, the
backend doing the selects gets killed by signal 9.
The select pair looks like:
select sum(nbytes) from ipstats where dst = '10.10.10.170';
select sum(nbytes) from ipstats where src = '10.10.10.170';
This is what the serverlog says:
LOG: server process (pid 20308) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and
semaphores
FATAL: The database system is starting up
LOG: database system was interrupted at 2003-12-03 23:21:49 CET
FATAL: The database system is starting up
LOG: checkpoint record is at 3/9095BC20
LOG: redo record is at 3/9095BC20; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 8716399; next oid: 141842933
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: ReadRecord: record with zero length at 3/9095BC60
LOG: redo is not required
LOG: database system is ready
When I attach gdb to the process it doesn't help; it exits immediately
anyway. I believe this is because SIGKILL is "unstoppable": it can be
neither caught nor ignored...
Any ideas as to what to do?
Regards
Magnus
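[Editorial aside: a minimal shell demo of why gdb cannot help here. A
process can trap SIGTERM and keep running, but SIGKILL is delivered
without the process ever regaining control, so no handler or debugger
gets a say. Nothing here is specific to PostgreSQL:]

# A trapped SIGTERM runs the handler and the shell keeps going:
sh -c 'trap "echo caught TERM" TERM; kill -TERM $$; echo still here'
# SIGKILL cannot be trapped or caught; nothing after this ever runs:
sh -c 'kill -KILL $$; echo never printed'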
"Magnus Naeslund(t)" <mag@fbab.net> writes:
> I have this big table running on an old linux install (kernel 2.2.25).
> I've COPYed some tcpip logs into a table created as such:
Linux is probably killing your process because it (the kernel) is low
on memory. Unfortunately, this happens more often with older versions
of the kernel. Add more RAM/swap or figure out how to make your query
use less memory...
-Doug
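[One way to test Doug's theory: the kernel's OOM killer usually leaves
a trace in the kernel log (on kernels that log OOM kills at all), and
memory headroom can be watched while the query runs. A rough checklist,
assuming a typical syslog setup; the log path varies by distribution:]

# Look for OOM-killer messages in the kernel ring buffer and syslog:
dmesg | grep -i -e 'out of memory' -e 'killed process'
grep -i oom /var/log/messages
# Watch free memory and swap activity while the query runs:
free -m
vmstat 5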
Doug McNaught wrote:
"Magnus Naeslund(t)" <mag@fbab.net> writes:
I have this big table running on an old linux install (kernel 2.2.25).
I've COPYed some tcpip logs into a table created as such:Linux is probably killing your process because it (the kernel) is low
on memory. Unfortunately, this happens more often with older versions
of the kernel. Add more RAM/swap or figure out how to make your query
use less memory...-Doug
Well, this just isn't the case.
There is no printout in kernel logs/dmesg (as it would be if the kernel
killed it in an OOM situation).
I have 1 GB of RAM, and 1.5 GB of swap (swap never touched).
When running the query I have about 850 MB sitting in kernel cache, the
postgres process takes about 40 MB of memory, and the ipcs -m command
shows that PostgreSQL is taking 41508864 bytes of shared memory.
There is no sorting and there are no index lookups going on; the query
is simple.
I just had a power outage; I'll check whether it behaves better after
the reboot, but I doubt it.
Is it possible to somehow find out what process sent the KILL (or
whether it's the kernel)?
I find this very weird, to say the least...
Magnus
On Thu, 04 Dec 2003 03:35:49 +0100
"Magnus Naeslund(t)" <mag@fbab.net> wrote:
> Well, this just isn't the case.
> There is no printout in kernel logs/dmesg (as it would be if the
> kernel killed it in an OOM situation).
> I have 1 GB of RAM, and 1.5 GB of swap (swap never touched).
Do you have any system monitoring scripts that may be killing it
because it looks like a "runaway" process?
We've had this happen to us before. You tend to forget about things
like that.
--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
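[Jeff's suggestion is worth checking mechanically. A rough sweep for
cron jobs or watchdog scripts that send kill -9, assuming the standard
cron locations; adjust paths for the distribution:]

# System-wide and per-user cron jobs:
cat /etc/crontab
ls /etc/cron.d /etc/cron.daily /etc/cron.hourly 2>/dev/null
crontab -l -u root; crontab -l -u postgres
# Any script in the usual places that sends SIGKILL by hand?
grep -r -e 'kill -9' -e 'kill -KILL' /etc/cron* /usr/local/bin 2>/dev/null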
"Magnus Naeslund(t)" <mag@fbab.net> writes:
> Doug McNaught wrote:
>> Linux is probably killing your process because it (the kernel) is low
>> on memory. Unfortunately, this happens more often with older versions
>> of the kernel. Add more RAM/swap or figure out how to make your query
>> use less memory...
>> -Doug
> Well, this just isn't the case.
> There is no printout in kernel logs/dmesg (as it would be if the
> kernel killed it in an OOM situation).
> I have 1 GB of RAM, and 1.5 GB of swap (swap never touched).
Ahh, that's an additional piece of information that you didn't supply
earlier. ;)
Though your system memory is ample, is it possible that you're hitting
a ulimit on the stack size or heap size or something? I'm not sure
what signal you'd get in such a case, though.
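[Doug's ulimit theory is easy to check. Because the startup scripts go
through "su - postgres", the limits that matter are the ones a full
login gives the postgres account, not those of an interactive root
shell:]

# Limits in the current shell:
ulimit -a
# Limits the server actually inherits via a login as postgres:
su - postgres -c 'ulimit -a'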
> Is it possible to somehow find out what process sent the KILL (or
> whether it's the kernel)?
Not that I know of, unless it's in a logfile somewhere. You could try
strace(8) on the backend running the query--that might give you some
more info.
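[For the strace idea, something like the following; 20308 is the dead
backend's pid from the server log, so a live pid would come from ps.
When the process dies, strace reports the fatal signal, although it
cannot name the sender either:]

# Attach to the running backend, follow children, log to a file:
strace -f -o /tmp/backend.trace -p 20308
# When the backend dies, the end of the trace shows the fatal
# signal, e.g. "+++ killed by SIGKILL +++".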
> I find this very weird, to say the least...
Yah. You might also consider running a more recent kernel, especially
with such a big machine. 2.2.X never did play that well with large
amounts of RAM...
-Doug
Jeff wrote:
> Do you have any system monitoring scripts that may be killing it
> because it looks like a "runaway" process?
> We've had this happen to us before. You tend to forget about things
> like that.
This got me thinking, and I rechecked all the possibilities.
It turned out that we had changed rlimit policies earlier, and the
"default" CPU time limits bled over to PostgreSQL because there was no
overriding entry for it in the PAM limits configuration.
Since the startup scripts use "su - postgres -c cmd", the server
effectively "logged in" and so picked up the new default CPU time
limits. (The kernel sends SIGXCPU when a process reaches its soft CPU
limit and SIGKILL when it exceeds the hard one, which explains the
untraceable signal 9.)
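[For reference, the fix amounts to giving the postgres account its own
entry in the PAM limits file so the site-wide default no longer
applies. An illustrative /etc/security/limits.conf fragment, not
Magnus's actual file; older pam_limits versions spell "unlimited" as
-1:]

# domain        type    item    value
postgres        soft    cpu     unlimited
postgres        hard    cpu     unlimited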
So it was only a bug in our own setup, and that's good :)
Magnus