"recovery mode"
What exactly is "recovery mode"? Today, the backend went into recovery
mode, and simply wouldn't do anything. Not using any CPU, and would not go
away even with a kill -9. I ended up having to reboot the machine to get
the database working again....
steve
"Steve Wolfe" <steve@iboats.com> writes:
> What exactly is "recovery mode"? Today, the backend went into recovery
> mode, and simply wouldn't do anything. Not using any CPU, and would not go
> away even with a kill -9. I ended up having to reboot the machine to get
> the database working again....
I don't think recovery mode actually does much in 7.0.* --- I think it's
just a stub (Vadim might know better though). In 7.1 it means the thing
is replaying the WAL log after a crash. In any case it shouldn't
create a lockup condition like that.
The only cases I've ever heard of where a user process couldn't be
killed with kill -9 are where it's stuck in a kernel call (and the
kill response is being held off till the end of the kernel call).
Any such situation is arguably a kernel bug, of course, but that's
not a lot of comfort.
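
As a side note to the kill -9 point above, here is a minimal sketch of how one might check whether a process is stuck inside a kernel call. It assumes a Linux system with /proc mounted, which the thread does not confirm, and the helper names (parse_state, process_state, explain) are made up for illustration. On Linux a process in state "D" (uninterruptible sleep) normally does not act on signals, kill -9 included, until the kernel call returns.

```python
# Sketch only: assumes Linux with /proc mounted. Helper names are hypothetical.

# A few one-letter process state codes documented in proc(5).
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (waiting inside a kernel call)",
    "Z": "zombie",
    "T": "stopped",
}


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read the current state code of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def explain(state):
    """Map a state code to a short description; unknown codes give 'other'."""
    return STATE_NAMES.get(state, "other")
```

Calling explain(process_state(pid)) with the PID of the stuck backend would report whether it is in uninterruptible sleep. If it is, the hang is more likely in the kernel or the hardware than in the database code.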
Exactly which process were you sending kill -9 to, anyway? There should
have been a postmaster and one backend running the recovery-mode code.
If the postmaster was responding to connection requests with an error
message, then I would not say that it was locked up.
regards, tom lane
> I don't think recovery mode actually does much in 7.0.* --- I think it's
> just a stub (Vadim might know better though). In 7.1 it means the thing
> is replaying the WAL log after a crash. In any case it shouldn't
> create a lockup condition like that.
>
> The only cases I've ever heard of where a user process couldn't be
> killed with kill -9 are where it's stuck in a kernel call (and the
> kill response is being held off till the end of the kernel call).
> Any such situation is arguably a kernel bug, of course, but that's
> not a lot of comfort.
>
> Exactly which process were you sending kill -9 to, anyway? There should
> have been a postmaster and one backend running the recovery-mode code.
> If the postmaster was responding to connection requests with an error
> message, then I would not say that it was locked up.
I believe it was a backend that I tried the kill -9 on. I knew it wasn't
a good thing to do, but I had to get it running again. It's amazing
how bold you get when you hear an entire department mumbling about "Why
isn't the site working?". : )
Anyway, I think the problem wasn't in postgres. I rebooted the machine,
and it worked - for about ten minutes. Then, it froze, with the kernel
crapping out. I rebooted it, and it lasted about three minutes until the same
thing happened. On the next reboot, it didn't even get through the fsck before
it did it again.
I looked at the CPU temps, one of the four was warmer than it should be,
but still within acceptable limits (40 C). So, I shut it down, reseated the
RAM chassis, the DIMMs, the CPUs, and the expansion cards. When it came
up, I compiled and put on a newer kernel (I guess there was some good in the
crashes), and then it worked fine. Because of the symptoms, I imagine that
it was a flaky connection. Odd, considering that everything except the
DIMMs (including the CPUs) is literally screwed to the motherboard!
steve