"recovery mode"

Started by Steve Wolfe about 25 years ago, 3 messages, general
#1 Steve Wolfe
steve@iboats.com

What exactly is "recovery mode"? Today, the backend went into recovery
mode, and simply wouldn't do anything. Not using any CPU, and would not go
away even with a kill -9. I ended up having to reboot the machine to get
the database working again....

steve

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Steve Wolfe (#1)
Re: "recovery mode"

"Steve Wolfe" <steve@iboats.com> writes:

> What exactly is "recovery mode"? Today, the backend went into recovery
> mode, and simply wouldn't do anything. Not using any CPU, and would not go
> away even with a kill -9. I ended up having to reboot the machine to get
> the database working again....

I don't think recovery mode actually does much in 7.0.* --- I think it's
just a stub (Vadim might know better though). In 7.1 it means the thing
is replaying the WAL log after a crash. In any case it shouldn't
create a lockup condition like that.

The only cases I've ever heard of where a user process couldn't be
killed with kill -9 are where it's stuck in a kernel call (and the
kill response is being held off till the end of the kernel call).
Any such situation is arguably a kernel bug, of course, but that's
not a lot of comfort.
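
One way to check for the stuck-in-a-kernel-call case described above is to look at the process state the kernel reports. The following is only a minimal sketch, assuming a Linux system where /proc/<pid>/stat is available; the helper names parse_state and is_uninterruptible are hypothetical and not part of PostgreSQL. A state of "D" (uninterruptible sleep) typically means the process is waiting inside the kernel, and a pending kill -9 will not take effect until that wait ends.

```python
def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line.

    Assumes the Linux layout: "pid (comm) state ppid ...".
    """
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' in the line.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(pid: int) -> bool:
    """True if the process is in state 'D' (uninterruptible sleep).

    Hypothetical helper: reads /proc/<pid>/stat, so it is Linux-only and
    raises FileNotFoundError if the process does not exist.
    """
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read()) == "D"
```

Calling is_uninterruptible with the PID of the unresponsive backend would show whether it is in that state; on most systems the STAT column of ps reports the same letter.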

Exactly which process were you sending kill -9 to, anyway? There should
have been a postmaster and one backend running the recovery-mode code.
If the postmaster was responding to connection requests with an error
message, then I would not say that it was locked up.

regards, tom lane

#3 Steve Wolfe
steve@iboats.com
In reply to: Steve Wolfe (#1)
Re: "recovery mode"

> I don't think recovery mode actually does much in 7.0.* --- I think it's
> just a stub (Vadim might know better though). In 7.1 it means the thing
> is replaying the WAL log after a crash. In any case it shouldn't
> create a lockup condition like that.
>
> The only cases I've ever heard of where a user process couldn't be
> killed with kill -9 are where it's stuck in a kernel call (and the
> kill response is being held off till the end of the kernel call).
> Any such situation is arguably a kernel bug, of course, but that's
> not a lot of comfort.
>
> Exactly which process were you sending kill -9 to, anyway? There should
> have been a postmaster and one backend running the recovery-mode code.
> If the postmaster was responding to connection requests with an error
> message, then I would not say that it was locked up.

I believe that it was a backend that I tried to kill -9. I knew it wasn't
a good thing to do, but I had to get it running again. It's amazing
how bold you get when you hear an entire department mumbling, "Why
isn't the site working?" : )

Anyway, I think the problem wasn't in postgres. I rebooted the machine,
and it worked - for about ten minutes. Then, it froze, with the kernel
crapping out. I rebooted it, it lasted about three minutes until the same
thing happened. Reboot, it didn't even get through the fsck before it did
it again.

I looked at the CPU temps; one of the four was warmer than it should be,
but still within acceptable limits (40 C). So, I shut it down, reseated the
RAM chassis, the DIMMs, the CPUs, and the expansion cards. When it came
up, I compiled and put on a newer kernel (I guess there was some good in the
crashes), and then it worked fine. Because of the symptoms, I imagine that
it was a flaky connection. Odd, considering that everything except the
DIMMs (including the CPUs) is literally screwed to the motherboard!

steve