"recovery mode"

Started by Steve Wolfe, over 25 years ago, 3 messages, general

steve@iboats.com

over 25 years ago

What exactly is "recovery mode"? Today, the backend went into recovery
mode, and simply wouldn't do anything. Not using any CPU, and would not go
away even with a kill -9. I ended up having to reboot the machine to get
the database working again....

steve

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Steve Wolfe (#1)

Re: "recovery mode"

"Steve Wolfe" <steve@iboats.com> writes:

> What exactly is "recovery mode"? Today, the backend went into recovery
> mode, and simply wouldn't do anything. Not using any CPU, and would not go
> away even with a kill -9. I ended up having to reboot the machine to get
> the database working again....

I don't think recovery mode actually does much in 7.0.* --- I think it's
just a stub (Vadim might know better though). In 7.1 it means the thing
is replaying the WAL log after a crash. In any case it shouldn't
create a lockup condition like that.

The only cases I've ever heard of where a user process couldn't be
killed with kill -9 are where it's stuck in a kernel call (and the
kill response is being held off till the end of the kernel call).
Any such situation is arguably a kernel bug, of course, but that's
not a lot of comfort.
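
As an illustration of the point above, here is a minimal sketch for checking whether a process is in uninterruptible sleep (state "D"), which is the state in which a kill -9 is held off until the kernel call returns. It assumes a Linux system with /proc mounted; the function names are hypothetical and not part of PostgreSQL or any library.

```python
from pathlib import Path


def parse_stat_line(stat_line):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    rest = stat_line[close_idx + 1:].split()
    # rest[0] is the one-letter state, rest[1] is the parent PID.
    return pid, comm, rest[0], int(rest[1])


def is_uninterruptible(stat_line):
    """True if the process is in state 'D' (uninterruptible sleep)."""
    return parse_stat_line(stat_line)[2] == "D"


def read_stat(pid):
    """Read the raw stat line for a PID (assumption: Linux with /proc)."""
    return Path(f"/proc/{pid}/stat").read_text()
```

For a suspect PID, is_uninterruptible(read_stat(pid)) returning True would be consistent with a process stuck inside a kernel call; it does not say which call or why.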

Exactly which process were you sending kill -9 to, anyway? There should
have been a postmaster and one backend running the recovery-mode code.
If the postmaster was responding to connection requests with an error
message, then I would not say that it was locked up.
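
One rough way to tell the postmaster apart from its backends is by parent PID, since backends are forked children of the postmaster. The sketch below parses the text output of `ps -eo pid,ppid,comm`; the function names and the sample process names are illustrative assumptions, and the postmaster PID has to be supplied by the caller.

```python
def parse_ps_table(ps_output):
    """Parse `ps -eo pid,ppid,comm` text into (pid, ppid, command) tuples."""
    rows = []
    # Skip the header line, then split each row into three fields.
    for line in ps_output.strip().splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed or truncated rows
        rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def children_of(rows, parent_pid):
    """Return the PIDs whose parent is parent_pid (e.g. the postmaster)."""
    return [pid for pid, ppid, _ in rows if ppid == parent_pid]
```

Given the postmaster's PID, children_of(parse_ps_table(text), postmaster_pid) lists the candidate backend PIDs, which makes it clearer which process a signal was actually sent to.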

regards, tom lane

Steve Wolfe

steve@iboats.com

over 25 years ago

In reply to: Steve Wolfe (#1)

Re: "recovery mode"

> I don't think recovery mode actually does much in 7.0.* --- I think it's
> just a stub (Vadim might know better though). In 7.1 it means the thing
> is replaying the WAL log after a crash. In any case it shouldn't
> create a lockup condition like that.
>
> The only cases I've ever heard of where a user process couldn't be
> killed with kill -9 are where it's stuck in a kernel call (and the
> kill response is being held off till the end of the kernel call).
> Any such situation is arguably a kernel bug, of course, but that's
> not a lot of comfort.
>
> Exactly which process were you sending kill -9 to, anyway? There should
> have been a postmaster and one backend running the recovery-mode code.
> If the postmaster was responding to connection requests with an error
> message, then I would not say that it was locked up.

I believe that it was a backend that I tried -9'ing. I knew it wasn't
a good thing to do, but I had to get it running again. It's amazing
how bold you get when you hear an entire department mumbling about "Why
isn't the site working?". : )

Anyway, I think the problem wasn't in postgres. I rebooted the machine,
and it worked - for about ten minutes. Then, it froze, with the kernel
crapping out. I rebooted it, it lasted about three minutes until the same
thing happened. Reboot, it didn't even get through the fsck before it did
it again.

I looked at the CPU temps; one of the four was warmer than it should be,
but still within acceptable limits (40 C). So, I shut it down, reseated the
RAM chassis, the DIMMs, the CPUs, and the expansion cards. When it came
up, I compiled and put on a newer kernel (I guess there was some good in the
crashes), and then it worked fine. Because of the symptoms, I imagine that
it was a flaky connection. Odd, considering that everything except the
DIMMs (including the CPUs) is literally screwed to the motherboard!

steve