kill -KILL: What happens?

Started by David Fetter over 15 years ago, 45 messages, hackers
#1David Fetter
david@fetter.org

Folks,

I've noticed over the years that we give people dire warnings never to
send a KILL signal to the postmaster, but I'm unsure as to what are
potential consequences of this, as in just exactly how this can result
in problems. Is there some reference I can look to for explanations
of the mechanism(s) whereby the damage occurs?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#1)
Re: kill -KILL: What happens?

David Fetter <david@fetter.org> writes:

I've noticed over the years that we give people dire warnings never to
send a KILL signal to the postmaster, but I'm unsure as to what are
potential consequences of this, as in just exactly how this can result
in problems. Is there some reference I can look to for explanations
of the mechanism(s) whereby the damage occurs?

There's no risk of data corruption, if that's what you're thinking of.
It's just that you're then looking at having to manually clean up the
child processes and then restart the postmaster; a process that is not
only tedious but does offer the possibility of screwing yourself.

In particular the risk is that someone clueless enough to do this would
next decide that removing $PGDATA/postmaster.pid, rather than killing
all the existing children, is the quickest way to get the postmaster
restarted. Once he's done that, his data will shortly be hosed beyond
recovery, because now he has two noncommunicating sets of backends
massaging the same files via separate sets of shared buffers.

The reason this sequence of events doesn't seem improbable is that the
error you get when you try to start a new postmaster, if there are still
old backends running, is

FATAL: pre-existing shared memory block (key 5490001, ID 15609) is still in use
HINT: If you're sure there are no old server processes still running, remove the shared memory block or just delete the file "postmaster.pid".

Maybe we should rewrite that HINT --- while it's *possible* that
removing the shmem block or deleting postmaster.pid is the right thing
to do, it's not exactly *likely*. I think we need to put a bit more
emphasis on the "If ..." part. Like "If you are prepared to swear on
your mother's grave that there are no old server processes still
running, consider removing postmaster.pid. But first check for existing
processes again."

(BTW, I notice that this interlock against starting a new postmaster
appears to be broken in HEAD, which is likely not unrelated to the fact
that the contents of postmaster.pid seem to be totally bollixed :-()

regards, tom lane

#3David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 10:41:28AM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

I've noticed over the years that we give people dire warnings never to
send a KILL signal to the postmaster, but I'm unsure as to what are
potential consequences of this, as in just exactly how this can result
in problems. Is there some reference I can look to for explanations
of the mechanism(s) whereby the damage occurs?

There's no risk of data corruption, if that's what you're thinking of.
It's just that you're then looking at having to manually clean up the
child processes and then restart the postmaster; a process that is not
only tedious but does offer the possibility of screwing yourself.

Does this mean that there's no cross-platform way to ensure that
killing a process results in its children's timely (i.e. before damage
can occur) death? That such a way isn't practical from a performance
point of view?

In particular the risk is that someone clueless enough to do this would
next decide that removing $PGDATA/postmaster.pid, rather than killing
all the existing children, is the quickest way to get the postmaster
restarted. Once he's done that, his data will shortly be hosed beyond
recovery, because now he has two noncommunicating sets of backends
massaging the same files via separate sets of shared buffers.

Right.

The reason this sequence of events doesn't seem improbable is that the
error you get when you try to start a new postmaster, if there are still
old backends running, is

FATAL: pre-existing shared memory block (key 5490001, ID 15609) is still in use
HINT: If you're sure there are no old server processes still running, remove the shared memory block or just delete the file "postmaster.pid".

Maybe we should rewrite that HINT --- while it's *possible* that
removing the shmem block or deleting postmaster.pid is the right thing
to do, it's not exactly *likely*. I think we need to put a bit more
emphasis on the "If ..." part. Like "If you are prepared to swear on
your mother's grave that there are no old server processes still
running, consider removing postmaster.pid. But first check for existing
processes again."

Maybe the hint could give an OS-tailored way to check this...

(BTW, I notice that this interlock against starting a new postmaster
appears to be broken in HEAD, which is likely not unrelated to the
fact that the contents of postmaster.pid seem to be totally bollixed
:-()

D'oh! Well, I hope knowing it's a problem gives some kind of glimmer
as to how to solve it :)

Is this worth writing tests for?

Cheers,
David.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#3)
Re: kill -KILL: What happens?

David Fetter <david@fetter.org> writes:

On Thu, Jan 13, 2011 at 10:41:28AM -0500, Tom Lane wrote:

It's just that you're then looking at having to manually clean up the
child processes and then restart the postmaster; a process that is not
only tedious but does offer the possibility of screwing yourself.

Does this mean that there's no cross-platform way to ensure that
killing a process results in its children's timely (i.e. before damage
can occur) death? That such a way isn't practical from a performance
point of view?

The simple, easy, cross-platform solution is this: don't kill -9 the
postmaster. Send it one of the provisioned shutdown signals and let it
kill its children for you.

At least on Unix I don't believe there is any other solution. You
could try looking at ps output but there's a fundamental race condition,
ie the postmaster could spawn another child just before you kill it,
whereupon the child is reassigned to init and there's no longer a good
way to tell that it came from that postmaster.

regards, tom lane
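
For reference, the "provisioned shutdown signals" are SIGTERM (smart shutdown), SIGINT (fast shutdown) and SIGQUIT (immediate shutdown). The following is only a minimal sketch of sending one of them, assuming the postmaster's PID is the first line of postmaster.pid; the helper names are hypothetical, and in practice `pg_ctl stop -m MODE` does the same job.

```python
import os
import signal
from pathlib import Path

# Shutdown modes and the signals that request them.
# SIGKILL is deliberately absent: the postmaster cannot catch it,
# so it gets no chance to shut its children down.
SHUTDOWN_SIGNALS = {
    "smart": signal.SIGTERM,
    "fast": signal.SIGINT,
    "immediate": signal.SIGQUIT,
}


def read_postmaster_pid(pgdata):
    """Return the PID stored on the first line of PGDATA/postmaster.pid."""
    lines = Path(pgdata, "postmaster.pid").read_text().splitlines()
    return int(lines[0].strip())


def shutdown_signal(mode):
    """Map a shutdown mode name to its signal, rejecting anything else."""
    if mode not in SHUTDOWN_SIGNALS:
        raise ValueError(f"unknown shutdown mode: {mode!r}")
    return SHUTDOWN_SIGNALS[mode]


def request_shutdown(pgdata, mode="fast"):
    """Ask the postmaster to shut down and stop its children itself."""
    os.kill(read_postmaster_pid(pgdata), shutdown_signal(mode))
```

A call such as `request_shutdown("/srv/pg/main", "fast")` (the path is illustrative) leaves the cleanup of child processes to the postmaster, which is the whole point of avoiding kill -9.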

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#4)
Re: kill -KILL: What happens?

Tom Lane <tgl@sss.pgh.pa.us> wrote:

At least on Unix I don't believe there is any other solution. You
could try looking at ps output but there's a fundamental race
condition, ie the postmaster could spawn another child just before
you kill it, whereupon the child is reassigned to init and there's
no longer a good way to tell that it came from that postmaster.

Couldn't you run `ps auxf` and kill any postgres process which is
not functioning as postmaster (those are pretty easy to distinguish)
and which isn't the child of such a process? Is there ever a reason
to allow such an orphan to run?

-Kevin

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#5)
Re: kill -KILL: What happens?

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

At least on Unix I don't believe there is any other solution. You
could try looking at ps output but there's a fundamental race
condition, ie the postmaster could spawn another child just before
you kill it, whereupon the child is reassigned to init and there's
no longer a good way to tell that it came from that postmaster.

Couldn't you run `ps auxf` and kill any postgres process which is
not functioning as postmaster (those are pretty easy to distinguish)
and which isn't the child of such a process? Is there ever a reason
to allow such an orphan to run?

That's not terribly hard to do by hand, especially since the cautious
DBA could also do things like checking a process' CWD to verify which
postmaster it had belonged to. I can't see automating it though.
We already have a perfectly good solution to the automated shutdown
problem.

regards, tom lane
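
The by-hand check discussed above can be sketched roughly as follows. This is an illustration only, not a vetted tool. It assumes `ps -eo pid,ppid,args` output, assumes that child processes show a title beginning with "postgres: " while the postmaster shows its real command line, and uses hypothetical helper names. It also does not close the race described earlier in the thread.

```python
import os


def parse_ps(text):
    """Parse `ps -eo pid,ppid,args` output into (pid, ppid, args) tuples."""
    procs = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        pid, ppid, args = line.split(None, 2)
        procs.append((int(pid), int(ppid), args))
    return procs


def is_postmaster(args):
    """Assumption: the postmaster's argv[0] is the postgres executable."""
    return os.path.basename(args.split()[0]) in ("postgres", "postmaster")


def find_orphan_children(procs):
    """Return PIDs of postgres children whose parent is not a postmaster."""
    postmasters = {pid for pid, _, args in procs if is_postmaster(args)}
    return [
        pid
        for pid, ppid, args in procs
        if args.startswith("postgres: ") and ppid not in postmasters
    ]
```

On Linux, a cautious DBA could additionally compare `os.readlink(f"/proc/{pid}/cwd")` against the data directory to confirm which postmaster an orphan belonged to before killing it; that path is Linux-specific.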

#7Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#6)
Re: kill -KILL: What happens?

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I can't see automating it though. We already have a perfectly
good solution to the automated shutdown problem.

Oh, I totally agree with that. I somehow thought we'd gotten off
into how someone could recover after shooting their foot.

-Kevin

#8Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#4)
Re: kill -KILL: What happens?

On Jan 13, 2011, at 19:00, Tom Lane wrote:

At least on Unix I don't believe there is any other solution. You
could try looking at ps output but there's a fundamental race condition,
ie the postmaster could spawn another child just before you kill it,
whereupon the child is reassigned to init and there's no longer a good
way to tell that it came from that postmaster.

Maybe I'm totally confused, but ...

Couldn't normal backends call PostmasterIsAlive and exit if not, just
like the startup process, the stats collector, autovacuum, bgwriter,
walwriter, walreceiver, walsender and the wal archiver already do?

I assumed they do, but now that I grepped the code it seems they don't.

best regards,
Florian Pflug

#9David Fetter
david@fetter.org
In reply to: Kevin Grittner (#7)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 12:45:07PM -0600, Kevin Grittner wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I can't see automating it though. We already have a perfectly
good solution to the automated shutdown problem.

Oh, I totally agree with that. I somehow thought we'd gotten off
into how someone could recover after shooting their foot.

I get that we can't prevent all pilot error, but I was hoping we could
bullet-proof this a little more, especially in light of a certain
extremely popular server OS's OOM killer's default behavior.

Yes, I get that that behavior is crazy, and stupid, and that people
should shut it off, but it *is* our problem if we let the postmaster
start (or continue) when it's set that way.

Cheers,
David.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Pflug (#8)
Re: kill -KILL: What happens?

Florian Pflug <fgp@phlo.org> writes:

Couldn't normal backends call PostmasterIsAlive and exit if not, just
like the startup process, the stats collector, autovacuum, bgwriter,
walwriter, walreceiver, walsender and the wal archiver already do?

I assumed they do, but now that I grepped the code it seems they don't.

That's intentional: they keep going until the user closes the session or
someone sends them a signal to do otherwise. The other various
background processes have to watch PostmasterIsAlive because there is no
session to close.

Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
It sucks because you don't get a signal on parent death. With the
arrival of the latch code, having to check for PostmasterIsAlive
frequently is the only reason for an idle background process to consume
CPU at all.

Another problem with the scheme is that it only works as long as the
background process is providing a *non critical* service. Eventually we
are probably going to need some way for bgwriter/walwriter to stay alive
long enough to service orphaned backends, rather than disappearing
instantly if the postmaster goes away.

regards, tom lane
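
To illustrate why polling for parent death costs CPU, here is a minimal sketch of a poll-based liveness check. It is not PostgreSQL's implementation of PostmasterIsAlive; it assumes the usual Unix behaviour that an orphaned child is reparented, so `getppid()` changes, and the function names and the `has_work` callback are hypothetical.

```python
import os
import time


def parent_is_alive(original_ppid):
    """True while this process still has the parent it started with."""
    return os.getppid() == original_ppid


def wait_for_work(original_ppid, has_work, poll_interval=1.0, max_polls=None):
    """Sleep until there is work, but wake up every poll_interval seconds
    just to check on the parent, because no signal arrives when it dies."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if not parent_is_alive(original_ppid):
            return "parent_died"
        if has_work():
            return "work"
        time.sleep(poll_interval)  # the periodic wakeup is the cost
        polls += 1
    return "timeout"
```

Even with nothing to do, the loop has to wake up once per interval, which is the idle CPU consumption complained about above.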

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#9)
Re: kill -KILL: What happens?

David Fetter <david@fetter.org> writes:

I get that we can't prevent all pilot error, but I was hoping we could
bullet-proof this a little more, especially in light of a certain
extremely popular server OS's OOM killer's default behavior.

Yes, I get that that behavior is crazy, and stupid, and that people
should shut it off, but it *is* our problem if we let the postmaster
start (or continue) when it's set that way.

Packagers who are paying attention have fixed that ;-)

regards, tom lane

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
It sucks because you don't get a signal on parent death.  With the
arrival of the latch code, having to check for PostmasterIsAlive
frequently is the only reason for an idle background process to consume
CPU at all.

What we really need is SIGPARENT. I wonder if the Linux folks would
consider adding such a thing. Might be useful to others as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: kill -KILL: What happens?

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
It sucks because you don't get a signal on parent death. With the
arrival of the latch code, having to check for PostmasterIsAlive
frequently is the only reason for an idle background process to consume
CPU at all.

What we really need is SIGPARENT. I wonder if the Linux folks would
consider adding such a thing. Might be useful to others as well.

That's pretty much a dead-end idea unfortunately; it would never be
portable enough to let us change our system structure to rely on it.
Even more to the point, "go away when the postmaster does" isn't
really the behavior we want anyway. "Go away when the last backend
does" is what we want.

I wonder whether we could have some sort of latch-like counter that
would count the number of active backends and deliver signals when the
count went to zero. However, if the goal is to defend against random
applications of SIGKILL, there's probably no way to make this reliable
in userspace.

Another idea is to have a "postmaster minder" process that respawns the
postmaster when it's killed. The hard part of that is that the minder
can't be connected to shared memory (else its OOM cross-section is just
as big as the postmaster's), and that makes it difficult for it to tell
when all the children have gone away. I suppose it could be coded to
just retry every few seconds until success. This doesn't improve the
behavior of background processes at all, though.

regards, tom lane

#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 13, 2011 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Frankly I'd prefer to get rid of PostmasterIsAlive, not extend its use.
It sucks because you don't get a signal on parent death.  With the
arrival of the latch code, having to check for PostmasterIsAlive
frequently is the only reason for an idle background process to consume
CPU at all.

What we really need is SIGPARENT.  I wonder if the Linux folks would
consider adding such a thing.  Might be useful to others as well.

That's pretty much a dead-end idea unfortunately; it would never be
portable enough to let us change our system structure to rely on it.
Even more to the point, "go away when the postmaster does" isn't
really the behavior we want anyway.  "Go away when the last backend
does" is what we want.

I'm not convinced. I was thinking that we could simply treat it like
SIGQUIT, if it's available. I doubt there's a real use case for
continuing to run queries after the postmaster and all the background
processes are dead. Expedited death seems like much better behavior.
Even checking PostmasterIsAlive() once per query would be reasonable,
except that it'd add a system call to check for a condition that
almost never holds, which I'm not eager to do.

I wonder whether we could have some sort of latch-like counter that
would count the number of active backends and deliver signals when the
count went to zero.  However, if the goal is to defend against random
applications of SIGKILL, there's probably no way to make this reliable
in userspace.

I don't think you can get there 100%. We could, however, make a rule
that when a background process fails a PostmasterIsAlive() check, it
sends SIGQUIT to everyone it can find in the ProcArray, which would at
least ensure a timely exit in most real-world cases.

Another idea is to have a "postmaster minder" process that respawns the
postmaster when it's killed.  The hard part of that is that the minder
can't be connected to shared memory (else its OOM cross-section is just
as big as the postmaster's), and that makes it difficult for it to tell
when all the children have gone away.  I suppose it could be coded to
just retry every few seconds until success.  This doesn't improve the
behavior of background processes at all, though.

It hardly seems worth it. Given a reliable interlock against multiple
postmasters, the real concern is making sure that a half-dead
postmaster gets itself all-dead quickly so that the DBA can start up a
new one before he gets fired.

--
Robert Haas

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#14)
Re: kill -KILL: What happens?

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether we could have some sort of latch-like counter that
would count the number of active backends and deliver signals when the
count went to zero. However, if the goal is to defend against random
applications of SIGKILL, there's probably no way to make this reliable
in userspace.

I don't think you can get there 100%. We could, however, make a rule
that when a background process fails a PostmasterIsAlive() check, it
sends SIGQUIT to everyone it can find in the ProcArray, which would at
least ensure a timely exit in most real-world cases.

You're going in the wrong direction there: we're trying to have the
system remain sane when the postmaster crashes, not see how quickly
it can screw up every remaining session.

BTW, in Unix-land we could maybe rely on SysV semaphores' SEM_UNDO
feature to keep a trustworthy count of how many live processes there
are. But I don't know whether there's anything comparable for Windows.

regards, tom lane

#16Aidan Van Dyk
aidan@highrise.ca
In reply to: Robert Haas (#14)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm not convinced.  I was thinking that we could simply treat it like
SIGQUIT, if it's available.  I doubt there's a real use case for
continuing to run queries after the postmaster and all the background
processes are dead.  Expedited death seems like much better behavior.
Even checking PostmasterIsAlive() once per query would be reasonable,
except that it'd add a system call to check for a condition that
almost never holds, which I'm not eager to do.

If postmaster has a few fds to spare, what about having it open a pipe
to every child it spawns? It never has to read/write to it, but
postmaster closing will signal the client's fd. The client just has
to pop the fd into whatever normal poll/select event handling it uses
to notice when the "parent's pipe" is closed.

A FIFO would allow postmaster to not need as many file handles, and
clients reading the fifo would notice when the writer (postmaster)
closes it.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
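
The pipe idea can be sketched as follows, under stated assumptions: the parent holds the write end and never writes, so the read end only becomes readable when every copy of the write end has been closed, at which point a read returns end-of-file. The function name is hypothetical and this is an illustration, not what PostgreSQL does.

```python
import os
import select


def parent_pipe_closed(read_fd, timeout=0.0):
    """Return True once the write end of the pipe has been closed.

    Nothing is ever written to the pipe, so "readable" can only
    mean end-of-file, which os.read() reports as an empty result.
    """
    readable, _, _ = select.select([read_fd], [], [], timeout)
    if not readable:
        return False
    return os.read(read_fd, 1) == b""
```

In a real parent/child setup the child must close its own inherited copy of the write end right after fork(); otherwise the pipe never reaches end-of-file. The read fd can then sit in the child's existing select/poll set alongside its other descriptors, so no periodic wakeup is needed.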

#17Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: kill -KILL: What happens?

On Thu, Jan 13, 2011 at 3:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 13, 2011 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether we could have some sort of latch-like counter that
would count the number of active backends and deliver signals when the
count went to zero.  However, if the goal is to defend against random
applications of SIGKILL, there's probably no way to make this reliable
in userspace.

I don't think you can get there 100%.  We could, however, make a rule
that when a background process fails a PostmasterIsAlive() check, it
sends SIGQUIT to everyone it can find in the ProcArray, which would at
least ensure a timely exit in most real-world cases.

You're going in the wrong direction there: we're trying to have the
system remain sane when the postmaster crashes, not see how quickly
it can screw up every remaining session.

I strongly believe you're in the minority on that one, for the same
reasons that I don't think most people would agree with your notion of
what should be the default shutdown mode. A database that can't
accept new connections is a liability, not an asset.

--
Robert Haas

#18Florian Pflug
fgp@phlo.org
In reply to: Aidan Van Dyk (#16)
Re: kill -KILL: What happens?

On Jan 13, 2011, at 21:01, Aidan Van Dyk wrote:

On Thu, Jan 13, 2011 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm not convinced. I was thinking that we could simply treat it like
SIGQUIT, if it's available. I doubt there's a real use case for
continuing to run queries after the postmaster and all the background
processes are dead. Expedited death seems like much better behavior.
Even checking PostmasterIsAlive() once per query would be reasonable,
except that it'd add a system call to check for a condition that
almost never holds, which I'm not eager to do.

If postmaster has a few fds to spare, what about having it open a pipe
to every child it spawns? It never has to read/write to it, but
postmaster closing will signal the client's fd. The client just has
to pop the fd into whatever normal poll/select event handling it uses
to notice when the "parent's pipe" is closed.

I just started to experiment with that idea, and wrote a small test
program to check if that'd work. I'll post the results when I'm done.

best regards,
Florian Pflug

#19Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#17)
Re: kill -KILL: What happens?

Robert Haas <robertmhaas@gmail.com> wrote:

A database that can't accept new connections is a liability, not
an asset.

+1

I have so far been unable to imagine a use case for the production
databases I use where I would prefer to see backends continue after
postmaster failure.

-Kevin

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#17)
Re: kill -KILL: What happens?

Robert Haas <robertmhaas@gmail.com> writes:

I strongly believe you're in the minority on that one, for the same
reasons that I don't think most people would agree with your notion of
what should be the default shutdown mode. A database that can't
accept new connections is a liability, not an asset.

Killing active sessions when it's not absolutely necessary is not an
asset.

regards, tom lane
