How to shoot yourself in the foot: kill -9 postmaster

bright@wintelcom.net

about 25 years ago

In reply to: Tom Lane (#1)

Re: How to shoot yourself in the foot: kill -9 postmaster

* Tom Lane <tgl@sss.pgh.pa.us> [010305 14:51] wrote:

I think we need a stronger interlock to prevent this scenario, but I'm
unsure what it should be. Ideas?

Re having multiple postmasters active by accident.

The sysV IPC stuff has some hooks in it that may help you.

One idea is to check the 'shm_nattch' field of 'struct shmid_ds':
at startup, if it's not 1 (or 0), then more than one PostgreSQL
instance is attached to the segment and the new one should not proceed.

I'd also suggest looking into sysV semaphores and the SEM_UNDO
mechanism; afaik it can be used to track the number of consumers of
a resource.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
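
A rough shell-level illustration of the attach-count idea above. The proposal itself concerns the C-level 'shm_nattch' field filled in by shmctl(IPC_STAT); this sketch only reads the same count from `ipcs -m`, assuming the Linux column order (key, shmid, owner, perms, bytes, nattch, status). The function names and the sample listing are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the attach count (nattch) for one shmid.
# Reads `ipcs -m` style output on stdin; assumes the Linux column order
# key, shmid, owner, perms, bytes, nattch, status.
nattch_for_shmid() {
    awk -v id="$1" '$2 == id { print $6 }'
}

# Hypothetical check: succeed only when the segment is listed and nothing
# is attached to it. An shmid that is not listed counts as "not verified".
segment_is_unused() {
    count=$(nattch_for_shmid "$1")
    [ -n "$count" ] && [ "$count" -eq 0 ]
}

# Made-up listing standing in for real `ipcs -m` output.
sample_listing() {
    cat <<'EOF'
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0052e2c1 32768      postgres   600        1441792    3
0x00000000 65537      someuser   600        393216     0          dest
EOF
}
```

On a live system the check would presumably be run as `ipcs -m | segment_is_unused "$shmid"` before starting a new postmaster; how the shmid is found is left open here.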

lamar.owen@wgcr.org

about 25 years ago

In reply to: Tom Lane (#1)

Re: How to shoot yourself in the foot: kill -9 postmaster

Tom Lane wrote:

checkpoint record. Clueless admins who resort to kill -9 as a routine
admin tool *will* lose their databases. Moreover, the init scripts
that are running around now are dangerous weapons if used with 7.1.

Thanks for the headsup, Tom. Time to nix killproc and do something
cleaner -- compatible, but cleaner. I'll have to research what the
defaults are for later RH's -- but, as 6.1 is one of my target platforms
at this time, I have to fix that issue for sure.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Lamar Owen (#4)

Re: How to shoot yourself in the foot: kill -9 postmaster

Lamar Owen <lamar.owen@wgcr.org> writes:

Thanks for the headsup, Tom. Time to nix killproc and do something
cleaner -- compatible, but cleaner.

As far as I could tell from the 6.1 scripts, it would work to do

killproc postmaster -TERM

The problem is just that killproc has an overenthusiastic default...

regards, tom lane

bruce@momjian.us

about 25 years ago

In reply to: Lamar Owen (#4)

Re: How to shoot yourself in the foot: kill -9 postmaster

killproc should send a kill -15 to the process, wait a few seconds for
it to exit. If it does not, try kill -1, and if that doesn't kill it,
then kill -9.

Tom Lane wrote:

checkpoint record. Clueless admins who resort to kill -9 as a routine
admin tool *will* lose their databases. Moreover, the init scripts
that are running around now are dangerous weapons if used with 7.1.

Thanks for the headsup, Tom. Time to nix killproc and do something
cleaner -- compatible, but cleaner. I'll have to research what the
defaults are for later RH's -- but, as 6.1 is one of my target platforms
at this time, I have to fix that issue for sure.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

bruce@momjian.us

about 25 years ago

In reply to: Tom Lane (#5)

Re: How to shoot yourself in the foot: kill -9 postmaster

Lamar Owen <lamar.owen@wgcr.org> writes:

Thanks for the headsup, Tom. Time to nix killproc and do something
cleaner -- compatible, but cleaner.

As far as I could tell from the 6.1 scripts, it would work to do

killproc postmaster -TERM

Yes, amazing it has a -9 default.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Bruce Momjian (#6)

Re: How to shoot yourself in the foot: kill -9 postmaster

Bruce Momjian <pgman@candle.pha.pa.us> writes:

killproc should send a kill -15 to the process, wait a few seconds for
it to exit. If it does not, try kill -1, and if that doesn't kill it,
then kill -9.

Tell it to the Linux people ... this is their boot-script code we're
talking about.

regards, tom lane

lamar.owen@wgcr.org

about 25 years ago

In reply to: Bruce Momjian (#6)

Re: How to shoot yourself in the foot: kill -9 postmaster

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

killproc should send a kill -15 to the process, wait a few seconds for
it to exit. If it does not, try kill -1, and if that doesn't kill it,
then kill -9.

Tell it to the Linux people ... this is their boot-script code we're
talking about.

RedHat, in particular. I can't vouch for any others.

On my RH 6.2 box, with initscripts-5.00-1 loaded, here's what killproc
does if no killlevel is set (even though a default $killlevel is set to
-9, it's not used in this code):
($pid is the pid of the proc to kill, $base is the name of the proc,
etc)

    if [ "$notset" = "1" ] ; then
        if ps h $pid>/dev/null 2>&1; then
            # TERM first, then KILL if not dead
            kill -TERM $pid
            usleep 100000
            if ps h $pid >/dev/null 2>&1 ; then
                sleep 1
                if ps h $pid >/dev/null 2>&1 ; then
                    sleep 3
                    if ps h $pid >/dev/null 2>&1 ; then
                        kill -KILL $pid
                    fi
                fi
            fi
        fi
        ps h $pid >/dev/null 2>&1
        RC=$?
        [ $RC -eq 0 ] && failure "$base shutdown" || success "$base shutdown"
        RC=$((! $RC))
    # use specified level only
    else
        if ps h $pid >/dev/null 2>&1; then
            kill $killlevel $pid
            RC=$?
            [ $RC -eq 0 ] && success "$base $killlevel" || failure "$base $killlevel"
        fi
    fi

Is 6.1 this different from 6.2? This code on the surface seems
reasonable to me -- am I missing something? The 6.2 code (found in
/etc/rc.d/init.d/functions, for those who might not know where to find
killproc) sets a default killlevel but never uses it -- ignorant but not
stupid.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#10

bruce@momjian.us

about 25 years ago

In reply to: Lamar Owen (#9)

Re: How to shoot yourself in the foot: kill -9 postmaster

    if [ "$notset" = "1" ] ; then
        if ps h $pid>/dev/null 2>&1; then
            # TERM first, then KILL if not dead
            kill -TERM $pid
            usleep 100000
            if ps h $pid >/dev/null 2>&1 ; then
                sleep 1
                if ps h $pid >/dev/null 2>&1 ; then
                    sleep 3
                    if ps h $pid >/dev/null 2>&1 ; then
                        kill -KILL $pid
                    fi
                fi
            fi
        fi

Yes, this seems like the proper way to do it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

#11

Nathan Myers

ncm@zembu.com

about 25 years ago

In reply to: Tom Lane (#8)

Re: How to shoot yourself in the foot: kill -9 postmaster

On Mon, Mar 05, 2001 at 08:55:41PM -0500, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

killproc should send a kill -15 to the process, wait a few seconds for
it to exit. If it does not, try kill -1, and if that doesn't kill it,
then kill -9.

Tell it to the Linux people ... this is their boot-script code we're
talking about.

Not to be a zealot, but this isn't _Linux_ boot-script code, it's
_Red Hat_ boot-script code. Red Hat would like for us all to confuse
the two, but they jes' ain't the same. (As a rule of thumb, where it
works right, credit Linux; where it doesn't, blame Red Hat. :-)

Nathan Myers
ncm@zembu.com

#12

Hiroshi Inoue

Inoue@tpf.co.jp

about 25 years ago

In reply to: Tom Lane (#1)

Re: How to shoot yourself in the foot: kill -9 postmaster

Tom Lane wrote:

Now, killing the postmaster -9 and not cleaning up the backends has
always been a good way to shoot yourself in the foot, but up to now the
worst thing that was likely to happen to you was isolated corruption in
specific tables. In the brave new world of WAL the stakes are higher,
because the system will refuse to start up if it finds a corrupted
checkpoint record. Clueless admins who resort to kill -9 as a routine
admin tool *will* lose their databases. Moreover, the init scripts
that are running around now are dangerous weapons if used with 7.1.

I think we need a stronger interlock to prevent this scenario, but I'm
unsure what it should be. Ideas?

Seems the simplest way is to inhibit starting postmaster
if the pid file exists.
Another way is to use flock() if flock() is available.
We could flock() the pid file so that another postmaster
could detect the lock of the file.

Regards,
Hiroshi Inoue

#13

lamar.owen@wgcr.org

about 25 years ago

In reply to: Bruce Momjian (#10)

Re: How to shoot yourself in the foot: kill -9 postmaster

Bruce Momjian wrote:

# TERM first, then KILL if not dead

Yes, this seems like the proper way to do it.

Now to verify that 6.1 is the same....or different.... Hmmmm.... The
mirrors of ftp.redhat.com (and, in fact, RedHat.com itself) no longer
have the updates or the original for 6.1's initscripts-4.70 package.
Can a RedHat 6.1 user (using as close as possible to 6.1's release
initscripts package) send me a copy of /etc/rc.d/init.d/functions, or
verify how that initscripts package defines killproc? I cannot at this
moment locate my RH 6.1 SRPMS CD. Found my RH _4_.1 CD, but that's just
a _little_ old :-).
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#14

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Hiroshi Inoue (#12)

Re: How to shoot yourself in the foot: kill -9 postmaster

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

Tom Lane wrote:

I think we need a stronger interlock to prevent this scenario, but I'm
unsure what it should be. Ideas?

Seems the simplest way is to inhibit starting postmaster
if the pid file exists.

Then we're unable to recover from a crash without manual intervention.

The tricky part of this is not to give up the ability to restart when
there *has* been a crash.

Another way is to use flock() if flock() is available.
We could flock() the pid file so that another postmaster
could detect the lock of the file.

This would only work if every backend is holding flock on the file,
which would mean they'd all have to keep it open all the time. Kind
of annoying to use up that many file descriptors on it. Might be the
best answer though; I haven't thought of anything I like better...

regards, tom lane
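
One way to picture the tension between "refuse to start if the pid file exists" and "still restart after a crash" is a staleness check on the pid file. This is only a sketch under stated assumptions, not what the postmaster does: the function name is hypothetical, and it assumes the first line of the pid file holds the postmaster's pid. It looks only at the postmaster's own pid, so it says nothing about backends that may still be running.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when starting looks permissible,
# i.e. the pid file is missing, unreadable as a pid, or names a process
# that no longer exists. Assumes line 1 of the file is the postmaster pid.
pidfile_allows_start() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0        # no pid file: nothing to check
    pid=$(head -n 1 "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 0 ;;         # empty or non-numeric: treated as stale here
    esac
    # kill -0 delivers no signal; it only tests whether the pid can be signalled.
    # It also fails when permission is denied, which this sketch does not distinguish.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                         # a process with that pid is alive: refuse
    fi
    return 0                             # stale file, e.g. left behind by a crash
}
```

A start script could call `pidfile_allows_start "$PGDATA/postmaster.pid" || exit 1` before launching. Pid reuse by an unrelated process would make this refuse wrongly, which is one reason a lock-based interlock is being discussed in the thread.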

#15

lamar.owen@wgcr.org

about 25 years ago

In reply to: Bruce Momjian (#6)

Re: How to shoot yourself in the foot: kill -9 postmaster

Nathan Myers wrote:

Not to be a zealot, but this isn't _Linux_ boot-script code, it's
_Red Hat_ boot-script code. Red Hat would like for us all to confuse
the two, but they jes' ain't the same. (As a rule of thumb, where it
works right, credit Linux; where it doesn't, blame Red Hat. :-)

So we're going to credit Linux for PostgreSQL being shipped as part of
the RedHat distribution since RH 5.0, then? :-0
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#16

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Lamar Owen (#9)

Re: How to shoot yourself in the foot: kill -9 postmaster

Lamar Owen <lamar.owen@wgcr.org> writes:

Is 6.1 this different from 6.2?

Scott sent me a copy of /etc/init.d/functions from his box, and it has
largely the same behavior (I hadn't read the whole code to notice that
it doesn't use the default killlevel...). What's actually happening
here is that the init script sends SIGTERM, and then SIGKILL four
seconds later if the postmaster hasn't shut down yet. Unfortunately,
unless your clients are very short-lived, four seconds isn't going to
be enough for a "polite" shutdown. (It's pretty marginal even for
an impolite one, since a checkpoint will take at least a couple of
seconds.)

However, with an explicit kill level that doesn't happen: you get one
signal of the specified value, no more. Possibly it would be better for
the init script to send SIGINT (forcibly disconnect clients) instead of
SIGTERM, however. So I'm now leaning to "killproc postmaster -INT".

regards, tom lane
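
For reference, a small sketch of how the postmaster's shutdown signals line up with the shutdown modes described in the PostgreSQL documentation: SIGTERM waits for clients to disconnect, SIGINT disconnects them and then shuts down, and SIGQUIT quits at once without a clean shutdown. The function name `signal_for_mode` is an illustrative choice, not part of any script mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: map a shutdown mode to the signal name to send.
signal_for_mode() {
    case $1 in
        smart)     echo TERM ;;   # wait for clients to disconnect on their own
        fast)      echo INT  ;;   # forcibly disconnect clients, then shut down
        immediate) echo QUIT ;;   # quit without a clean shutdown; recovery runs on restart
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

With this, the "killproc postmaster -INT" form discussed above could be spelled `killproc postmaster "-$(signal_for_mode fast)"`.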

#17

lamar.owen@wgcr.org

about 25 years ago

In reply to: Tom Lane (#1)

Re: How to shoot yourself in the foot: kill -9 postmaster

Tom Lane wrote:

The tricky part of this is not to give up the ability to restart when
there *has* been a crash.

But kill -9 effectively _is_ an admin-initiated crash.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#18

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Lamar Owen (#17)

Re: How to shoot yourself in the foot: kill -9 postmaster

Lamar Owen <lamar.owen@wgcr.org> writes:

Tom Lane wrote:

The tricky part of this is not to give up the ability to restart when
there *has* been a crash.

But kill -9 effectively _is_ an admin-initiated crash.

Yeah, but only a partial crash. If the admin finishes the job by
killing the backends too, we're fine. Postmaster down, backends alive
is not a scenario we're currently prepared for. We need a way to plug
that gap.

regards, tom lane

#19

lamar.owen@wgcr.org

about 25 years ago

In reply to: Bruce Momjian (#6)

Re: How to shoot yourself in the foot: kill -9 postmaster

Tom Lane wrote:

However, with an explicit kill level that doesn't happen: you get one
signal of the specified value, no more. Possibly it would be better for
the init script to send SIGINT (forcibly disconnect clients) instead of
SIGTERM, however. So I'm now leaning to "killproc postmaster -INT".

Ok, since I can't seem to count on killproc's exact behavior, istm that
I can:
    killproc postmaster -INT
    wait some number of seconds
    if postmaster still up
        killproc postmaster -TERM
        wait some number of seconds
        if postmaster STILL up
            killproc postmaster   # and let the grim reaper do its dirty work.

After all, the system shutdown is relying on this script to properly and
thoroughly shut things down, or it WILL do the 'kill -9
pid-of-postmaster' for you.

Now, what's a good delay here? Or is there a better metric than a
simple delay? After all, I want to avoid the kill -9 unless we have an
emergency hard lock situation -- what's a good indicator of the backend
fleet of processes actually _doing_ something? Or should I key on an
indicator of processor speed (Linux does provide a nice bogus metric
known as BogoMIPS for such a purpose)? The last thing I want to do is
wait too long on some platforms and not long enough on others.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#20

tgl@sss.pgh.pa.us

about 25 years ago

In reply to: Lamar Owen (#19)

Re: How to shoot yourself in the foot: kill -9 postmaster

Lamar Owen <lamar.owen@wgcr.org> writes:

The last thing I want to do is
wait too long on some platforms and not long enough on others.

The difficulty is to know how long the final checkpoint will take.
This depends on (at least) your hard disk speed and the number of
dirty buffers, so I think you're going to have some difficulty
estimating it with any reliability. BogoMIPS won't help, for sure.

However, if you do SIGINT and then wait a few seconds, you can be fairly
sure that all the extant backends are dead (if not frozen up...) and
that the checkpoint is in progress. That may be about the best you can
do.

I do not agree that this script should take it on itself to kill -9 the
postmaster. Please note that the reason we're having this discussion at
all is that the init script may be used for purposes other than system
shutdown. So the argument that "it's going to happen anyway" is wrong.

regards, tom lane
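
The approach described in this reply (send one signal, wait, and never escalate to kill -9 from the script) could be sketched roughly as follows. The function name and its arguments are hypothetical, and the timeout is whatever the caller picks; as noted above, there is no reliable way to estimate how long the final checkpoint will take.

```shell
#!/bin/sh
# Hypothetical helper: send one signal, then poll once per second until the
# process is gone or the timeout (in seconds) runs out. Never sends KILL.
# Usage: stop_and_wait PID SIGNAL TIMEOUT
stop_and_wait() {
    pid=$1 sig=$2 timeout=$3
    # If the signal cannot be delivered, assume there is nothing to stop
    # (typically because the pid no longer exists).
    kill "-$sig" "$pid" 2>/dev/null || return 0
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        # Still running after the timeout: report failure, do not kill -9.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

An init script might then run something like `stop_and_wait "$pid" INT 60 || echo "postmaster still shutting down" >&2`, where 60 is an arbitrary figure, and leave any harsher measure to the administrator.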
