windows doesn't notice backend death

Started by Andrew Dunstanalmost 17 years ago27 messageshackers
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

While trying to test Tom's theory about Windows not being able to
reinitialise shared memory when a backend crashes, I was doing some
testing on my Vista box. I killed a backend using the task manager, and
the postmaster never noticed. pg_cancel_backend() reported correctly
that the process didn't exist, but pg_stat_activity still had it listed
as IDLE, and the server certainly didn't re-initialise as it would have
on Unix. This seems a fairly serious bug.

cheers

andrew

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: windows doesn't notice backend death

Andrew Dunstan <andrew@dunslane.net> writes:

... I killed a backend using the task manager, and
the postmaster never noticed.

Ick. Is it possible that the postmaster did get a report, but thought
it was normal session termination? If so, how could we distinguish?

regards, tom lane

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: windows doesn't notice backend death

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... I killed a backend using the task manager, and
the postmaster never noticed.

Ick. Is it possible that the postmaster did get a report, but thought
it was normal session termination? If so, how could we distinguish?

If that were the case then it would not have the dead process still
listed as a live backend, ISTM, which it does.

cheers

andrew

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: windows doesn't notice backend death

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Ick. Is it possible that the postmaster did get a report, but thought
it was normal session termination? If so, how could we distinguish?

If that were the case then it would not have the dead process still
listed as a live backend, ISTM, which it does.

The postmaster does not control the content of the pg_stat_activity
view.

regards, tom lane

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: windows doesn't notice backend death

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Ick. Is it possible that the postmaster did get a report, but thought
it was normal session termination? If so, how could we distinguish?

If that were the case then it would not have the dead process still
listed as a live backend, ISTM, which it does.

The postmaster does not control the content of the pg_stat_activity
view.

Well, I'm not I know how to find out the answer to your question. I
could try attaching a debugger to the postmaster - if I knew where to
put a breakpoint.

cheers

andrew

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#5)
Re: windows doesn't notice backend death

Andrew Dunstan wrote:

Well, I'm not I know how to find out the answer to your question. I
could try attaching a debugger to the postmaster - if I knew where to
put a breakpoint.

Did you try reaper()? It's the SIGCHLD handler.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#5)
Re: windows doesn't notice backend death

Andrew Dunstan wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Ick. Is it possible that the postmaster did get a report, but thought
it was normal session termination? If so, how could we distinguish?

If that were the case then it would not have the dead process still
listed as a live backend, ISTM, which it does.

The postmaster does not control the content of the pg_stat_activity
view.

Well, I'm not I know how to find out the answer to your question. I
could try attaching a debugger to the postmaster - if I knew where to
put a breakpoint.

I'd start at pgwin32_deadchild_callback(). That's where the waiting
thread should activate once a child goes away.

That one should post a notice to the win32ChildQueue, which is polled in
win32_waitpid() - that's a second good point for a breakpoint.

This in turn should make things happen up in reaper() - as Alvaro
suggests, a third good place for a breakpoint.

FWIW, this certainly used to work. So we've either broken this recently,
or it's always been broken on Vista (I've never tried it myself on
Vista, only 2000, XP and 2003).

//Magnus

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#7)
Re: windows doesn't notice backend death

Magnus Hagander <magnus@hagander.net> writes:

FWIW, this certainly used to work. So we've either broken this recently,
or it's always been broken on Vista (I've never tried it myself on
Vista, only 2000, XP and 2003).

Maybe a quick check if it still works on non-Vista versions would be
in order, to eliminate one or the other of those theories.

regards, tom lane

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: windows doesn't notice backend death

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

FWIW, this certainly used to work. So we've either broken this recently,
or it's always been broken on Vista (I've never tried it myself on
Vista, only 2000, XP and 2003).

Maybe a quick check if it still works on non-Vista versions would be
in order, to eliminate one or the other of those theories.

Well, I can tell you that it is getting an exit code of 1, which is why
the postmaster isn't restarting.

That raises two questions in my mind. First, is that the behaviour we
expect when we kill the backend this way? And second, why is it still
showing up in the output of pg_stat_activity?

Meanwhile, I guess I can try to create a loadable function that will
crash a backend badly so I can try to reproduce my client's problem.

cheers

andrew

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#9)
Re: windows doesn't notice backend death

Andrew Dunstan <andrew@dunslane.net> writes:

Well, I can tell you that it is getting an exit code of 1, which is why
the postmaster isn't restarting.

Blech. Count on Windows to find a way to break things.

That raises two questions in my mind. First, is that the behaviour we
expect when we kill the backend this way? And second, why is it still
showing up in the output of pg_stat_activity?

Well, if the process is being hard-killed without an opportunity to run
through proc_exit(), then yes it is going to still show up in
pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry.

The problem here is that we need to be able to distinguish a task
manager kill from a voluntary exit(1). Have M$ really been stupid
enough to make an external kill look just like an exit() call?

regards, tom lane

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: windows doesn't notice backend death

I wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Well, I can tell you that it is getting an exit code of 1, which is why
the postmaster isn't restarting.

Blech. Count on Windows to find a way to break things.

I reflected on this a bit more. Even if we find a way around this
particular task-manager behavior, it seems to me there is a generic
problem here. If some bit of clueless code does exit(0) or exit(1)
inside a backend session, the postmaster will think everything is fine,
but actually we have an un-cleaned-up session that's probably still
holding locks etc. It's fairly easy to demonstrate the issue:

pl_regression=# create language plperlu;
CREATE LANGUAGE
pl_regression=# create or replace function trouble() returns void as
pl_regression-# $$ exit 0; $$ language plperlu;
CREATE FUNCTION
pl_regression=# select trouble();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
pl_regression=# select * from pg_stat_activity;
datid | datname | procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port
-------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
40179 | pl_regression | 20847 | 10 | tgl | select trouble(); | f | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 | | -1
40179 | pl_regression | 20855 | 10 | tgl | select * from pg_stat_activity; | f | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 | | -1
(2 rows)

Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable case
to be made that this is an obvious failure mode and we should put a bit
more effort into being robust against it. With more and more external
code being routinely run in the backend, who wants to swear that there
is no "exit(1)" in the guts of libperl or libxml or whatever?

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else. If the postmaster sees
this flag still set after backend exit, then it treats the backend as
having crashed regardless of what the reported exit code is.
We could implement this via an array of sig_atomic_t in shared memory,
so as to minimize the postmaster's entanglement with shared memory
(it'd be no worse than the old WIN32-specific child pid arrays).

Or maybe there's a better way. Thoughts?

regards, tom lane

#12Justin
justin@emproshunts.com
In reply to: Tom Lane (#10)
Re: windows doesn't notice backend death

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Tom Lane wrote:
<blockquote cite="mid:28338.1241374442@sss.pgh.pa.us" type="cite">
<pre wrap=""> Have M$ really been stupid
enough to make an external kill look just like an exit() call?

regards, tom lane

</pre>
</blockquote>
<font face="Arial">kind of :-(<br>
<br>
Everything i have read seems to point the Task Manager calls
TerminateProcess() in the kernel32.dll and passes a 1 setting the
exitcode to 1.&nbsp; I have not found anything clearly stating that, yet
logic points that is what its doing.&nbsp; <br>
<br>
<a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx&quot;&gt;http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx&lt;/a&gt;&lt;br&gt;
<br>
Would it not be easy to set the normal exitcode to something other than
1 to see the difference <br>
ExitProcess()<br>
<a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx&quot;&gt;http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx&lt;/a&gt;&lt;br&gt;
<br>
<br>
&nbsp;this article maybe helpful.&nbsp; <br>
<a class="moz-txt-link-freetext" href="http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx&quot;&gt;http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx&lt;/a&gt;&lt;br&gt;
<br>
BTW tested on Windows 2003 SP1&nbsp; this problem shows up in Postgresql
8.3.3<br>
</font>
</body>
</html>

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Justin (#12)
Re: windows doesn't notice backend death

justin wrote:

Tom Lane wrote:

Have M$ really been stupid
enough to make an external kill look just like an exit() call?

regards, tom lane

kind of :-(

Would it not be easy to set the normal exitcode to something other
than 1 to see the difference
ExitProcess()

Not really, as Tom showed later this is an example of a more general
problem. I think his solution of detecting when backends have cleaned up
nicely and when they have not is the right way to go.

cheers

andrew

#14Justin
justin@emproshunts.com
In reply to: Andrew Dunstan (#13)
Re: windows doesn't notice backend death

Andrew Dunstan wrote:

justin wrote:

Would it not be easy to set the normal exitcode to something other
than 1 to see the difference
ExitProcess()

Not really, as Tom showed later this is an example of a more general
problem. I think his solution of detecting when backends have cleaned
up nicely and when they have not is the right way to go.

cheers

andrew

Stupid thought why can the some clueless code set the exit status to
crashed status??? Would it not be more prudent to remove that ability???

#15Andrew Dunstan
andrew@dunslane.net
In reply to: Justin (#14)
Re: windows doesn't notice backend death

justin wrote:

Stupid thought why can the some clueless code set the exit status to
crashed status??? Would it not be more prudent to remove that ability???

You're missing the point. The danger comes from code that we don't control.

cheers

andrew

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: windows doesn't notice backend death

I wrote:

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else. If the postmaster sees
this flag still set after backend exit, then it treats the backend as
having crashed regardless of what the reported exit code is.

Another thought that came to mind: we could set up an atexit hook that
does all the work that proc_exit() currently does, and reduce
proc_exit() itself to just an exit() call. psql already relies on
having atexit (or on_exit) so this doesn't appear to add any new
portability issues.

This will probably not fix the Vista taskmanager issue, since I'll
bet it's not running atexit hooks anyway. What it would do is improve
the situation so that a "clueless" exit() call would be no worse than
elog(FATAL), rather than triggering a DB-wide restart as the dead man
switch would do.

regards, tom lane

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#11)
Re: windows doesn't notice backend death

Tom Lane wrote:

Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable case
to be made that this is an obvious failure mode and we should put a bit
more effort into being robust against it. With more and more external
code being routinely run in the backend, who wants to swear that there
is no "exit(1)" in the guts of libperl or libxml or whatever?

FWIW there is (or there was, last time I looked) an exit(1) call in the
guts of the PHP library that PL/php uses, which is triggered when the
memory used goes over the configured memory limit. It was very easily
triggered with some of the test functions we had on our regression
tests, and the only solution was to kludge up the limit.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#17)
Re: windows doesn't notice backend death

Alvaro Herrera <alvherre@commandprompt.com> writes:

FWIW there is (or there was, last time I looked) an exit(1) call in the
guts of the PHP library that PL/php uses, which is triggered when the
memory used goes over the configured memory limit. It was very easily
triggered with some of the test functions we had on our regression
tests, and the only solution was to kludge up the limit.

I don't think we'll be able to prevent PHP from doing that :-(. But
it now seems clear that we should try to make the database as a whole
recover with some degree of grace. I'll go work up a patch.

regards, tom lane

#19Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#8)
Re: windows doesn't notice backend death

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

FWIW, this certainly used to work. So we've either broken this recently,
or it's always been broken on Vista (I've never tried it myself on
Vista, only 2000, XP and 2003).

Maybe a quick check if it still works on non-Vista versions would be
in order, to eliminate one or the other of those theories.

I'll have to take this back.

I just tested back to 8.2.latest on XP, and it has the same behavior. We
don't notice a "kill -9" from the task manager.

If it worked, it's either <8.2, or it's something that changed with a
service pack to XP as well.

//Magnus

#20Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#10)
Re: windows doesn't notice backend death

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Well, I can tell you that it is getting an exit code of 1, which is why
the postmaster isn't restarting.

Blech. Count on Windows to find a way to break things.

Yup. A quick search gives this:
http://support.microsoft.com/kb/155075

which clearly states that task manager simply uses TerminateProcess().
And it's probably hard-coded with an exit code.

Which really means this can't have worked, and my memory was wrong.

I still claim it works when the process actually *crashes* though -
since that will give you one of those funky huge exit numbers.

That raises two questions in my mind. First, is that the behaviour we
expect when we kill the backend this way? And second, why is it still
showing up in the output of pg_stat_activity?

Well, if the process is being hard-killed without an opportunity to run
through proc_exit(), then yes it is going to still show up in
pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry.

That's certainly what task manager does. TerminateProcess() just stops
all threads and removes the process. No chance for atexit() stuff to run.

The problem here is that we need to be able to distinguish a task
manager kill from a voluntary exit(1). Have M$ really been stupid
enough to make an external kill look just like an exit() call?

It certainly looks that way. It probably goes back to the "use threads,
not processes, on this platform" design...

//Magnus

#21Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#13)
#22Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#16)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#18)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#24)
#26Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#25)
#27Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#25)