Stats collection on Windows

Started by Peter Brantalmost 20 years ago33 messages
#1Peter Brant
Peter.Brant@wicourts.gov

Hi all,

I think I've found the cause (or one of the causes) why stats
collection is unreliable on Windows and I'm wondering about the best way
to go about fixing it.

The problem is that process IDs on Windows seem to be assigned without
much rhyme or reason and it seems to happen relatively frequently that a
new process will be assigned the same process ID as a process which
recently died. If this happens before the backend has been expired out
of pgstat.c's pgStatBeDead hash, the backend will be missed.

I was thinking the postmaster could maintain a backend sequence number
with similar semantics to a UNIXish process ID which could then be used
as the key for pgStatBeDead instead of the actual process ID. Does that
sound reasonable?

Peter

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Brant (#1)
Re: Stats collection on Windows

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

I think I've found the cause (or one of the causes) why stats
collection is unreliable on Windows and I'm wondering about the best way
to go about fixing it.

The problem is that process IDs on Windows seem to be assigned without
much rhyme or reason and it seems to happen relatively frequently that a
new process will be assigned the same process ID as a process which
recently died. If this happens before the backend has been expired out
of pgstat.c's pgStatBeDead hash, the backend will be missed.

That's an interesting theory, but do you have any actual evidence for it?
The evidence I've seen says that our big problem on Windows is the stats
collector process just quitting due to unexplained piperead() failures.

(I mean, I'd love to blame Microsoft for everything, but even the
Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)

regards, tom lane

#3Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Peter Brant (#1)
Re: Stats collection on Windows

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

The problem is that process IDs on Windows seem to be assigned without
much rhyme or reason and it seems to happen relatively frequently that a
new process will be assigned the same process ID as a process which
recently died.

That's an interesting theory, but do you have any actual evidence for it?

I can confirm that on Windows 2000 the process ID is recycled instantly.

Regards,
Qingqing

#4Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Peter Brant (#1)
Re: Stats collection on Windows

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)

Can you explain more of this? IMHO, if we rely on feature like this, the
difference is unstable-every-day vs. unstable-every-year.

Regards,
Qingqing

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#4)
Re: Stats collection on Windows

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)

Can you explain more of this? IMHO, if we rely on feature like this, the
difference is unstable-every-day vs. unstable-every-year.

The mere existence of the kill() primitive should bring to mind reasons
why it's a bad idea.

regards, tom lane

#6Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#5)
Re: Stats collection on Windows

Redmond crowd should be able to figure out that recycling

process IDs

instantly would be a stupid idea...)

Can you explain more of this? IMHO, if we rely on feature

like this,

the difference is unstable-every-day vs. unstable-every-year.

The mere existence of the kill() primitive should bring to
mind reasons why it's a bad idea.

Except the kill() primitive *does not exist* on windows.

That said, how did you go about to confirm that the pid is recycled
instantly? I was under the impression that it assignes any unused pid in
random order, which is also what a quick glance at my XP box looks like
(don't have a 2000 box around, but I wasn't aware of such a change
between those - but it's certainly not impossible). But if oyu had some
better method of determining it, please let me know :-)

If that's how, several other OSes do the same thing AFAIK - for security
reasons. For example OpenBSD. So if we rely heavily on that, we may be
in trouble elsewhere as well.

//Magnus

#7Noname
mark@mark.mielke.cc
In reply to: Tom Lane (#2)
Re: Stats collection on Windows

On Tue, Apr 04, 2006 at 11:02:11PM -0400, Tom Lane wrote:

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

I think I've found the cause (or one of the causes) why stats
collection is unreliable on Windows and I'm wondering about the best way
to go about fixing it.
The problem is that process IDs on Windows seem to be assigned without
much rhyme or reason and it seems to happen relatively frequently that a
new process will be assigned the same process ID as a process which
recently died. If this happens before the backend has been expired out
of pgstat.c's pgStatBeDead hash, the backend will be missed.

That's an interesting theory, but do you have any actual evidence for it?
The evidence I've seen says that our big problem on Windows is the stats
collector process just quitting due to unexplained piperead() failures.
(I mean, I'd love to blame Microsoft for everything, but even the
Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)

Why? :-)

They use HANDLE. The process ID isn't nearly as useful as it is on UNIX.
I haven't looked at that stuff in a long time, but process "ID" on Windows
may be a compatibility method.

Process "ID" isn't necessarily a good way of identifying tasks,
precisely because they may be reused. Using a serial allocated at
process start might make more sense. Relying on it not being reused,
is not dissimilar to the old malloc() "tricks" of assuming that
malloc() will not return something recently free()d. It's bad.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#8Noname
mark@mark.mielke.cc
In reply to: Tom Lane (#5)
Re: Stats collection on Windows

On Tue, Apr 04, 2006 at 11:17:49PM -0400, Tom Lane wrote:

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)

Can you explain more of this? IMHO, if we rely on feature like this, the
difference is unstable-every-day vs. unstable-every-year.

The mere existence of the kill() primitive should bring to mind reasons
why it's a bad idea.

CreateProcess - "The process is assigned a process identifier. The
identifier is valid until the process terminates. It can be used to
identify the process, or specified in the OpenProcess function to open
a handle to the process. The initial thread in the process is also
assigned a thread identifier. ..."

TerminateProcess - "Terminates the specified process and all of its
threads."

TerminateProcess takes a HANDLE, not a process identifier. Yes, they
provide the "kill" primitive, but only as a compatibility measure. A
"good" Windows process, should maintain a HANDLE to the process, and
kill the process using the HANDLE. This way, there is no race. The
HANDLE is also how you wait for the process to terminate normally.

I prefer the "Redmond" way, in that I find UNIX's use of integer
identifiers to be encouraging of race conditions. UNIX requires hacks
like minimizing PID reuse, because UNIX is the one that is broken. :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#9Martijn van Oosterhout
kleptog@svana.org
In reply to: Noname (#8)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote:

TerminateProcess takes a HANDLE, not a process identifier. Yes, they
provide the "kill" primitive, but only as a compatibility measure. A
"good" Windows process, should maintain a HANDLE to the process, and
kill the process using the HANDLE. This way, there is no race. The
HANDLE is also how you wait for the process to terminate normally.

Which presents the solution, we should use the HANDLE on windows rather
than the process identifier.

I prefer the "Redmond" way, in that I find UNIX's use of integer
identifiers to be encouraging of race conditions. UNIX requires hacks
like minimizing PID reuse, because UNIX is the one that is broken. :-)

Eh? A HANDLE is (or can be mapped to) an integer too. I don't see
anything on that page about handle reuse. If you run a machine long
enough I'm sure it can be reused also...

Have a nice day,

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#10Noname
mark@mark.mielke.cc
In reply to: Martijn van Oosterhout (#9)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 09:30:06AM +0200, Martijn van Oosterhout wrote:

On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote:

TerminateProcess takes a HANDLE, not a process identifier. Yes, they
provide the "kill" primitive, but only as a compatibility measure. A
"good" Windows process, should maintain a HANDLE to the process, and
kill the process using the HANDLE. This way, there is no race. The
HANDLE is also how you wait for the process to terminate normally.

Which presents the solution, we should use the HANDLE on windows rather
than the process identifier.

Yes.

I prefer the "Redmond" way, in that I find UNIX's use of integer
identifiers to be encouraging of race conditions. UNIX requires hacks
like minimizing PID reuse, because UNIX is the one that is broken. :-)

Eh? A HANDLE is (or can be mapped to) an integer too. I don't see
anything on that page about handle reuse. If you run a machine long
enough I'm sure it can be reused also...

Once upon a time, when I played with this stuff (I mostly use UNIX, not
Windows), I concluded to myself that HANDLE was process-local, and that
it was allocated. Meaning - it won't be re-used until you CloseHandle().
It's best then, to think of HANDLE as a opaque object. Regardless, of
whether it is process-local or not, until you run CloseHandle(), it is
yours to keep, and it won't be re-used.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#11Martijn van Oosterhout
kleptog@svana.org
In reply to: Noname (#10)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote:

Once upon a time, when I played with this stuff (I mostly use UNIX, not
Windows), I concluded to myself that HANDLE was process-local, and that
it was allocated. Meaning - it won't be re-used until you CloseHandle().
It's best then, to think of HANDLE as a opaque object. Regardless, of
whether it is process-local or not, until you run CloseHandle(), it is
yours to keep, and it won't be re-used.

HANDLE is process local? That is worse then, because then there's no
guarentee that each process will see a different identifier.

The stats collector identifies processes by their process id, which
they get using getpid(). If instead they used a handle for their own
process (GetCurrentProcess() always returns -1, but you can apparently
clone it to get a real handle), you have no idea whether that handle is
unique amongst backends, because it's process local.

The stats collector doesn't have any open handles for the backend, it's
just a way for backends to identify themselves. It appears that process
handles are not up to the task either...

Do we have a plan C?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#12Magnus Hagander
mha@sollentuna.net
In reply to: Martijn van Oosterhout (#11)
Re: Stats collection on Windows

Once upon a time, when I played with this stuff (I mostly use UNIX,
not Windows), I concluded to myself that HANDLE was

process-local, and

that it was allocated. Meaning - it won't be re-used until

you CloseHandle().

It's best then, to think of HANDLE as a opaque object.

Regardless, of

whether it is process-local or not, until you run

CloseHandle(), it is

yours to keep, and it won't be re-used.

HANDLE is process local? That is worse then, because then
there's no guarentee that each process will see a different
identifier.

HANDLE is process local. What you need to do is run DuplicateHandle() on
it specifying it should "also be valid for process Y" (for which you
need a HANDLE opened, in this case the stats collector). This will give
you a new handle whichi s valid in the *target process*, but it is *not*
valid in your own process.

A quick look shows that pids appear to be reused quite fast on Windows
2000, but *not* on XP or 2003. Don't have any other versions around to
test.

AFAIK, the pid on unix is also only vaild while the process is running.
It's not likely to be reused quickly, which appears to be the same for
XP+ on win32.

Do we have a plan C?

Well, first we need to know that this really *is* the problem. I'm
unsure if we've seen "proof" that this actually causes a real problem.
(It's certainly not the main issue with teh stats system on win32, which
is the fact that it crashes now and then)

How likely is it that the arrive-in-wrong-order really happens? And can
we do something about that one? (I assume the problem is that the
backend-die message is sent from postmaster, meaning that the die can
appear before the alive sent from the backend itself. What if we changed
it so we had one message sent from a backend doing a controlled shutdown
(99.9% of the time) and a different one when the postmaster picks up a
dead backend. And we apply the "dead counter" only to the messages from
the postmaster, and expire the record right away when we receive one
from a controlled shutdown (which would happen in-line with the
backend-start message)

//Magnus

//Magnus

#13Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#12)
Re: Stats collection on Windows

Redmond crowd should be able to figure out that

recycling process

IDs instantly would be a stupid idea...)

Can you explain more of this? IMHO, if we rely on feature

like this,

the difference is unstable-every-day vs. unstable-every-year.

The mere existence of the kill() primitive should bring to mind
reasons why it's a bad idea.

CreateProcess - "The process is assigned a process
identifier. The identifier is valid until the process
terminates. It can be used to identify the process, or
specified in the OpenProcess function to open a handle to the
process. The initial thread in the process is also assigned a
thread identifier. ..."

TerminateProcess - "Terminates the specified process and all
of its threads."

TerminateProcess takes a HANDLE, not a process identifier.

Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE
access.

The functions used to enumerate processes all return the process id, not
a HANDLE.(See Process32First/Process32Next). You use the HANDLE only
when you need to *modify* something (for example, killing it).

Yes, they provide the "kill" primitive, but only as a
compatibility measure.

They do? Where is it? Certainly not in the documented SDK that I can
see.

A "good" Windows process, should
maintain a HANDLE to the process, and kill the process using
the HANDLE. This way, there is no race. The HANDLE is also
how you wait for the process to terminate normally.

A "good" Windows process would be using threads ;-)
That's what the Windows API was written for, and that's why anything
that deals with multiple processes working together is a lot harder than
on Unix.

//Magnus

#14Noname
mark@mark.mielke.cc
In reply to: Martijn van Oosterhout (#11)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 09:58:54AM +0200, Martijn van Oosterhout wrote:

On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote:

Once upon a time, when I played with this stuff (I mostly use UNIX, not
Windows), I concluded to myself that HANDLE was process-local, and that
it was allocated. Meaning - it won't be re-used until you CloseHandle().
It's best then, to think of HANDLE as a opaque object. Regardless, of
whether it is process-local or not, until you run CloseHandle(), it is
yours to keep, and it won't be re-used.

HANDLE is process local? That is worse then, because then there's no
guarentee that each process will see a different identifier.

It's no different from a file descriptor on UNIX.

Neither UNIX nor Windows promise that a process identifier is valid
beyond the life of the process. UNIX avoids it from happening, as it
is necessary to avoid races with system calls such as kill(). Windows
does not have this problem.

The stats collector identifies processes by their process id, which
they get using getpid(). If instead they used a handle for their own
process (GetCurrentProcess() always returns -1, but you can apparently
clone it to get a real handle), you have no idea whether that handle is
unique amongst backends, because it's process local.

The stats collector doesn't have any open handles for the backend, it's
just a way for backends to identify themselves. It appears that process
handles are not up to the task either...

Do we have a plan C?

Sure. Serial. Allocate on process start.

Or, back to another topic from months ago - UUID generation... :-)

Process identifier should not be used beyond the life of the process.
As soon as wait() removes the process on UNIX, the process identifier
is no longer valid, and could be reused. That the operating system tries
to prevent problems by avoiding recycling isn't necessarily a good
reason to exploit this capability.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#15Noname
mark@mark.mielke.cc
In reply to: Magnus Hagander (#13)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 10:31:37AM +0200, Magnus Hagander wrote:

TerminateProcess takes a HANDLE, not a process identifier.

Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE
access.
The functions used to enumerate processes all return the process id, not
a HANDLE.(See Process32First/Process32Next). You use the HANDLE only
when you need to *modify* something (for example, killing it).

Enumerating processes isn't a common operationg for programs other
than 'top'. :-) But even so, one can use OpenProcess() to evaluate
that the process being referenced is actually the one you think it
is, before calling TerminateProcess(), eliminating the race.

Yes, they provide the "kill" primitive, but only as a
compatibility measure.

They do? Where is it? Certainly not in the documented SDK that I can
see.

I believe it is called KillProcess().

A "good" Windows process, should
maintain a HANDLE to the process, and kill the process using
the HANDLE. This way, there is no race. The HANDLE is also
how you wait for the process to terminate normally.

A "good" Windows process would be using threads ;-)

That's what the Windows API was written for, and that's why anything
that deals with multiple processes working together is a lot harder than
on Unix.

Haha. Good point. :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#16Martijn van Oosterhout
kleptog@svana.org
In reply to: Noname (#14)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote:

It's no different from a file descriptor on UNIX.

Neither UNIX nor Windows promise that a process identifier is valid
beyond the life of the process. UNIX avoids it from happening, as it
is necessary to avoid races with system calls such as kill(). Windows
does not have this problem.

But consider, why is the process id there? (Amongst other reasons) so
that users can monitor pg_stat_activity and kill a backend that's out
of control. The equivalent to this in windows would be to.

1. Get HANDLE from the process ID.
2. TerminateProcess with that HANDLE.

Presumably users have a little GUI app that displays processes on the
system with the process ID so they can kill it. If a process ID is
quickly reused, they may end up killing the wrong one.

The race condition in this case involves the user and you can't solve
that programmatically. The non-reuse of pids is more for
user-friendlyness than anything else. The Window use of HANDLE doesn't
solve this problem at all.

Sure. Serial. Allocate on process start.

Or, back to another topic from months ago - UUID generation... :-)

Neither of which solve the "I'm a user and want to kill *that* backend"
problem. Because even on windows you'll have to get the process id,
convert it to a handle an kill it. Same race condition.

Process identifier should not be used beyond the life of the process.
As soon as wait() removes the process on UNIX, the process identifier
is no longer valid, and could be reused. That the operating system tries
to prevent problems by avoiding recycling isn't necessarily a good
reason to exploit this capability.

Yeah, but it's very useful as a user. Consider this scenerio:

<process 1234 goes AWOL>
kill -INT 1234
<check if process still there>
kill -TERM 1234
<process still there, damn it!>
kill -9 1234

There gone. With no quick PID reuse I can be sure I won't kill the
wrong one. This is presumably why recent versions of windows don't reuse
pids quickly either... It for *users* not programs.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#17Noname
mark@mark.mielke.cc
In reply to: Martijn van Oosterhout (#16)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 12:20:49PM +0200, Martijn van Oosterhout wrote:

On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote:

It's no different from a file descriptor on UNIX.
Neither UNIX nor Windows promise that a process identifier is valid
beyond the life of the process. UNIX avoids it from happening, as it
is necessary to avoid races with system calls such as kill(). Windows
does not have this problem.

But consider, why is the process id there? (Amongst other reasons) so
that users can monitor pg_stat_activity and kill a backend that's out
of control. The equivalent to this in windows would be to.
1. Get HANDLE from the process ID.
2. TerminateProcess with that HANDLE.

You missed 1.5. Ensure that HANDLE matches the process you intend. :-)

This is a little bit of a distraction though, as any system that requires
the user to be able to kill broken backends, is only indicative of a
broken backend. We're talking about how to deal with a broken process,
after the process owner (PostgreSQL) has forgotten about it.

What would be wrong with having a PostgreSQL generated serial associated
with each backend, that can be used by the backend process owner, to
map to a HANDLE, which uses TerminateProcess() underneath?

Presumably users have a little GUI app that displays processes on the
system with the process ID so they can kill it. If a process ID is
quickly reused, they may end up killing the wrong one.

Sure. But the problem here, is that PostgreSQL is broken, so they find a
need to go to their process listing GUI. And who is to say that the GUI
doesn't do as I've suggested above? Ensure that TerminateProcess() is
against the intended process? I have no idea if it does - but it could.

The race condition in this case involves the user and you can't solve
that programmatically. The non-reuse of pids is more for
user-friendlyness than anything else. The Window use of HANDLE doesn't
solve this problem at all.

Sure you can. But it also shouldn't matter.

Sure. Serial. Allocate on process start.
Or, back to another topic from months ago - UUID generation... :-)

Neither of which solve the "I'm a user and want to kill *that* backend"
problem. Because even on windows you'll have to get the process id,
convert it to a handle an kill it. Same race condition.

No, and not if PostgreSQL kills the backend for you cleanly, using the
HANDLE, that it owns for the process.

Process identifier should not be used beyond the life of the process.
As soon as wait() removes the process on UNIX, the process identifier
is no longer valid, and could be reused. That the operating system tries
to prevent problems by avoiding recycling isn't necessarily a good
reason to exploit this capability.

Yeah, but it's very useful as a user. Consider this scenerio:
<process 1234 goes AWOL>
kill -INT 1234
<check if process still there>
kill -TERM 1234
<process still there, damn it!>
kill -9 1234
There gone. With no quick PID reuse I can be sure I won't kill the
wrong one. This is presumably why recent versions of windows don't reuse
pids quickly either... It for *users* not programs.

Users shouldn't need to kill programs.

If they do - on the off chance that they do, they should be more
careful than you list above.

That's a kneejerk reaction to a problem that shouldn't occur in the
first place. It's a habit trained by UNIX users.

I rarely ever need to kill a process on my Windows box, and when I have,
the process listing in the GUI hasn't offered me a process ID. I click
on the item I wish to kill, and I right click "End Process". A lot more
user friendly, in my opinion. And if they happen to fix the race in the
background using the method I suggest above? All the better...

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#18Magnus Hagander
mha@sollentuna.net
In reply to: Noname (#17)
Re: Stats collection on Windows

It's no different from a file descriptor on UNIX.

Neither UNIX nor Windows promise that a process identifier is valid
beyond the life of the process. UNIX avoids it from

happening, as it

is necessary to avoid races with system calls such as

kill(). Windows

does not have this problem.

But consider, why is the process id there? (Amongst other
reasons) so that users can monitor pg_stat_activity and kill
a backend that's out of control. The equivalent to this in
windows would be to.

1. Get HANDLE from the process ID.
2. TerminateProcess with that HANDLE.

Presumably users have a little GUI app that displays
processes on the system with the process ID so they can kill
it. If a process ID is quickly reused, they may end up
killing the wrong one.

Actualy, not quite. That's the equivalent of kill -9. What you'd do is
"pg_ctl kill TERM <pid>". Or whatever other signal you want to send. But
you still need the pid, there is no way to get the required information
from a plain handle when you're not in that process.

Though I'd say I'm usually more interested in noticing what process it
is, rather than killing it. Possibly canceling it, but very seldomly
actually kill it. But it still needs the process id, yes.

(BTW, the GUI would generaelly be Task Manager, so it's not like it's a
special tool)

<snip rest>

//Magnus

#19Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#18)
Re: Stats collection on Windows

Yes, they provide the "kill" primitive, but only as a

compatibility

measure.

They do? Where is it? Certainly not in the documented SDK

that I can

see.

I believe it is called KillProcess().

No such function documented on my MSDN, other than one in SQL Server
DMO.
Are you referring to TerminateProcess()? If so, it can just terminate
the process, not send it arbitrary signals. Considering windows doesnt'
reallyi have the concept of signals, that's not very surprising though
:-)

//Magnus

#20Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#19)
Re: Stats collection on Windows

This is a little bit of a distraction though, as any system
that requires the user to be able to kill broken backends, is
only indicative of a broken backend. We're talking about how
to deal with a broken process, after the process owner
(PostgreSQL) has forgotten about it.

Don't confuse the postmaster with the statistics system. the postmaster
knows about the backends per handle, but the stats system runs outside
of the postmaster, and can possibly get confused.

What would be wrong with having a PostgreSQL generated serial
associated with each backend, that can be used by the backend
process owner, to map to a HANDLE, which uses
TerminateProcess() underneath?

It might be a good idea to use the combination pid+internalid. Note taht
we *don't* just want to TerminateProcess(), we need the pid to send
actual signals with. And in this case, that's not even the problem - the
problem is that we're not dropping the process from our list (I think).

Presumably users have a little GUI app that displays

processes on the

system with the process ID so they can kill it. If a process ID is
quickly reused, they may end up killing the wrong one.

Sure. But the problem here, is that PostgreSQL is broken, so
they find a need to go to their process listing GUI. And who
is to say that the GUI doesn't do as I've suggested above?
Ensure that TerminateProcess() is against the intended
process? I have no idea if it does - but it could.

Because the GUI is most likely either Task Manager or Process Explorer.
There is no need for a special GUI.

Process identifier should not be used beyond the life of

the process.

As soon as wait() removes the process on UNIX, the process
identifier is no longer valid, and could be reused. That the
operating system tries to prevent problems by avoiding recycling
isn't necessarily a good reason to exploit this capability.

Yeah, but it's very useful as a user. Consider this scenerio:
<process 1234 goes AWOL>
kill -INT 1234
<check if process still there>
kill -TERM 1234
<process still there, damn it!>
kill -9 1234
There gone. With no quick PID reuse I can be sure I won't kill the
wrong one. This is presumably why recent versions of windows don't
reuse pids quickly either... It for *users* not programs.

Users shouldn't need to kill programs.

And normally they don't. But again, this isn't where the problem is. If
your process is stuck and you need to kill it, it most likely will still
be around once you get to killing it. If not, you were trying to kill it
too soon.

I rarely ever need to kill a process on my Windows box, and
when I have, the process listing in the GUI hasn't offered me
a process ID. I click on the item I wish to kill, and I right
click "End Process". A lot more user friendly, in my opinion.
And if they happen to fix the race in the background using
the method I suggest above? All the better...

AFAIK, Task Manager uses the pid to kill the task. Try tracing it. (And
it will happily show you the pid as well - just enable it in the
checkbox. It's the only reasonable way to differ between two processes
off the same exe, for example two different notepad:s...)

//Magnus

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#12)
Re: Stats collection on Windows

"Magnus Hagander" <mha@sollentuna.net> writes:

HANDLE is process local? That is worse then, because then
there's no guarentee that each process will see a different
identifier.

HANDLE is process local. What you need to do is run DuplicateHandle() on
it specifying it should "also be valid for process Y" (for which you
need a HANDLE opened, in this case the stats collector). This will give
you a new handle whichi s valid in the *target process*, but it is *not*
valid in your own process.

What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?

regards, tom lane

#22Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#21)
Re: Stats collection on Windows

HANDLE is process local? That is worse then, because then

there's no

guarentee that each process will see a different identifier.

HANDLE is process local. What you need to do is run

DuplicateHandle()

on it specifying it should "also be valid for process Y" (for which
you need a HANDLE opened, in this case the stats

collector). This will

give you a new handle whichi s valid in the *target

process*, but it

is *not* valid in your own process.

What happens if process Y goes away between the time you
obtain a handle for it and the time you try to run this
DuplicateHandle call?

I don't know offhand. I would assume one of two things:
1) DuplicateHandle() fails.
2) You get a handle back that is valid in the dead process (meaning it's
not valid).

I can put together some quick test-code for this if you need me to?

//Magnus

#23stephen joseph butler
stephen.butler@gmail.com
In reply to: Tom Lane (#21)
Re: Stats collection on Windows

2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:

What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?

The structures for a process should remain as long as there are any open
HANDLEs, even if the process has ended. Threads behave the same way, which
is why a creator needs to remember to always close the HANDLE returned from
CreateThread.

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#22)
Re: Stats collection on Windows

"Magnus Hagander" <mha@sollentuna.net> writes:

What happens if process Y goes away between the time you
obtain a handle for it and the time you try to run this
DuplicateHandle call?

I can put together some quick test-code for this if you need me to?

Nah, it was just a rhetorical question meant to poke a hole in the
claim that Windows can avoid race conditions by using HANDLEs.

AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.

regards, tom lane

#25stephen joseph butler
stephen.butler@gmail.com
In reply to: Tom Lane (#24)
Re: Stats collection on Windows

2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:

AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.

There's a disconnect here. handles aren't process identifiers: they're
reference counted "pointers" to the kernel structures for the process. If
you are holding a handle (ie: from CreateProcess or OpenProcess) that handle
cannot and will not be reclaimed until you call CloseHandle (or your process
itself exits). You should never retain a handle after you've called
CloseHandle on it.

Which brings an interesting thought: are process ID's reclaimed while open
handles remain? I'm willing to bet the answer is no. In that case, the stats
collector could retain the handle until it's done with the process ID.

#26Magnus Hagander
mha@sollentuna.net
In reply to: stephen joseph butler (#25)
Re: Stats collection on Windows

What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this

DuplicateHandle call?

I can put together some quick test-code for this if you need me to?

Nah, it was just a rhetorical question meant to poke a hole
in the claim that Windows can avoid race conditions by using HANDLEs.

That's what I thought, which is why I asked first.

AFAICS, don't-reuse-PIDs-too-quick has exact analogs that
Windows has to solve by ensuring it doesn't reuse HANDLEs too quick.

Windows doesn't "reuse HANDLEs" in that way. Handles are like file
descriptors.
The kernel structure represnting the process has a *process id*, not a
handle. That is what uniquely identifies the process. (In the kernel
it's often referred to as a "client id").

The process structure will be held as long as anybody has an open handle
to the process. Thus, as long as this happens, the pid *cannot* be
reused. So one way to assure the pid isn't recycled too soon is to keep
the postmasters handle open "long enough" (the postmaster already has a
handle there - it's just the pgstats process that doesn't have one).
RIght now we close the postmaster handle first (win32_removechild at ~
2086 in postmaster.c) and then fire off the message (CleanupBackend at
~2239).

//Magnus

#27Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#26)
Re: Stats collection on Windows

There's a disconnect here. handles aren't process
identifiers: they're reference counted "pointers" to the
kernel structures for the process. If you are holding a
handle (ie: from CreateProcess or OpenProcess) that handle
cannot and will not be reclaimed until you call CloseHandle
(or your process itself exits). You should never retain a
handle after you've called CloseHandle on it.

Which brings an interesting thought: are process ID's
reclaimed while open handles remain? I'm willing to bet the
answer is no. In that case, the stats collector could retain
the handle until it's done with the process ID.

Since the process id lives in the kernel structure they point to, they
can't very well be recycled... (classical
handle-leak-kills-windows-system-by-running-out-of-kernel-space
scenario).

The problem is, the stats collector doesn't have the handle in the first
place. I guess it could open one upon receiving the bestart message, but
it seems like a kludge. Would be nicer if it could be done cleaner :-)

//Magnus

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Brant (#1)
Re: Stats collection on Windows

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

I added some strategic printfs to pgstat.c. Attached is the output when
a little program is run which, in a loop, makes 10 connections, sleeps 3
seconds, closes them, sleeps another 3 seconds. My workstation (Windows
XP) was otherwise idle.

Search for "is known to be dead, ignoring" to find the re-used process
IDs. Things start out clean, but after a few cycles anywhere between 1
and 5 backends are being missed.

Looking at the pgstats code, I notice that once it makes an entry in the
dead-backends hashtable, it keeps that entry (rejecting any messages
with the same PID) for 10 seconds. That seems like approximately
forever on modern machines, certainly much more than any plausible
out-of-order condition in the UDP packet stream. It could easily be
enough to get us in trouble on Unix machines, never mind Windows.

A conservative suggestion would be to trim down the destroy interval.
A more radical one is to question whether we need the destroy delay
mechanism at all. What if we got rid of all that logic and simply let
the collector delete stuff when it's told to? Out-of-order messages
could cause entries to be re-created after they've been deleted, but
I'm not sure that I see any harm in that. Bogus DB and table entries
are already ignored in the pgstats views (because they won't join to
anything in the system catalogs) and we also have a filter for bogus
backend entries. There are also mechanisms that ensure these entries
will go away eventually: pgstat_vacuum_tabstat for DB and table
entries, and eventual re-use of a BackendId slot for backends.
So I'm sort of thinking that the destroy delay has outlived its
usefulness.

regards, tom lane

#29Magnus Hagander
mha@sollentuna.net
In reply to: Tom Lane (#28)
Re: Stats collection on Windows

Search for "is known to be dead, ignoring" to find the

re-used process

IDs. Things start out clean, but after a few cycles

anywhere between

1 and 5 backends are being missed.

Looking at the pgstats code, I notice that once it makes an
entry in the dead-backends hashtable, it keeps that entry
(rejecting any messages with the same PID) for 10 seconds.
That seems like approximately forever on modern machines,
certainly much more than any plausible out-of-order condition
in the UDP packet stream. It could easily be enough to get
us in trouble on Unix machines, never mind Windows.

A conservative suggestion would be to trim down the destroy interval.
A more radical one is to question whether we need the destroy
delay mechanism at all. What if we got rid of all that logic
and simply let the collector delete stuff when it's told to?
Out-of-order messages could cause entries to be re-created
after they've been deleted, but I'm not sure that I see any
harm in that. Bogus DB and table entries are already ignored
in the pgstats views (because they won't join to anything in
the system catalogs) and we also have a filter for bogus
backend entries. There are also mechanisms that ensure these
entries will go away eventually: pgstat_vacuum_tabstat for DB
and table entries, and eventual re-use of a BackendId slot
for backends.
So I'm sort of thinking that the destroy delay has outlived
its usefulness.

After reviewing my own patch I just sent in (stats write delay), I
realised I had just upped that from 10 seconds to 5 minutes. Which is
then even longer than forever :)
Which prompted me to consider suggesting just this - do we really need
the destroy functionality. From what I could tell it did look reasonable
to get rid of it for these reasons.

//Magnus

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#29)
Re: Stats collection on Windows

"Magnus Hagander" <mha@sollentuna.net> writes:

After reviewing my own patch I just sent in (stats write delay), I
realised I had just upped that from 10 seconds to 5 minutes. Which is
then even longer than forever :)
Which prompted me to consider suggesting just this - do we really need
the destroy functionality.

Which was in fact exactly my train of thought ...

regards, tom lane

#31Noname
mark@mark.mielke.cc
In reply to: Tom Lane (#21)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 10:05:56AM -0400, Tom Lane wrote:

What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?

Think of Windows HANDLE like UNIX fd, but Windows HANDLE works for
everything - not just sockets, files, pipes, and character devices.

Process Y doesn't go away until all references to it, via HANDLE, have
been closed. It may not be running. It may have an exit status available.
It doesn't go away, though, until you are done with it, and everybody
who has a reference to it does CloseHandle().

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#32Noname
mark@mark.mielke.cc
In reply to: Tom Lane (#24)
Re: Stats collection on Windows

On Wed, Apr 05, 2006 at 10:30:36AM -0400, Tom Lane wrote:

"Magnus Hagander" <mha@sollentuna.net> writes:

What happens if process Y goes away between the time you
obtain a handle for it and the time you try to run this
DuplicateHandle call?

I can put together some quick test-code for this if you need me to?

Nah, it was just a rhetorical question meant to poke a hole in the
claim that Windows can avoid race conditions by using HANDLEs.
AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.

No. It means you don't understand what a HANDLE is.

But that's fine - because you understand DB stuff to compensate... :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#33Peter Brant
Peter.Brant@wicourts.gov
In reply to: Tom Lane (#28)
Re: Stats collection on Windows

I'm going to rip out the destroy code and see how it goes. Patch will
be forthcoming if things turn out well.

Pete

Tom Lane <tgl@sss.pgh.pa.us> 04/05/06 4:49 pm >>>

So I'm sort of thinking that the destroy delay has outlived its
usefulness.

regards, tom lane