fork/exec patch
This patch is the next step towards (re)allowing fork/exec.
Bruce, I've cleaned up the parts we discussed, and, pending objections from
anyone else, it is ready for application to HEAD.
Cheers,
Claudio
---
Certain disclaimers and policies apply to all email sent from Memetrics.
For the full text of these disclaimers and policies see
<a
href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em
ailpolicy.html</a>
Attachments:
diff2c.outapplication/octet-stream; name=diff2c.outDownload+523-262
Let me provide a summary of this patch because I reviewed the first
version.
The patch basically is a slight rearrangement of the code to allow
fork/exec on Unix, with the ultimate goal of doing CreateProcess on
Win32. The changes are:
o Write out postmaster global variables and per-backend
variables to be read by the exec'ed backend
o Mark some static variables as global when exec is used so
then can be dumped from postmaster.c, marked NON_EXEC_STATIC
o Remove value passing with -p now that we have per-backend
file
o Move some pointer storage out of shared memory for easier
dumping.
o Modified pgsql_temp directory cleanup to handle per-database
directories and the backend exec directory under datadir.
Let me add that Claudio is doing a fantastic job on this. The changes
are minimal and clean. I think the writing of a per-backend temp file
has allowed this patch to be smaller than it might have been.
---------------------------------------------------------------------------
Claudio Natoli wrote:
This patch is the next step towards (re)allowing fork/exec.
Bruce, I've cleaned up the parts we discussed, and, pending objections from
anyone else, it is ready for application to HEAD.Cheers,
Claudio--- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
[ Attachment, skipping... ]
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Let me add that Claudio is doing a fantastic job on this. The
changes are minimal and clean. I think the writing of a per-backend
temp file has allowed this patch to be smaller than it might have
been.
Did we REALLY conclude that the best way to work around the lack of
fork() on Win32 is by writing variables out to disk and reading them
back in? Frankly, that seems like an enormous kludge.
For example, couldn't we write this data into a particular location in
shared memory, and then pass that location to the child? That is still
ugly, slow, and prone to failure (shmem being statically sized), but
ISTM that the proposed implementation already possesses those
attributes :-)
(/me goes off to re-read the archives on this issue...)
-Neil
Neil Conway wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Let me add that Claudio is doing a fantastic job on this. The
changes are minimal and clean. I think the writing of a per-backend
temp file has allowed this patch to be smaller than it might have
been.Did we REALLY conclude that the best way to work around the lack of
fork() on Win32 is by writing variables out to disk and reading them
back in? Frankly, that seems like an enormous kludge.For example, couldn't we write this data into a particular location in
shared memory, and then pass that location to the child? That is still
ugly, slow, and prone to failure (shmem being statically sized), but
ISTM that the proposed implementation already possesses those
attributes :-)
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that. Of course, this one is
per-backend.
Yea, we could use shared memory for this too, but I don't see a problem
with using the file system. Older releases of PostgreSQL read from
postgresql.conf or pg_hba.conf or other files for every connection so I
don't see that using the file system is going to be that slow. Of
course, we removed those file reads because they were slow, but they
were also much larger and more complex in requiring parsing and stuff,
while this is just a list of binary values. We also have a GUC dump
file that is read by exec too.
The downside of shared memory is that you would need the postmaster to
write into shared memory and you have to allocate enough shared memory
for all backends, but clearly it could be done. You could just pass the
backend slot number to the child and the child could read from the
offset. Not sure how to cleanly do the GUC dump file because it is of
variable length depending on the number of GUC variables changed.
I guess the big question is whether it is worth doing in shared memory.
We also would have to pass the shared memory address to the child, and
make sure the child knew the proper offset in shared memory to find its
values.
Of course, stuff that is variable length would be a problem in shared
memory, and we have some of those for the GUC defaults.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.
For the record, I think that is ugly as well :-)
Anyway, I'm not necessarily arguing that using shmem is the right way
to go here -- that was merely an off-the-cuff suggestion. I'm just
saying that whatever solution we end up with, ISTM we can do better
than writing out + reading in a file for /every/ new connection.
-Neil
Neil Conway wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.For the record, I think that is ugly as well :-)
Anyway, I'm not necessarily arguing that using shmem is the right way
to go here -- that was merely an off-the-cuff suggestion. I'm just
saying that whatever solution we end up with, ISTM we can do better
than writing out + reading in a file for /every/ new connection.
[ Moved to hackers and win32. Discussion is writing postmaster-constant
and per-backend variables to a file for exec'ed backends to read.]
Sure --- I am all ears. I am looking for suggestions. I couldn't think
of anything better. I did ask a month ago for ideas on how to do this,
but got no reply.
One idea I had was to write the postmaster-constant values into one
file, and the per-backend values into another so you would write less
data for every backend, but then every backend has to read two files.
Is that a win?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
pgman wrote:
Neil Conway wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.For the record, I think that is ugly as well :-)
Anyway, I'm not necessarily arguing that using shmem is the right way
to go here -- that was merely an off-the-cuff suggestion. I'm just
saying that whatever solution we end up with, ISTM we can do better
than writing out + reading in a file for /every/ new connection.[ Moved to hackers and win32. Discussion is writing postmaster-constant
and per-backend variables to a file for exec'ed backends to read.]Sure --- I am all ears. I am looking for suggestions. I couldn't think
of anything better. I did ask a month ago for ideas on how to do this,
but got no reply.One idea I had was to write the postmaster-constant values into one
file, and the per-backend values into another so you would write less
data for every backend, but then every backend has to read two files.
Is that a win?
OK, new idea! Let's write the postmaster-constant values to a file, and
pass the per-backend variables on the command line using -p like we have
in the current code in CVS. However, those values include the cancel
key, which has to be a secret. Maybe we need a per-backend array in
shared memory just for those keys. The postmaster has to keep those
keys anyway, so moving into shared memory might be the right solution.
Right now the cancel key is stored in the Backend struct along with the
backend pid but that is only in the postmaster.
Does this make sense? It does remove the pgsql_temp directory we were
going to need in the datadir.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Import Notes
Reply to msg id not found: | Resolved by subject fallback
On Sun, 14 Dec 2003, Bruce Momjian wrote:
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that. Of course, this one is
per-backend.Yea, we could use shared memory for this too, but I don't see a problem
with using the file system.
Why not use an anonymous pipe to send data from the parent to the child
process? That is a common way to handle this problem in win32 (and in unix
by the way). The parent sets up the pipe and the child process inherits
the handle, and after that the child and parent can excange information in
private.
--
/Dennis
Dennis Bjorklund wrote:
On Sun, 14 Dec 2003, Bruce Momjian wrote:
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that. Of course, this one is
per-backend.Yea, we could use shared memory for this too, but I don't see a problem
with using the file system.Why not use an anonymous pipe to send data from the parent to the child
process? That is a common way to handle this problem in win32 (and in unix
by the way). The parent sets up the pipe and the child process inherits
the handle, and after that the child and parent can excange information in
private.
Doesn't that require the postmaster to stay around to feed that
information into the pipe or can the postmaster just shove the data and
continue on, and how do the old pipes get cleaned up? Seems messy.
Also has to work on Unix too for testing.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Sun, 14 Dec 2003, Bruce Momjian wrote:
Why not use an anonymous pipe to send data from the parent to the child
process?Doesn't that require the postmaster to stay around to feed that
information into the pipe or can the postmaster just shove the data and
continue on, and how do the old pipes get cleaned up?
I think that one can just output the data and close that end of the pipe.
But i've not looked at win32 the last 5 years or so, I could be wrong.
Seems messy.
Maybe, but to me the solution where you write to files are much more ugly.
If one does not like pipes, there are other ipc mechanisms that does not
involve creating, reading and deleting a file on each connect.
Does windows have a temp filesystem where the temp files are not actually
written out on disk? It's still ugly but better then hitting a disk all
the time.
Also has to work on Unix too for testing.
Everything can not work in unix, CreateProcess() and fork() are different.
However, the pipe solution can be mimiced in unix, but it will not be the
same code since the api's are different. So that does not give much.
--
/Dennis
Hi all,
Dennis Bjorklund wrote:
Also has to work on Unix too for testing.
Everything can not work in unix, CreateProcess() and fork()
are different.
True (but CreateProcess and "fork followed by exec" are pretty close). I
think what Bruce is implying is that, ideally, we'd like to keep things as
close as possible between Unix fork/exec and Windows code bases, so that:
* it remains possible to advance the Windows port under a *nix dev
environment and
* should (when!) issues arise in the Windows implementation, it will
be easier to identify and debug
Neil Conway wrote:
For example, couldn't we write this data into a particular location in
shared memory, and then pass that location to the child? That is still
ugly, slow, and prone to failure (shmem being statically sized), but
ISTM that the proposed implementation already possesses those
attributes :-)
I agree that this is a better implementation.
Bruce, do we implement this now, or just hold it as something to revisit
down the track? I'm all for leaving it as is.
Moreover, in general, how do we handle things like this? IMHO, I'd rather
live with a few kludges (that don't impact the *nix code) until the Windows
port is actually a reality, and then reiterate (having the discussions as we
go along, however, is necessary). Perhaps adding a TO_REVISIT section to
your Win32 Status Report page?
Or do people have strong leanings towards "fix as you go along"? Just feels
like that way could see us getting bogged down making things "perfect"
instead of advancing the port...
Comments?
Cheers,
Claudio
---
Certain disclaimers and policies apply to all email sent from Memetrics.
For the full text of these disclaimers and policies see
<a
href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em
ailpolicy.html</a>
Import Notes
Resolved by subject fallback
On Mon, 15 Dec 2003, Claudio Natoli wrote:
Moreover, in general, how do we handle things like this? IMHO, I'd rather
live with a few kludges (that don't impact the *nix code) until the Windows
port is actually a reality
As long as it does not hurt the unix code it's not a big problem as I see
it. The usual open source solution is that since no one else writes the
code, you can do it the way you think works the best. To change this in
the future does not mean that everything else has to be rewritten which is
good.
It does also not mean that one can not discuss the implementation. A fair
amount of discussion is always good.
--
/Dennis
[moved to hackers / win32]
Claudio Natoli wrote:
Or do people have strong leanings towards "fix as you go along"? Just feels
like that way could see us getting bogged down making things "perfect"
instead of advancing the port...
w.r.t. Win32, I think the way to proceed is (in this order):
. make it work
. make it elegant
. make it fast
BTW, I agree with Bruce, you're doing excellent stuff. Now for the fun
part (signals).
cheers
andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.
You can hardly claim that "no one had issues with that". I complained
about it and I think other people did too. It's a messy, ugly approach;
moreover we have no field experience that says it's reliable.
It may be the least messy, ugly approach available, but I concur with
Neil's wish to at least look for other answers.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
... Maybe we need a per-backend array in
shared memory just for those keys. The postmaster has to keep those
keys anyway, so moving into shared memory might be the right solution.
The postmaster's dependence on the contents of shared memory should
ideally be zero (and it is zero, or nearly so, at the moment).
Otherwise a backend crash that clobbers shared memory poses the risk of
taking down the postmaster as well. We can't go in that direction.
regards, tom lane
On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.You can hardly claim that "no one had issues with that". I complained
about it and I think other people did too. It's a messy, ugly approach;
moreover we have no field experience that says it's reliable.
Don't the FSM and the system catalog cache use a similar mechanism?
--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Lim�tate a mirar... y algun d�a veras"
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote:
You can hardly claim that "no one had issues with that".
Don't the FSM and the system catalog cache use a similar mechanism?
FSM uses a backing file to hold information over a database shutdown
(write once during shutdown, read once during startup). That's a little
different from once per backend fork. Also, there are no race
conditions to worry about, and finally the system does not *require* the
backing file to be either present or correct.
The catalog cache uses a file that typically gets updated once per
VACUUM. Again, the file does not have to be present, nor correct.
There are mechanisms in place to deal with the cases (including race
conditions) where it's broken or obsolete.
I have not looked at the proposed fork/exec code in any detail, but
IIUC it will be *necessary* that the intermediate file be present, and
correct. I think a minimum requirement for accepting this solution is a
sketch of how race conditions will be dealt with (ie, postmaster changes
its own value of some variable immediately after making the temp file).
I don't necessarily say that the first-cut patch has to get it right,
but we'd better understand how we will get to where it is right.
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote:
You can hardly claim that "no one had issues with that".
Don't the FSM and the system catalog cache use a similar mechanism?
FSM uses a backing file to hold information over a database shutdown
(write once during shutdown, read once during startup). That's a little
different from once per backend fork. Also, there are no race
conditions to worry about, and finally the system does not *require* the
backing file to be either present or correct.The catalog cache uses a file that typically gets updated once per
VACUUM. Again, the file does not have to be present, nor correct.
There are mechanisms in place to deal with the cases (including race
conditions) where it's broken or obsolete.I have not looked at the proposed fork/exec code in any detail, but
IIUC it will be *necessary* that the intermediate file be present, and
correct. I think a minimum requirement for accepting this solution is a
sketch of how race conditions will be dealt with (ie, postmaster changes
its own value of some variable immediately after making the temp file).
I don't necessarily say that the first-cut patch has to get it right,
but we'd better understand how we will get to where it is right.
Agreed, added to the Win32 status page:
* remove per-backend parameter file and move into shared memory
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Claudio Natoli wrote:
For example, couldn't we write this data into a particular location in
shared memory, and then pass that location to the child? That is still
ugly, slow, and prone to failure (shmem being statically sized), but
ISTM that the proposed implementation already possesses those
attributes :-)I agree that this is a better implementation.
Bruce, do we implement this now, or just hold it as something to revisit
down the track? I'm all for leaving it as is.Moreover, in general, how do we handle things like this? IMHO, I'd rather
live with a few kludges (that don't impact the *nix code) until the Windows
port is actually a reality, and then reiterate (having the discussions as we
go along, however, is necessary). Perhaps adding a TO_REVISIT section to
your Win32 Status Report page?Or do people have strong leanings towards "fix as you go along"? Just feels
like that way could see us getting bogged down making things "perfect"
instead of advancing the port...
Let's get it working first. I have added an item to the Win32 status
page so we will not forget it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't think we ever discussed it, but it seemed logical and a minimal
change to the code. We already have a GUC write of non-default values
for exec and no one had issues with that.You can hardly claim that "no one had issues with that". I complained
about it and I think other people did too. It's a messy, ugly approach;
moreover we have no field experience that says it's reliable.It may be the least messy, ugly approach available, but I concur with
Neil's wish to at least look for other answers.
Absolutely. I am not happy with the GUC file either, but can't see a
better way right now. I have already documented your concern about the
GUC race condition issue on the Win32 status page so we will not forget
about it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073