What is happening on buildfarm member baiji?

Started by Tom Lane over 18 years ago, 57 messages
#1Tom Lane
tgl@sss.pgh.pa.us

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

regards, tom lane

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

The script will not even run if the install directory exists:

die "$buildroot/$branch has $pgsql or inst directories!"
if ((!$from_source && -d $pgsql) || -d "inst");

But the install process is different for MSVC. It could be that we are
screwing up there.

I no longer have an MSVC box, so I can't tell so easily ;-(

cheers

andrew

#3Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#2)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Tom Lane wrote:

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

The script will not even run if the install directory exists:

die "$buildroot/$branch has $pgsql or inst directories!"
if ((!$from_source && -d $pgsql) || -d "inst");

But the install process is different for MSVC. It could be that we are
screwing up there.

Uh, but that piece of code you're referring to is from the buildfarm
code, right? Isn't it the same?

I no longer have an MSVC box, so I can't tell so easily ;-(

Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
example - skylark fails on something completely different, not fully
investigated yet, but looks to be a buildfarm problem rather than a
backend one), so I don't think it's the MSVC procedure alone that's the
cause of it.

//Magnus

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#3)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

Andrew Dunstan wrote:

Tom Lane wrote:

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

The script will not even run if the install directory exists:

die "$buildroot/$branch has $pgsql or inst directories!"
if ((!$from_source && -d $pgsql) || -d "inst");

But the install process is different for MSVC. It could be that we are
screwing up there.

Uh, but that piece of code you're referring to is from the buildfarm
code, right? Isn't it the same?

Yes, but it might be that the MSVC install doesn't actually use that
location properly. Unfortunately, its logging is less than verbose, unlike
the standard install procedure.

I no longer have an MSVC box, so I can't tell so easily ;-(

Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
example - skylark fails on something completely different, not fully
investigated yet, but looks to be a buildfarm problem rather than a
backend one), so I don't think it's the MSVC procedure alone that's the
cause of it.

Possibly. My point was that I can't even investigate how MSVC is working
at all.

cheers

andrew

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

My point was that I can't even investigate how MSVC is working
at all.

So what is it you're looking for, specifically, to help with that?

As a very bare minimum, we need to change the installation procedure to
log its destination.

Unless that has somehow got screwed up I can't see how Tom's theory of a
possibly left over .bki file can stand up.

cheers

andrew

#6Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#4)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Magnus Hagander wrote:

Andrew Dunstan wrote:

Tom Lane wrote:

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

The script will not even run if the install directory exists:

die "$buildroot/$branch has $pgsql or inst directories!"
if ((!$from_source && -d $pgsql) || -d "inst");

But the install process is different for MSVC. It could be that we are
screwing up there.

Uh, but that piece of code you're referring to is from the buildfarm
code, right? Isn't it the same?

Yes, but it might be that the MSVC install doesn't actually use that
location properly. Unfortunately, its logging is less than verbose, unlike
the standard install procedure.

I no longer have an MSVC box, so I can't tell so easily ;-(

Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
example - skylark fails on something completely different, not fully
investigated yet, but looks to be a buildfarm problem rather than a
backend one), so I don't think it's the MSVC procedure alone that's the
cause of it.

Possibly. My point was that I can't even investigate how MSVC is working
at all.

So what is it you're looking for, specifically, to help with that?

//Magnus

#7Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#5)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Magnus Hagander wrote:

My point was that I can't even investigate how MSVC is working
at all.

So what is it you're looking for, specifically, to help with that?

As a very bare minimum, we need to change the installation procedure to
log its destination.

Unless that has somehow got screwed up I can't see how Tom's theory of a
possibly left over .bki file can stand up.

Just to be clear, are you looking for something as simple as this?

Index: Install.pm
===================================================================
RCS file: /cvsroot/pgsql/src/tools/msvc/Install.pm,v
retrieving revision 1.14
diff -c -r1.14 Install.pm
*** Install.pm  25 Apr 2007 19:00:05 -0000      1.14
--- Install.pm  13 May 2007 15:21:51 -0000
***************
*** 35,41 ****
          $conf = "release";
      }
      die "Could not find debug or release binaries" if ($conf eq "");
!     print "Installing for $conf\n";
      EnsureDirectories($target,
'bin','lib','share','share/timezonesets','share/contrib','doc',
          'doc/contrib', 'symbols');
--- 35,41 ----
          $conf = "release";
      }
      die "Could not find debug or release binaries" if ($conf eq "");
!     print "Installing for $conf in $target\n";
      EnsureDirectories($target,
'bin','lib','share','share/timezonesets','share/contrib','doc',
          'doc/contrib', 'symbols');

//Magnus

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#7)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

! print "Installing for $conf in $target\n";

Looks like a good place to start, sure.

cheers

andrew

#9Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#8)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Magnus Hagander wrote:

! print "Installing for $conf in $target\n";

Looks like a good place to start, sure.

Ok. Applied.

//Magnus

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: What is happening on buildfarm member baiji?

"Andrew Dunstan" <andrew@dunslane.net> writes:

Unless that has somehow got screwed up I can't see how Tom's theory of a
possibly left over .bki file can stand up.

Well, I tried inserting a .bki file from April 30 into a HEAD
installation, and that made it dump core during bootstrap, so that
offhand theory was wrong.

However, when I run the HEAD regression tests against that entire
April 30 installation tree, I can duplicate the baiji regression diffs
almost exactly --- the polymorphism test fails for me where it succeeds
on baiji, which I think indicates that baiji has the patch I applied on
May 1 for SQL function inlining.

So I now state fairly confidently that baiji is failing to overwrite
*any* of the installation tree, /share and /bin both, and instead is
testing an installation dating from sometime between May 1 and May 11.
Have there been any recent changes in either the buildfarm script or
the MSVC install code that might have changed where the install is
supposed to go?

regards, tom lane

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

"Andrew Dunstan" <andrew@dunslane.net> writes:

Unless that has somehow got screwed up I can't see how Tom's theory of a
possibly left over .bki file can stand up.

Well, I tried inserting a .bki file from April 30 into a HEAD
installation, and that made it dump core during bootstrap, so that
offhand theory was wrong.

However, when I run the HEAD regression tests against that entire
April 30 installation tree, I can duplicate the baiji regression diffs
almost exactly --- the polymorphism test fails for me where it succeeds
on baiji, which I think indicates that baiji has the patch I applied on
May 1 for SQL function inlining.

So I now state fairly confidently that baiji is failing to overwrite
*any* of the installation tree, /share and /bin both, and instead is
testing an installation dating from sometime between May 1 and May 11.
Have there been any recent changes in either the buildfarm script or
the MSVC install code that might have changed where the install is
supposed to go?

Not to my knowledge, but I have no method of testing what's going on,
and I hate guessing like this - in fact this is what has worried me all
along about supporting MSVC builds - we always said we didn't want to
have to have 2 build environments, but now we have two and we'll be
supporting them forever, even though one of them is not used by 95% of
our developers. I realise that MSVC builds are likely to perform better,
but we have now got a situation where we are likely to have breakage on
a regular basis, ISTM.

(sorry to grumble - it's been a very frustrating 24 hours)

cheers

andrew

#12Dave Page
dpage@postgresql.org
In reply to: Andrew Dunstan (#11)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Not to my knowledge, but I have no method of testing what's going on,
and I hate guessing like this - in fact this is what has worried me all
along about supporting MSVC builds - we always said we didn't want to
have to have 2 build environments, but now we have two and we'll be
supporting them forever, even though one of them is not used by 95% of
our developers. I realise that MSVC builds are likely to perform better,
but we have now got a situation where we are likely to have breakage on
a regular basis, ISTM.

It's not just that they perform better - we also get a debugger that
actually works well (yes, I know newer gdb's apparently do work on
Mingw; but even a fully functional GDB doesn't come close to VC++), but
more importantly it's looking more and more like it'll be our only way
of producing a 64bit build for Windows.

(sorry to grumble - it's been a very frustrating 24 hours)

:-(

Regards, Dave.

#13Dave Page
dpage@postgresql.org
In reply to: Tom Lane (#10)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

So I now state fairly confidently that baiji is failing to overwrite
*any* of the installation tree, /share and /bin both, and instead is
testing an installation dating from sometime between May 1 and May 11.

Close. There was an Msys build from the 9th running on port 5432.

So, it seems there are a couple of issues here:

1) There appears to be no way to specify the default port number in the
MSVC build. The buildfarm passes it to configure for regular builds,
which obviously isn't run in VC++ mode, thus leaving the build on 5432.

2) VC++ and Msys builds will both happily start on the same port at the
same time. The first one to start listens on 5432 until it shuts down,
at which point the second server takes over seamlessly! It doesn't
matter which is started first - it's as if Windows is queuing up the
listens on the port.

Confusingly, similar behaviour is reproducible on XP Pro, except the
connection seems to go to the last server started, instead of the first!

Regards, Dave

#14Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Close. There was an Msys build from the 9th running on port 5432.

2) VC++ and Msys builds will both happily start on the same
port at the same time. The first one to start listens on 5432
until it shuts down, at which point the second server takes
over seamlessly! It doesn't matter which is started first -
it's as if Windows is queuing up the listens on the port.

Um, we explicitly set SO_REUSEADDR. So the port will happily allow a
second bind.

http://support.microsoft.com/kb/307175 quote:
"If you use SO_REUSEADDR to bind multiple servers to the same port at the
same time, only one random listening socket accepts a connection
request."

Andreas

#15Dave Page
dpage@postgresql.org
In reply to: Zeugswetter Andreas ADI SD (#14)
Re: What is happening on buildfarm member baiji?

Zeugswetter Andreas ADI SD wrote:

Close. There was an Msys build from the 9th running on port 5432.

2) VC++ and Msys builds will both happily start on the same
port at the same time. The first one to start listens on 5432
until it shuts down, at which point the second server takes
over seamlessly! It doesn't matter which is started first -
it's as if Windows is queuing up the listens on the port.

Um, we explicitly set SO_REUSEADDR. So the port will happily allow a
second bind.

So we do. I must confess I didn't look at the code, just spoke with
Magnus who agreed it didn't seem like it should be possible.

Regards, Dave

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Dave Page wrote:

Tom Lane wrote:

So I now state fairly confidently that baiji is failing to overwrite
*any* of the installation tree, /share and /bin both, and instead is
testing an installation dating from sometime between May 1 and May 11.

Close. There was an Msys build from the 9th running on port 5432.

So, it seems there are a couple of issues here:

1) There appears to be no way to specify the default port number in the
MSVC build. The buildfarm passes it to configure for regular builds,
which obviously isn't run in VC++ mode, thus leaving the build on 5432.

2) VC++ and Msys builds will both happily start on the same port at the
same time. The first one to start listens on 5432 until it shuts down,
at which point the second server takes over seamlessly! It doesn't
matter which is started first - it's as if Windows is queuing up the
listens on the port.

Confusingly, similar behaviour is reproducible on XP Pro, except the
connection seems to go to the last server started, instead of the first!

I'll look at the port mess.

Are you running 2 buildfarm members on the same machine? If so, you
should look at using the multi-root facility which is explicitly
designed to avoid clashes of this sort.

cheers

andrew

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Dave Page <dpage@postgresql.org> writes:

2) VC++ and Msys builds will both happily start on the same port at the
same time. The first one to start listens on 5432 until it shuts down,
at which point the second server takes over seamlessly!

Uh ... so the lock-file stuff is completely broken on Windows?

The SO_REUSEADDR flag is intentional --- without that, on many
platforms there would be a significant time delay needed between
stopping a postmaster and starting a new one. But our socket lock
file machinery ought to have detected the conflict.

regards, tom lane

#18Dave Page
dpage@postgresql.org
In reply to: Andrew Dunstan (#16)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

I'll look at the port mess.

Are you running 2 buildfarm members on the same machine? If so, you
should look at using the multi-root facility which is explicitly
designed to avoid clashes of this sort.

Yes, I've got VC++ and Mingw/Msys animals on each of two (virtual)
machines. Each is completely independent of each other - different
configs, different scripts, different ports, different directories etc.

Where can I find out about multi-root? I can't see anything in the
config file, or in PGBuildFarm-HOWTO.txt

Regards, Dave.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: What is happening on buildfarm member baiji?

I wrote:

Uh ... so the lock-file stuff is completely broken on Windows?

Not so much broken as commented out ... on looking at the code, it's
blindingly obvious that we don't even try to create a socket lock file
if not HAVE_UNIX_SOCKETS. Sigh.

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

regards, tom lane
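
For readers unfamiliar with the lock-file technique mentioned above, the following is a minimal sketch of the general idea on a POSIX system: create a file with O_CREAT | O_EXCL so that only one process can win, and record the owner's PID in it. The function name create_lock_file and the path used are hypothetical; this is not the server's actual lock-file code, which (as an assumption worth stating) would also need to detect and clean up stale lock files left by a crashed process.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to create an exclusive lock file at `path` and write our PID into it.
 * Returns 0 on success, or -1 with errno set (EEXIST when some other
 * process already holds the lock file).
 *
 * Simplified sketch: there is no stale-lock detection here.
 */
static int
create_lock_file(const char *path)
{
    char    buf[32];
    int     fd;
    int     len;

    /* O_EXCL makes the creation atomic, so only one caller can succeed. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    len = snprintf(buf, sizeof(buf), "%ld\n", (long) getpid());
    if (write(fd, buf, (size_t) len) != (ssize_t) len)
    {
        int     save_errno = errno;

        /* Do not leave a half-written lock file behind. */
        close(fd);
        unlink(path);
        errno = save_errno;
        return -1;
    }
    close(fd);
    return 0;
}
```

A second caller using the same path gets -1 with errno set to EEXIST, which is the conflict signal. Because the file is keyed on a path rather than on the TCP port itself, two servers configured with different socket directories would not collide on it, which is the risk described above.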

#20Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#19)
Re: What is happening on buildfarm member baiji?

On Mon, May 14, 2007 at 08:50:54AM -0400, Tom Lane wrote:

I wrote:

Uh ... so the lock-file stuff is completely broken on Windows?

Not so much broken as commented out ... on looking at the code, it's
blindingly obvious that we don't even try to create a socket lock file
if not HAVE_UNIX_SOCKETS. Sigh.

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Worth doing?

//Magnus
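
As a rough illustration of the proposal above, here is a minimal sketch of a per-port interlock built on a named Win32 mutex. The helper names interlock_name and acquire_port_interlock are hypothetical, and the object name simply follows the postgresql.interlock.<portnumber> pattern suggested in the message. On non-Windows builds the acquire function is a stub that always reports success, so only the name formatting is exercised there.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Format the per-port object name into buf.
 * Returns the name length, or -1 if it did not fit.
 */
static int
interlock_name(char *buf, size_t buflen, int port)
{
    int     n = snprintf(buf, buflen, "postgresql.interlock.%d", port);

    return (n > 0 && (size_t) n < buflen) ? n : -1;
}

/*
 * Returns 1 if this process now owns the interlock for `port`,
 * 0 if another process already created it, and -1 on error.
 * On non-Windows builds this is a stub that returns 1.
 */
static int
acquire_port_interlock(int port)
{
    char    name[64];

    if (interlock_name(name, sizeof(name), port) < 0)
        return -1;
#ifdef _WIN32
    {
        HANDLE  h = CreateMutexA(NULL, FALSE, name);

        if (h == NULL)
            return -1;
        if (GetLastError() == ERROR_ALREADY_EXISTS)
        {
            /* Someone else got there first; drop our extra handle. */
            CloseHandle(h);
            return 0;
        }
        /* Handle is deliberately kept open until the process exits. */
    }
#endif
    return 1;
}
```

The design relies on the kernel destroying a named object once the last handle to it is closed, which also happens when the owning process terminates, so no stale interlock should survive a crash.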

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#20)
Re: What is happening on buildfarm member baiji?

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

regards, tom lane

#22Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#19)
Re: What is happening on buildfarm member baiji?

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

I'm curious as to which Unix systems allow multiple processes to listen
on the same port at the same time.. On Linux, and I thought on most,
you get an EADDRINUSE on the listen() call (which the postmaster should
pick up on and bomb out, which it may already).

Thanks,

Stephen

#23Dave Page
dpage@postgresql.org
In reply to: Stephen Frost (#22)
Re: What is happening on buildfarm member baiji?

Stephen Frost wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

I'm curious as to which Unix systems allow multiple processes to listen
on the same port at the same time.. On Linux, and I thought on most,
you get an EADDRINUSE on the listen() call (which the postmaster should
pick up on and bomb out, which it may already).

Linux certainly does. Windows seems to treat SO_REUSEADDR in the same
way as SO_REUSEPORT which just seems wrong.

Regards, Dave.

#24Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#19)
Re: What is happening on buildfarm member baiji?

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I wrote:

Uh ... so the lock-file stuff is completely broken on Windows?

Not so much broken as commented out ... on looking at the code, it's
blindingly obvious that we don't even try to create a socket lock file
if not HAVE_UNIX_SOCKETS. Sigh.

Isn't the socket lock file only there to protect the socket?

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

stark@oxford:~/src/local-concurrent-psql/pgsql/src/bin/psql$ /usr/local/pgsql/bin/postgres -D /var/tmp/db2
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets

Is it possible the previous discussion related to servers with IPv6 where they
did manage to bind to one but not the other?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#25Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#21)
Re: What is happening on buildfarm member baiji?

On Mon, May 14, 2007 at 09:02:10AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

Yes.

//Magnus

#26Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#25)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

On Mon, May 14, 2007 at 09:02:10AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

Yes.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

cheers

andrew

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#23)
Re: What is happening on buildfarm member baiji?

Dave Page <dpage@postgresql.org> writes:

Stephen Frost wrote:

I'm curious as to which Unix systems allow multiple processes to listen
on the same port at the same time.. On Linux, and I thought on most,
you get an EADDRINUSE on the listen() call (which the postmaster should
pick up on and bomb out, which it may already).

Linux certainly does.

Mmm, you're right, I misread the man page:

Setting the SO_REUSEADDR option allows the local socket address to be
reused in subsequent calls to bind(). This permits multiple
SOCK_STREAM sockets to be bound to the same local address, as long as
all existing sockets with the desired local address are in a connected
state before bind() is called for a new socket.

The bit about "connected state" is relevant here --- a listening socket
isn't connected. Time for more caffeine.

Windows seems to treat SO_REUSEADDR in the same
way as SO_REUSEPORT which just seems wrong.

Well, Microsoft getting standards wrong is no surprise. So what do we
want to do about it?

regards, tom lane
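
The man page wording quoted above can be checked with a small experiment. The following is a minimal sketch for a POSIX system (the function names listen_with_reuseaddr and bound_port are hypothetical): it opens a listening TCP socket on the loopback address with SO_REUSEADDR set, then lets a caller try to open a second one on the same port.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP socket with SO_REUSEADDR set, bind it to 127.0.0.1:port
 * and start listening. If port is 0 the kernel picks a free port.
 * Returns the socket descriptor, or -1 with errno set.
 */
static int
listen_with_reuseaddr(unsigned short port)
{
    struct sockaddr_in addr;
    int     one = 1;
    int     save_errno;
    int     fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        goto fail;
    if (listen(fd, 5) < 0)
        goto fail;
    return fd;

fail:
    /* Preserve the original error across close(). */
    save_errno = errno;
    close(fd);
    errno = save_errno;
    return -1;
}

/* Return the local port a bound socket ended up on, or 0 on error. */
static unsigned short
bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *) &addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

Calling listen_with_reuseaddr(0), reading the port back with bound_port, and then calling listen_with_reuseaddr again with that port should, on Linux, make the second call fail with EADDRINUSE while the first socket is still listening. That matches the "connected state" reading: SO_REUSEADDR helps with leftover connections from a stopped server, not with a live listener.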

#28Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#27)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

Setting the SO_REUSEADDR option allows the local socket address to be
reused in subsequent calls to bind(). This permits multiple
SOCK_STREAM sockets to be bound to the same local address, as long as
all existing sockets with the desired local address are in a connected
state before bind() is called for a new socket.

The bit about "connected state" is relevant here --- a listening socket
isn't connected. Time for more caffeine.

That's what I thought it meant. I am glad to see that I am not quite as
out of date as I thought I must be reading upthread :-)

cheers

andrew

#29Dave Page
dpage@postgresql.org
In reply to: Tom Lane (#27)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

Windows seems to treat SO_REUSEADDR in the same
way as SO_REUSEPORT which just seems wrong.

Well, Microsoft getting standards wrong is no surprise. So what do we
want to do about it?

Microsoft did lift that code from BSD many moons ago, so it might be
worth checking if the bug actually originated there.

Assuming it didn't, then Magnus' idea sounds good to me.

Regards, Dave

#30Alvaro Herrera
alvherre@commandprompt.com
In reply to: Andrew Dunstan (#26)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Magnus Hagander wrote:

On Mon, May 14, 2007 at 09:02:10AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

Yes.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

Do you actually mean 8.2 here?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#26)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan <andrew@dunslane.net> writes:

Magnus Hagander wrote:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

regards, tom lane
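
A minimal sketch of what the "#ifndef WIN32" idea above might look like in isolation. The helper name set_reuseaddr_if_safe is hypothetical, and the code assumes the build defines WIN32 on Windows (the sketch maps the compiler's _WIN32 onto it). Whether skipping the option on Windows reintroduces a restart delay there is exactly the open question raised in the message, so this is an illustration of the guard, not a recommendation.

```c
#if defined(_WIN32) && !defined(WIN32)
#define WIN32
#endif

#ifdef WIN32
#include <winsock2.h>
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#endif

/*
 * Enable SO_REUSEADDR on a socket everywhere except Windows, where the
 * option reportedly lets a second server bind to a port that already has
 * a live listener. Returns 0 on success, -1 on failure.
 *
 * Assumption: the descriptor is passed as an int for brevity; real
 * Winsock code would use the SOCKET type.
 */
static int
set_reuseaddr_if_safe(int fd)
{
#ifndef WIN32
    int     one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                   (char *) &one, sizeof(one)) < 0)
        return -1;
#else
    (void) fd;                  /* intentionally left at the default */
#endif
    return 0;
}
```

On a Unix-like build the option ends up set, which keeps the quick-restart behaviour the flag was added for; on a Windows build the socket keeps its default, so the outcome there depends on how that platform treats rebinding after a recent shutdown.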

#32Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#31)
Re: What is happening on buildfarm member baiji?

On Mon, May 14, 2007 at 09:49:47AM -0400, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Magnus Hagander wrote:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

I think that at least used to happen on Windows in earlier versions.

//Magnus

#33Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#26)
1 attachment(s)
Re: What is happening on buildfarm member baiji?

On Mon, May 14, 2007 at 09:34:05AM -0400, Andrew Dunstan wrote:

Magnus Hagander wrote:

On Mon, May 14, 2007 at 09:02:10AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32-specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

Yes.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

Yes, see attached.

BTW, did you mean 8.2? One typical case where this could happen is in an
upgrade scenario, I think...

//Magnus

Attachments:

win32_interlock.patch (text/plain; charset=us-ascii)
Index: src/backend/libpq/pqcomm.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/libpq/pqcomm.c,v
retrieving revision 1.191
diff -c -r1.191 pqcomm.c
*** src/backend/libpq/pqcomm.c	3 Mar 2007 19:32:54 -0000	1.191
--- src/backend/libpq/pqcomm.c	14 May 2007 13:52:05 -0000
***************
*** 261,266 ****
--- 261,291 ----
  		snprintf(portNumberStr, sizeof(portNumberStr), "%d", portNumber);
  		service = portNumberStr;
  	}
+ #ifdef WIN32
+ 	/* Win32 doesn't have Unix sockets, but will allow multiple processes
+ 	 * to listen on the same port. This interlock is to prevent that.
+ 	 */
+ 	{
+ 		char mutexName[64];
+ 		HANDLE mutex;
+ 
+ 		sprintf(mutexName,"postgresql.interlock.%i", portNumber);
+ 		mutex = CreateMutex(NULL, FALSE, mutexName);
+ 		if (mutex == NULL)
+ 			ereport(FATAL,
+ 					(errmsg_internal("could not create interlocking mutex: %li",
+ 					GetLastError())));
+ 
+ 		if (GetLastError() == ERROR_ALREADY_EXISTS)
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_LOCK_FILE_EXISTS),
+ 					 errmsg("interlock mutex \"%s\" already exists", mutexName),
+ 					 errhint("Is another postgres listening on port %i", portNumber)));
+ 
+ 		/* Intentionally leak the handle until process exit, so the mutex
+ 		 * isn't freed. It will be automatically freed when the process exits. */
+ 	}
+ #endif
  
  	ret = pg_getaddrinfo_all(hostName, service, &hint, &addrs);
  	if (ret || !addrs)

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#32)
Re: What is happening on buildfarm member baiji?

Magnus Hagander <magnus@hagander.net> writes:

On Mon, May 14, 2007 at 09:49:47AM -0400, Tom Lane wrote:

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

I think that at least used to happen on Windows in earlier versions.

Well, we'd have to check the behavior of the proposed global object on
every supported Windows version too, so we might as well check the
simpler solution while we're at it.

regards, tom lane

#35Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#31)
Re: What is happening on buildfarm member baiji?

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

Well it's worth checking. But whereas Windows breaking our understanding of
what SO_REUSEADDR does doesn't actually violate any specification, not having
a TIME_WAIT state at all would certainly violate the TCP spec. So it's
somewhat unlikely that that's what they're doing. But anything's possible.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#35)
Re: What is happening on buildfarm member baiji?

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

Well it's worth checking. But whereas Windows breaking our understanding of
what SO_REUSEADDR does doesn't actually violate any specification, not having
a TIME_WAIT state at all would certainly violate the TCP spec. So it's
somewhat unlikely that that's what they're doing. But anything's possible.

This is not a behavior required by the TCP spec AFAICS. Also, in a
quick test neither Linux nor HPUX appear to need SO_REUSEADDR --- on
both, I can restart the postmaster immediately without it.

[ digs in CVS and archives for awhile... ] An interesting historical
point is that the SO_REUSEADDR call did not appear in the original
Berkeley Postgres95 sources. It was added in rev 1.2 of pqcomm.c,
for which the only comment is

Finished merging in src/backend from Dr. George's source tree

so the fact is that that code has undergone approximately 0 specific
peer review. I'm beginning to wonder if we really need it at all.
I thought I recalled us having discussed the need for it once, but I
cannot find any trace of such a discussion.

regards, tom lane

#37Aidan Van Dyk
aidan@highrise.ca
In reply to: Tom Lane (#36)
Re: What is happening on buildfarm member baiji?

* Tom Lane <tgl@sss.pgh.pa.us> [070514 10:24]:

This is not a behavior required by the TCP spec AFAICS. Also, in a
quick test neither Linux nor HPUX appear to need SO_REUSEADDR --- on
both, I can restart the postmaster immediately without it.

Did you have an active connection before restarting?

In HylaFAX, we had the same situation and went to using SO_REUSEADDR:
http://bugs.hylafax.org/show_bug.cgi?id=217

The problem appears if there *was* a connection, and the server was
stopped. Then the server can't bind again until the TIME_WAIT
connection goes away. Using SO_REUSEADDR allows the new server to
listen again right away.

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aidan Van Dyk (#37)
Re: What is happening on buildfarm member baiji?

Aidan Van Dyk <aidan@highrise.ca> writes:

* Tom Lane <tgl@sss.pgh.pa.us> [070514 10:24]:

This is not a behavior required by the TCP spec AFAICS. Also, in a
quick test neither Linux nor HPUX appear to need SO_REUSEADDR --- on
both, I can restart the postmaster immediately without it.

Did you have an active connection before restarting?
In HylaFAX, we had the same situation and went to using SO_REUSEADDR:
http://bugs.hylafax.org/show_bug.cgi?id=217

Um, you're right, I hadn't done the test properly. If I have an open
psql session across TCP and do pg_ctl stop -m fast, then I can't
start a new postmaster until the socket goes out of CLOSE_WAIT state.
Which, if I just leave the psql session sit there, seems to mean
"indefinitely" ... so it's even worse than just a TCP timeout.

So the notion of not using SO_REUSEADDR seems a nonstarter, and we
probably have to go with Magnus' global-object hack.

regards, tom lane
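
The restart problem described above can be illustrated with a minimal POSIX sketch. It is not PostgreSQL code: `open_listener` is a made-up name, the loopback address and the backlog of 16 are arbitrary choices, and error reporting is reduced to a return value. It only shows the ordering that matters, namely that SO_REUSEADDR has to be set before bind().

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Illustrative helper (hypothetical name, not from the PostgreSQL tree).
 * Opens a TCP listening socket on 127.0.0.1:port.  When reuse is nonzero,
 * SO_REUSEADDR is set before bind(), which on Unix-like systems lets a
 * restarted server bind even while connections left over from the
 * previous server are still winding down.
 * Returns the socket descriptor, or -1 on any failure.
 */
int
open_listener(unsigned short port, int reuse)
{
	struct sockaddr_in sa;
	int			fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	if (reuse)
	{
		int			one = 1;

		/* The option only affects bind() if it is set beforehand. */
		if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
		{
			close(fd);
			return -1;
		}
	}

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0 ||
		listen(fd, 16) < 0)
	{
		close(fd);
		return -1;
	}
	return fd;
}
```

Calling `open_listener(port, 0)` right after stopping a server that still had a client attached is the case that may fail with "address already in use" on the Unix platforms tested in this thread; passing a nonzero `reuse` is what avoids the wait.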

#39Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#36)
Re: What is happening on buildfarm member baiji?

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

What happens if we just "#ifndef WIN32" the setsockopt(SO_REUSEADDR)
call? I believe the reason that's in there is that some platforms will
reject bind() to a previously-used address for a TCP timeout delay after
a previous postmaster quit, but if that doesn't happen on Windows then
maybe all we need is to not set the option.

Well it's worth checking. But whereas Windows breaking our understanding of
what SO_REUSEADDR does doesn't actually violate any specification, not having
a TIME_WAIT state at all would certainly violate the TCP spec. So it's
somewhat unlikely that that's what they're doing. But anything's possible.

This is not a behavior required by the TCP spec AFAICS. Also, in a
quick test neither Linux nor HPUX appear to need SO_REUSEADDR --- on
both, I can restart the postmaster immediately without it.

It certainly is, observe on page 55 of RFC 793 for the "Open" call in the
example API:

TIME-WAIT STATE

Return "error: connection already exists".

so the fact is that that code has undergone approximately 0 specific
peer review. I'm beginning to wonder if we really need it at all.
I thought I recalled us having discussed the need for it once, but I
cannot find any trace of such a discussion.

It's certainly standard in Unix coding to have the server set SO_REUSEADDR and
the client not set it.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#33)
Re: What is happening on buildfarm member baiji?

Magnus Hagander <magnus@hagander.net> writes:

+ sprintf(mutexName,"postgresql.interlock.%i", portNumber);

That won't do; it should be legal for two postmasters to listen on
different IP addresses using the same port number. So you need to
include some representation of the IP address being bound to.

+ 		if (GetLastError() == ERROR_ALREADY_EXISTS)
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_LOCK_FILE_EXISTS),
+ 					 errmsg("interlock mutex \"%s\" already exists", mutexName),
+ 					 errhint("Is another postgres listening on port %i", portNumber)));

ereport(FATAL) is quite inappropriate here. Do the same thing that
bind() failure would do, ie, ereport(LOG) and continue the loop.
Also, you probably need to think about cleaning up the mutex in
case one of the later steps of socket-acquisition fails. We should
only be holding locks on addresses we've successfully bound.

regards, tom lane
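
One way to address the first objection is to fold the listen address into the object name. The following is only a sketch under assumptions: `build_interlock_name` is a hypothetical helper that does not appear in the posted patch, the `postgresql.interlock.<addr>.<port>` layout is invented here, and treating a missing address as "*" is an arbitrary convention.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the posted patch.
 * Builds an interlock name that covers both the listen address and the
 * port, so that two postmasters on the same port but different addresses
 * get different names.
 * Returns 0 on success, -1 if the name does not fit into buf.
 */
int
build_interlock_name(char *buf, size_t buflen, const char *addr, int port)
{
	int			n;

	/* Assumption: a NULL or empty address is written as "*". */
	if (addr == NULL || addr[0] == '\0')
		addr = "*";

	/* snprintf never overruns buf, unlike the sprintf in the patch. */
	n = snprintf(buf, buflen, "postgresql.interlock.%s.%d", addr, port);
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return 0;
}
```

On Windows the resulting string would be handed to CreateMutex(), and the handle would have to be closed again if a later socket-setup step fails for that address. Note that a plain name comparison like this does not detect a conflict between a wildcard listener and one bound to a specific address on the same port.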

#41Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#38)
Re: What is happening on buildfarm member baiji?

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Um, you're right, I hadn't done the test properly. If I have an open
psql session across TCP and do pg_ctl stop -m fast, then I can't
start a new postmaster until the socket goes out of CLOSE_WAIT state.
Which, if I just leave the psql session sit there, seems to mean
"indefinitely" ... so it's even worse than just a TCP timeout.

That's still not quite right. Are you running the client and server on the
same machine? Shutting down the server should put its connection in FIN_WAIT1
which would immediately go to FIN_WAIT2 if psql is still reachable. I think
the connection you're seeing in CLOSE_WAIT is the client's end of the
connection.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#41)
Re: What is happening on buildfarm member baiji?

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Um, you're right, I hadn't done the test properly. If I have an open
psql session across TCP and do pg_ctl stop -m fast, then I can't
start a new postmaster until the socket goes out of CLOSE_WAIT state.
Which, if I just leave the psql session sit there, seems to mean
"indefinitely" ... so it's even worse than just a TCP timeout.

That's still not quite right. Are you running the client and server on the
same machine?

Yeah. The behavior might well be different if they're on different
machines ... but it's moot in any case, since the point is that without
SO_REUSEADDR we have at least an exposure to a TCP-timeout delay before
being able to start a new postmaster.

regards, tom lane

#43Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#18)
Re: What is happening on buildfarm member baiji?

Dave Page wrote:

Where can I find out about multi-root? I can't see anything in the
config file, or in PGBuildFarm-HOWTO.txt

It's a hack I want to get rid of. It's a command-line option:

--multiroot = allow several members to use same build root

Of course, at least part of our problem is that the MSVC build is not
honoring port settings at all (and buildfarm isn't setting the port for
MSVC anyway). Magnus and I will work on that - it's a serious deficiency.

(refrains from whining again about 2 build systems)

cheers

andrew

#44Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

On 2007-05-14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aidan Van Dyk <aidan@highrise.ca> writes:

* Tom Lane <tgl@sss.pgh.pa.us> [070514 10:24]:

This is not a behavior required by the TCP spec AFAICS. Also, in a
quick test neither Linux nor HPUX appear to need SO_REUSEADDR --- on
both, I can restart the postmaster immediately without it.

Did you have an active connection before restarting?
In HylaFAX, we had the same situation and went to using SO_REUSEADDR:
http://bugs.hylafax.org/show_bug.cgi?id=217

Um, you're right, I hadn't done the test properly. If I have an open
psql session across TCP and do pg_ctl stop -m fast, then I can't
start a new postmaster until the socket goes out of CLOSE_WAIT state.
Which, if I just leave the psql session sit there, seems to mean
"indefinitely" ... so it's even worse than just a TCP timeout.

SO_REUSEADDR is required in all cases where you bind a listening socket
to a specific port number. There are no exceptions to this rule.

This is an artifact of the Berkeley Sockets interface design, not something
inherent in the TCP spec. It arises because the sockets interface separates
the bind() and listen()/connect() calls; if you replace bind/listen/connect
with a single system call, then SO_REUSEADDR becomes unnecessary. (The
behaviour of bind() needs to be different depending on whether it will be
followed by listen() or connect(); this was not well understood by the
original designers of the API, hence the use of SO_REUSEADDR as a klugy
way of saying "I'm going to use listen() on this socket after the bind".)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#45Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#43)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Dave Page wrote:

Where can I find out about multi-root? I can't see anything in the
config file, or in PGBuildFarm-HOWTO.txt

It's a hack I want to get rid of. It's a command-line option:

--multiroot = allow several members to use same build root

I have in fact just removed this in buildfarm CVS tip. That means that
you can now run as many buildfarm members as you like against a single
buildroot and they will not trip over each other.

We still have the MSVC port problem to fix though.

cheers

andrew

#46Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Dave Page wrote:

1) There appears to be no way to specify the default port number in the
MSVC build. The buildfarm passes it to configure for regular builds,
which obviously isn't run in VC++ mode, thus leaving the build on 5432.

I have committed fixes to both pgsql and buildfarm that should in
combination cure this, I hope. Please test - there might still be loose
ends hanging around.

cheers

andrew

#47Dave Page
dpage@postgresql.org
In reply to: Andrew Dunstan (#46)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

Dave Page wrote:

1) There appears to be no way to specify the default port number in the
MSVC build. The buildfarm passes it to configure for regular builds,
which obviously isn't run in VC++ mode, thus leaving the build on 5432.

I have committed fixes to both pgsql and buildfarm that should in
combination cure this, I hope. Please test - there might still be loose
ends hanging around.

OK, thanks.

Regards, Dave.

#48Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#33)
Re: What is happening on buildfarm member baiji?

Are we going to apply this? I would also like to see a comment added on
why we use SO_REUSEADDR.

---------------------------------------------------------------------------

Magnus Hagander wrote:

On Mon, May 14, 2007 at 09:34:05AM -0400, Andrew Dunstan wrote:

Magnus Hagander wrote:

On Mon, May 14, 2007 at 09:02:10AM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

If all we want to do is add a check that prevents two servers from starting on
the same port, we could do that trivially in a win32 specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Does it go away automatically on postmaster crash?

Yes.

Then I think it's worth adding, and I'd argue that as a low risk safety
measure we should allow it to sneak into 8.3. I'm assuming the code
involved will be quite small.

Yes, see attached.

BTW, did you mean 8.2? One typical case where this could happen is in an
upgrade scenario, I think...

//Magnus

[ Attachment, skipping... ]

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#48)
Re: What is happening on buildfarm member baiji?

Bruce Momjian <bruce@momjian.us> writes:

Are we going to apply this?

Not in the form submitted so far, but I trust Magnus is working on
fixing it.

regards, tom lane

#50Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#49)
Re: What is happening on buildfarm member baiji?

On Thu, May 17, 2007 at 07:51:34PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Are we going to apply this?

Not in the form submitted so far, but I trust Magnus is working on
fixing it.

I am. Most likely won't have time to look at it properly until after pgcon,
though.

//Magnus

#51Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#40)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

+ sprintf(mutexName,"postgresql.interlock.%i", portNumber);

That won't do; it should be legal for two postmasters to listen on
different IP addresses using the same port number. So you need to
include some representation of the IP address being bound to.

+ 		if (GetLastError() == ERROR_ALREADY_EXISTS)
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_LOCK_FILE_EXISTS),
+ 					 errmsg("interlock mutex \"%s\" already exists", mutexName),
+ 					 errhint("Is another postgres listening on port %i", portNumber)));

ereport(FATAL) is quite inappropriate here. Do the same thing that
bind() failure would do, ie, ereport(LOG) and continue the loop.
Also, you probably need to think about cleaning up the mutex in
case one of the later steps of socket-acquisition fails. We should
only be holding locks on addresses we've successfully bound.

I've done some further research on this on Win32, and I've come up with
the following:

If I set the flag SO_EXCLUSIVEADDRUSE, I get the same behavior as on
Unix: I can only create one postmaster at a time on the same addr/port,
and if I close the backend with a psql session running, I can't create a
new one until a timeout has passed.

However, if I just *skip* setting SO_REUSEADDR completely, things seem
to work the way we want them to. I cannot start more than one postmaster on
the same addr/port. If I start a psql, then terminate postmaster, I can
restart a new postmaster right away.

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

(A fairly good reference to read up on the options is at
http://msdn2.microsoft.com/en-us/library/ms740621.aspx - which
specifically talks about the issue seen on Unix as appearing with the
SO_EXCLUSIVEADDRUSE parameter, which agrees with my test results)

//Magnus
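
The proposal amounts to a compile-time guard around the option. A minimal sketch, assuming a helper named `set_listen_reuseaddr` (hypothetical, not taken from pqcomm.c) and an `int` socket descriptor, which is a simplification for Windows where sockets have their own type:

```c
#include <sys/types.h>
#ifndef WIN32
#include <sys/socket.h>
#endif

/*
 * Hypothetical helper sketching the "#ifdef out SO_REUSEADDR on win32"
 * idea.  Returns 1 if the option was set, 0 if it was deliberately
 * skipped (WIN32 builds), and -1 if setsockopt() failed.
 */
int
set_listen_reuseaddr(int fd)
{
#ifndef WIN32
	int			one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
				   (char *) &one, sizeof(one)) < 0)
		return -1;
	return 1;
#else
	/* On Windows the option is left unset, per the test results above. */
	(void) fd;
	return 0;
#endif
}
```

The caller would invoke this between socket() and bind(); whether skipping the option behaves the same on every supported Windows version is exactly what the test results in this message are meant to establish.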

#52Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#51)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

However, if I just *skip* setting SO_REUSEADDR completely, things seem
to work the way we want them to. I cannot start more than one postmaster on
the same addr/port. If I start a psql, then terminate postmaster, I can
restart a new postmaster right away.

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

Is that true even if the backend crashes?

cheers

andrew

#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#52)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan <andrew@dunslane.net> writes:

Magnus Hagander wrote:

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

Is that true even if the backend crashes?

It would take a postmaster crash to make this an issue, and those are
pretty doggone rare. Not that the question shouldn't be checked, but
we might decide to tolerate the problem if there is one ...

regards, tom lane

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#51)
Re: What is happening on buildfarm member baiji?

Magnus Hagander <magnus@hagander.net> writes:

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

(A fairly good reference to read up on the options is at
http://msdn2.microsoft.com/en-us/library/ms740621.aspx

Hmm ... if accurate, that page says in words barely longer than one
syllable that Microsoft entirely misunderstands the intended meaning
of SO_REUSEADDR.

It looks like SO_EXCLUSIVEADDRUSE might be a bit closer to the standard
semantics; should we use that instead on Windoze?

regards, tom lane

#55Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#54)
Re: What is happening on buildfarm member baiji?

On Sun, Jun 03, 2007 at 11:29:33PM -0400, Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

(A fairly good reference to read up on the options is at
http://msdn2.microsoft.com/en-us/library/ms740621.aspx

Hmm ... if accurate, that page says in words barely longer than one
syllable that Microsoft entirely misunderstands the intended meaning
of SO_REUSEADDR.

Yes, that's how I read it as well.

It looks like SO_EXCLUSIVEADDRUSE might be a bit closer to the standard
semantics; should we use that instead on Windoze?

I think you're reading something wrong. The way I read it,
SO_EXCLUSIVEADDRUSE gives us pretty much the same behavior we have on Unix
*without* SO_REUSEADDR. There's a paragraph specifically talking about the
problem of restarting a server having to wait for a timeout when using this
switch.

//Magnus

#56Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#53)
Re: What is happening on buildfarm member baiji?

On Sun, Jun 03, 2007 at 10:44:13PM -0400, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Magnus Hagander wrote:

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.
Anybody see a problem with this?

Is that true even if the backend crashes?

It would take a postmaster crash to make this an issue, and those are
pretty doggone rare. Not that the question shouldn't be checked, but
we might decide to tolerate the problem if there is one ...

The closest I can get is a kill -9 on postmaster, and that does work. I
can't start a new postmaster while the old backend is running - because of
the shared memory detection stuff. But the second it's gone I can start a
new one, so it doesn't have that wait-until-timeout behavior.

Since that's expected behavior and there were no other complaints, I think
I'll go ahead and put this one in later today.

//Magnus

#57Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Magnus Hagander (#55)
Re: What is happening on buildfarm member baiji?

Given this, I propose we simply #ifdef out the SO_REUSEADDR on win32.

I agree that this is what we should do.

(A fairly good reference to read up on the options is at
http://msdn2.microsoft.com/en-us/library/ms740621.aspx

Hmm ... if accurate, that page says in words barely longer than one
syllable that Microsoft entirely misunderstands the intended meaning
of SO_REUSEADDR.

Yes, that's how I read it as well.

It looks like SO_EXCLUSIVEADDRUSE might be a bit closer to the
standard semantics; should we use that instead on Windoze?

I think you're reading something wrong. The way I read it,
SO_EXCLUSIVEADDRUSE gives us pretty much the same behavior we have on Unix
*without* SO_REUSEADDR. There's a paragraph specifically
talking about the problem of restarting a server having to
wait for a timeout when using this switch.

Yup, that switch is no good either.

Andreas