What is happening on buildfarm member baiji?

Started by Tom Lane | almost 19 years ago | 57 messages | hackers
#1Tom Lane
tgl@sss.pgh.pa.us

The last two runs on baiji have failed at the installcheck stage,
with symptoms that look a heck of a lot like the most recent system
catalog changes haven't taken effect (eg, it doesn't seem to know
about pg_type.typarray). Given that the previous "check" step
passed, the most likely explanation seems to be that some part
of the "install" step failed --- I've not tried to reproduce the
behavior but it looks like it might be explained if the install
target's postgres.bki file was not getting overwritten. So we
have two issues: what exactly is going wrong (some new form of
Vista brain death no doubt), and why isn't the buildfarm script
noticing?

regards, tom lane

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

> The last two runs on baiji have failed at the installcheck stage,
> with symptoms that look a heck of a lot like the most recent system
> catalog changes haven't taken effect (eg, it doesn't seem to know
> about pg_type.typarray). Given that the previous "check" step
> passed, the most likely explanation seems to be that some part
> of the "install" step failed --- I've not tried to reproduce the
> behavior but it looks like it might be explained if the install
> target's postgres.bki file was not getting overwritten. So we
> have two issues: what exactly is going wrong (some new form of
> Vista brain death no doubt), and why isn't the buildfarm script
> noticing?

The script will not even run if the install directory exists:

die "$buildroot/$branch has $pgsql or inst directories!"
if ((!$from_source && -d $pgsql) || -d "inst");

But the install process is different for MSVC. It could be that we are
screwing up there.

I no longer have an MSVC box, so I can't tell so easily ;-(

cheers

andrew

#3Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#2)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> Tom Lane wrote:

>> The last two runs on baiji have failed at the installcheck stage,
>> with symptoms that look a heck of a lot like the most recent system
>> catalog changes haven't taken effect (eg, it doesn't seem to know
>> about pg_type.typarray). Given that the previous "check" step
>> passed, the most likely explanation seems to be that some part
>> of the "install" step failed --- I've not tried to reproduce the
>> behavior but it looks like it might be explained if the install
>> target's postgres.bki file was not getting overwritten. So we
>> have two issues: what exactly is going wrong (some new form of
>> Vista brain death no doubt), and why isn't the buildfarm script
>> noticing?

> The script will not even run if the install directory exists:

> die "$buildroot/$branch has $pgsql or inst directories!"
> if ((!$from_source && -d $pgsql) || -d "inst");

> But the install process is different for MSVC. It could be that we are
> screwing up there.

Uh, but that piece of code you're referring to is from the buildfarm
code, right? Isn't it the same?

> I no longer have an MSVC box, so I can't tell so easily ;-(

Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
example - skylark fails on something completely different, not fully
investigated yet, but looks to be a buildfarm problem rather than a
backend one), so I don't think it's the MSVC procedure alone that's the
cause of it.

//Magnus

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#3)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

> Andrew Dunstan wrote:

>> Tom Lane wrote:

>>> The last two runs on baiji have failed at the installcheck stage,
>>> with symptoms that look a heck of a lot like the most recent system
>>> catalog changes haven't taken effect (eg, it doesn't seem to know
>>> about pg_type.typarray). Given that the previous "check" step
>>> passed, the most likely explanation seems to be that some part
>>> of the "install" step failed --- I've not tried to reproduce the
>>> behavior but it looks like it might be explained if the install
>>> target's postgres.bki file was not getting overwritten. So we
>>> have two issues: what exactly is going wrong (some new form of
>>> Vista brain death no doubt), and why isn't the buildfarm script
>>> noticing?

>> The script will not even run if the install directory exists:

>> die "$buildroot/$branch has $pgsql or inst directories!"
>> if ((!$from_source && -d $pgsql) || -d "inst");

>> But the install process is different for MSVC. It could be that we are
>> screwing up there.

> Uh, but that piece of code you're referring to is from the buildfarm
> code, right? Isn't it the same?

Yes, but it might be that the MSVC install doesn't actually use that
location properly. Unfortunately, its logging is less than verbose, unlike
the standard install procedure.

>> I no longer have an MSVC box, so I can't tell so easily ;-(

> Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
> example - skylark fails on something completely different, not fully
> investigated yet, but looks to be a buildfarm problem rather than a
> backend one), so I don't think it's the MSVC procedure alone that's the
> cause of it.

Possibly. My point was that I can't even investigate how MSVC is working
at all.

cheers

andrew

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

>> My point was that I can't even investigate how MSVC is working
>> at all.

> So what is it you're looking for, specifically, to help with that?

As a very bare minimum, we need to change the installation procedure to
log its destination.

Unless that has somehow got screwed up I can't see how Tom's theory of a
possibly left over .bki file can stand up.

cheers

andrew

#6Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#4)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> Magnus Hagander wrote:

>> Andrew Dunstan wrote:

>>> Tom Lane wrote:

>>>> The last two runs on baiji have failed at the installcheck stage,
>>>> with symptoms that look a heck of a lot like the most recent system
>>>> catalog changes haven't taken effect (eg, it doesn't seem to know
>>>> about pg_type.typarray). Given that the previous "check" step
>>>> passed, the most likely explanation seems to be that some part
>>>> of the "install" step failed --- I've not tried to reproduce the
>>>> behavior but it looks like it might be explained if the install
>>>> target's postgres.bki file was not getting overwritten. So we
>>>> have two issues: what exactly is going wrong (some new form of
>>>> Vista brain death no doubt), and why isn't the buildfarm script
>>>> noticing?

>>> The script will not even run if the install directory exists:

>>> die "$buildroot/$branch has $pgsql or inst directories!"
>>> if ((!$from_source && -d $pgsql) || -d "inst");

>>> But the install process is different for MSVC. It could be that we are
>>> screwing up there.

>> Uh, but that piece of code you're referring to is from the buildfarm
>> code, right? Isn't it the same?

> Yes, but it might be that the MSVC install doesn't actually use that
> location properly. Unfortunately, its logging is less than verbose, unlike
> the standard install procedure.

>>> I no longer have an MSVC box, so I can't tell so easily ;-(

>> Non-Vista MSVC boxes seem to pass fine (mastodon and skylark, for
>> example - skylark fails on something completely different, not fully
>> investigated yet, but looks to be a buildfarm problem rather than a
>> backend one), so I don't think it's the MSVC procedure alone that's the
>> cause of it.

> Possibly. My point was that I can't even investigate how MSVC is working
> at all.

So what is it you're looking for, specifically, to help with that?

//Magnus

#7Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#5)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> Magnus Hagander wrote:

>>> My point was that I can't even investigate how MSVC is working
>>> at all.

>> So what is it you're looking for, specifically, to help with that?

> As a very bare minimum, we need to change the installation procedure to
> log its destination.

> Unless that has somehow got screwed up I can't see how Tom's theory of a
> possibly left over .bki file can stand up.

Just to be clear, are you looking for something as simple as this?

Index: Install.pm
===================================================================
RCS file: /cvsroot/pgsql/src/tools/msvc/Install.pm,v
retrieving revision 1.14
diff -c -r1.14 Install.pm
*** Install.pm  25 Apr 2007 19:00:05 -0000      1.14
--- Install.pm  13 May 2007 15:21:51 -0000
***************
*** 35,41 ****
          $conf = "release";
      }
      die "Could not find debug or release binaries" if ($conf eq "");
!     print "Installing for $conf\n";
      EnsureDirectories($target,
'bin','lib','share','share/timezonesets','share/contrib','doc',
          'doc/contrib', 'symbols');
--- 35,41 ----
          $conf = "release";
      }
      die "Could not find debug or release binaries" if ($conf eq "");
!     print "Installing for $conf in $target\n";
      EnsureDirectories($target,
'bin','lib','share','share/timezonesets','share/contrib','doc',
          'doc/contrib', 'symbols');

//Magnus

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#7)
Re: What is happening on buildfarm member baiji?

Magnus Hagander wrote:

> ! print "Installing for $conf in $target\n";

Looks like a good place to start, sure.

cheers

andrew

#9Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#8)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> Magnus Hagander wrote:

>> ! print "Installing for $conf in $target\n";

> Looks like a good place to start, sure.

Ok. Applied.

//Magnus

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: What is happening on buildfarm member baiji?

"Andrew Dunstan" <andrew@dunslane.net> writes:

> Unless that has somehow got screwed up I can't see how Tom's theory of a
> possibly left over .bki file can stand up.

Well, I tried inserting a .bki file from April 30 into a HEAD
installation, and that made it dump core during bootstrap, so that
offhand theory was wrong.

However, when I run the HEAD regression tests against that entire
April 30 installation tree, I can duplicate the baiji regression diffs
almost exactly --- the polymorphism test fails for me where it succeeds
on baiji, which I think indicates that baiji has the patch I applied on
May 1 for SQL function inlining.

So I now state fairly confidently that baiji is failing to overwrite
*any* of the installation tree, /share and /bin both, and instead is
testing an installation dating from sometime between May 1 and May 11.
Have there been any recent changes in either the buildfarm script or
the MSVC install code that might have changed where the install is
supposed to go?

regards, tom lane

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

> "Andrew Dunstan" <andrew@dunslane.net> writes:

>> Unless that has somehow got screwed up I can't see how Tom's theory of a
>> possibly left over .bki file can stand up.

> Well, I tried inserting a .bki file from April 30 into a HEAD
> installation, and that made it dump core during bootstrap, so that
> offhand theory was wrong.

> However, when I run the HEAD regression tests against that entire
> April 30 installation tree, I can duplicate the baiji regression diffs
> almost exactly --- the polymorphism test fails for me where it succeeds
> on baiji, which I think indicates that baiji has the patch I applied on
> May 1 for SQL function inlining.

> So I now state fairly confidently that baiji is failing to overwrite
> *any* of the installation tree, /share and /bin both, and instead is
> testing an installation dating from sometime between May 1 and May 11.
> Have there been any recent changes in either the buildfarm script or
> the MSVC install code that might have changed where the install is
> supposed to go?

Not to my knowledge, but I have no method of testing what's going on,
and I hate guessing like this - in fact this is what has worried me all
along about supporting MSVC builds - we always said we didn't want to
have to have 2 build environments, but now we have two and we'll be
supporting them forever, even though one of them is not used by 95% of
our developers. I realise that MSVC builds are likely to perform better,
but we have now got a situation where we are likely to have breakage on
a regular basis, ISTM.

(sorry to grumble - it's been a very frustrating 24 hours)

cheers

andrew

#12Dave Page
dpage@pgadmin.org
In reply to: Andrew Dunstan (#11)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> Not to my knowledge, but I have no method of testing what's going on,
> and I hate guessing like this - in fact this is what has worried me all
> along about supporting MSVC builds - we always said we didn't want to
> have to have 2 build environments, but now we have two and we'll be
> supporting them forever, even though one of them is not used by 95% of
> our developers. I realise that MSVC builds are likely to perform better,
> but we have now got a situation where we are likely to have breakage on
> a regular basis, ISTM.

It's not just that they perform better - we also get a debugger that
actually works well (yes, I know newer gdb's apparently do work on
Mingw; but even a fully functional GDB doesn't come close to VC++), but
more importantly it's looking more and more like it'll be our only way
of producing a 64bit build for Windows.

> (sorry to grumble - it's been a very frustrating 24 hours)

:-(

Regards, Dave.

#13Dave Page
dpage@pgadmin.org
In reply to: Tom Lane (#10)
Re: What is happening on buildfarm member baiji?

Tom Lane wrote:

> So I now state fairly confidently that baiji is failing to overwrite
> *any* of the installation tree, /share and /bin both, and instead is
> testing an installation dating from sometime between May 1 and May 11.

Close. There was an Msys build from the 9th running on port 5432.

So, it seems there are a couple of issues here:

1) There appears to be no way to specify the default port number in the
MSVC build. The buildfarm passes it to configure for regular builds,
which obviously isn't run in VC++ mode, thus leaving the build on 5432.

2) VC++ and Msys builds will both happily start on the same port at the
same time. The first one to start listens on 5432 until it shuts down,
at which point the second server takes over seamlessly! It doesn't
matter which is started first - it's as if Windows is queuing up the
listens on the port.

Confusingly, similar behaviour is reproducible on XP Pro, except the
connection seems to go to the last server started, instead of the first!

Regards, Dave

#14Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

> Close. There was an Msys build from the 9th running on port 5432.

> 2) VC++ and Msys builds will both happily start on the same
> port at the same time. The first one to start listens on 5432
> until it shuts down, at which point the second server takes
> over seamlessly! It doesn't matter which is started first -
> it's as if Windows is queuing up the listens on the port.

Um, we explicitly set SO_REUSEADDR. So the port will happily allow a
second bind.

http://support.microsoft.com/kb/307175 quote:
"If you use SO_REUSEADDR to bind multiple servers to the same port at the
same time, only one random listening socket accepts a connection
request."

Andreas
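
An aside on the SO_REUSEADDR point in #14: the following is a minimal Python sketch, not PostgreSQL code, that probes whether the local OS lets a second listener bind a port already in use when both sockets set SO_REUSEADDR. The function name `can_double_bind` is hypothetical. The outcome is platform-dependent (the thread reports that Windows permits the second bind), so the sketch reports what it observes rather than assuming a result.

```python
import socket


def can_double_bind(host="127.0.0.1"):
    """Return True if a second listener can bind a port that is already in use.

    Both sockets set SO_REUSEADDR, mirroring the situation described in the
    thread. Whether the second bind succeeds depends on the operating system.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    first.bind((host, 0))  # port 0: let the OS pick a free port
    first.listen(1)
    port = first.getsockname()[1]

    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        second.bind((host, port))
        second.listen(1)
        return True
    except OSError:
        # The OS refused the duplicate bind (or the duplicate listen).
        return False
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print("second bind allowed:", can_double_bind())
```

If this prints True on a given machine, two servers configured for the same port can both start there, and some other interlock is needed to catch the conflict.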

#15Dave Page
dpage@pgadmin.org
In reply to: Zeugswetter Andreas SB SD (#14)
Re: What is happening on buildfarm member baiji?

Zeugswetter Andreas ADI SD wrote:

>> Close. There was an Msys build from the 9th running on port 5432.

>> 2) VC++ and Msys builds will both happily start on the same
>> port at the same time. The first one to start listens on 5432
>> until it shuts down, at which point the second server takes
>> over seamlessly! It doesn't matter which is started first -
>> it's as if Windows is queuing up the listens on the port.

> Um, we explicitly set SO_REUSEADDR. So the port will happily allow a
> second bind.

So we do. I must confess I didn't look at the code, just spoke with
Magnus who agreed it didn't seem like it should be possible.

Regards, Dave

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Dave Page wrote:

> Tom Lane wrote:

>> So I now state fairly confidently that baiji is failing to overwrite
>> *any* of the installation tree, /share and /bin both, and instead is
>> testing an installation dating from sometime between May 1 and May 11.

> Close. There was an Msys build from the 9th running on port 5432.

> So, it seems there are a couple of issues here:

> 1) There appears to be no way to specify the default port number in the
> MSVC build. The buildfarm passes it to configure for regular builds,
> which obviously isn't run in VC++ mode, thus leaving the build on 5432.

> 2) VC++ and Msys builds will both happily start on the same port at the
> same time. The first one to start listens on 5432 until it shuts down,
> at which point the second server takes over seamlessly! It doesn't
> matter which is started first - it's as if Windows is queuing up the
> listens on the port.

> Confusingly, similar behaviour is reproducible on XP Pro, except the
> connection seems to go to the last server started, instead of the first!

I'll look at the port mess.

Are you running 2 buildfarm members on the same machine? If so, you
should look at using the multi-root facility which is explicitly
designed to avoid clashes of this sort.

cheers

andrew

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#13)
Re: What is happening on buildfarm member baiji?

Dave Page <dpage@postgresql.org> writes:

> 2) VC++ and Msys builds will both happily start on the same port at the
> same time. The first one to start listens on 5432 until it shuts down,
> at which point the second server takes over seamlessly!

Uh ... so the lock-file stuff is completely broken on Windows?

The SO_REUSEADDR flag is intentional --- without that, on many
platforms there would be a significant time delay needed between
stopping a postmaster and starting a new one. But our socket lock
file machinery ought to have detected the conflict.

regards, tom lane

#18Dave Page
dpage@pgadmin.org
In reply to: Andrew Dunstan (#16)
Re: What is happening on buildfarm member baiji?

Andrew Dunstan wrote:

> I'll look at the port mess.

> Are you running 2 buildfarm members on the same machine? If so, you
> should look at using the multi-root facility which is explicitly
> designed to avoid clashes of this sort.

Yes, I've got VC++ and Mingw/Msys animals on each of two (virtual)
machines. Each is completely independent of each other - different
configs, different scripts, different ports, different directories etc.

Where can I find out about multi-root? I can't see anything in the
config file, or in PGBuildFarm-HOWTO.txt

Regards, Dave.

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: What is happening on buildfarm member baiji?

I wrote:

> Uh ... so the lock-file stuff is completely broken on Windows?

Not so much broken as commented out ... on looking at the code, it's
blindingly obvious that we don't even try to create a socket lock file
if not HAVE_UNIX_SOCKETS. Sigh.

There is a related risk even on Unix machines: two postmasters can be
started on the same port number if they have different settings of
unix_socket_directory, and then it's indeterminate which one you will
contact if you connect to the TCP port. I seem to recall that we
discussed this several years ago, and didn't really find a satisfactory
way of interlocking the TCP port per se.

regards, tom lane
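
To make the lock-file idea in #19 concrete, here is a minimal Python sketch of a per-port interlock built on atomic file creation. It is an illustration only, not how PostgreSQL implements its socket lock file: the names `acquire_port_lock` and `release_port_lock` and the `port.<n>.lock` file name are hypothetical, and stale locks left by a crashed process are not handled.

```python
import os
import tempfile


def acquire_port_lock(port, lock_dir):
    """Try to claim `port` by creating a lock file in `lock_dir`.

    Returns the lock file path on success, or None if another process
    already holds the lock for this port.
    """
    path = os.path.join(lock_dir, f"port.{port}.lock")  # hypothetical name
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")  # record the owner for diagnostics
    return path


def release_port_lock(path):
    """Give up a lock previously returned by acquire_port_lock."""
    os.remove(path)


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    first = acquire_port_lock(5432, lock_dir)
    second = acquire_port_lock(5432, lock_dir)
    print("first server got lock:", first is not None)
    print("second server got lock:", second is not None)
    release_port_lock(first)
```

The interlock only works when both servers look in the same `lock_dir`, which is the weakness noted above: two postmasters with different socket directories never see each other's lock file.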

#20Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#19)
Re: What is happening on buildfarm member baiji?

On Mon, May 14, 2007 at 08:50:54AM -0400, Tom Lane wrote:

> I wrote:

>> Uh ... so the lock-file stuff is completely broken on Windows?

> Not so much broken as commented out ... on looking at the code, it's
> blindingly obvious that we don't even try to create a socket lock file
> if not HAVE_UNIX_SOCKETS. Sigh.

> There is a related risk even on Unix machines: two postmasters can be
> started on the same port number if they have different settings of
> unix_socket_directory, and then it's indeterminate which one you will
> contact if you connect to the TCP port. I seem to recall that we
> discussed this several years ago, and didn't really find a satisfactory
> way of interlocking the TCP port per se.

If all we want to do is add a check that prevents two servers from starting
on the same port, we could do that trivially in a win32 specific way (since
we'll never have unix sockets there). Just create an object in the global
namespace named postgresql.interlock.<portnumber> or such a thing.

Worth doing?

//Magnus

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#20)
#22Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#19)
#23Dave Page
dpage@pgadmin.org
In reply to: Stephen Frost (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
#25Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#21)
#26Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#23)
#28Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#27)
#29Dave Page
dpage@pgadmin.org
In reply to: Tom Lane (#27)
#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#26)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#26)
#32Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#31)
#33Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#26)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#32)
#35Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#35)
#37Aidan Van Dyk
aidan@highrise.ca
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aidan Van Dyk (#37)
#39Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#36)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#33)
#41Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#38)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#41)
#43Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#18)
#44Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Dave Page (#13)
#45Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#43)
#46Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#13)
#47Dave Page
dpage@pgadmin.org
In reply to: Andrew Dunstan (#46)
#48Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#33)
#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#48)
#50Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#49)
#51Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#40)
#52Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#52)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#51)
#55Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#54)
#56Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#53)
#57Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Magnus Hagander (#55)