Oops - BF:Mastodon just died

Started by Dave Page almost 18 years ago, 23 messages
#2Magnus Hagander
magnus@hagander.net
In reply to: Dave Page (#1)
Re: Oops - BF:Mastodon just died

Dave Page wrote:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00

Maybe I shouldn't have had those beers after work today, but that looks
like it's for example failing tsearch2, which hasn't been touched for
over a month!

Any chance there's something dodgy in the build env?

(If I'm missing the obvious, I blame the beer!)

//Magnus

#3Dave Page
dpage@postgresql.org
In reply to: Magnus Hagander (#2)
Re: Oops - BF:Mastodon just died

On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

Dave Page wrote:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00

Maybe I shouldn't have had those beers after work today, but that looks
like it's for example failing tsearch2, which hasn't been touched for
over a month!

Any chance there's something dodgy in the build env?

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update, or
some failing hardware.

/D

#4Magnus Hagander
magnus@hagander.net
In reply to: Dave Page (#3)
Re: Oops - BF:Mastodon just died

Dave Page wrote:

On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

Dave Page wrote:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00

Maybe I shouldn't have had those beers after work today, but that looks
like it's for example failing tsearch2, which hasn't been touched for
over a month!

Any chance there's something dodgy in the build env?

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update, or
some failing hardware.

I won't have access to my MSVC box until tomorrow, but unless beaten to
it I can dig into it a bit more. I don't see anything obvious in the
latest patches though (but again, that could be the beer :-P).

Any chance you could just do a forced run on it now to show if it was
some kind of transient stuff?

//Magnus

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Dave Page (#3)
Re: Oops - BF:Mastodon just died

Dave Page wrote:

On Jan 30, 2008 9:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

Dave Page wrote:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mastodon&dt=2008-01-30%2020:00:00

Maybe I shouldn't have had those beers after work today, but that looks
like it's for example failing tsearch2, which hasn't been touched for
over a month!

Any chance there's something dodgy in the build env?

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update, or
some failing hardware.

None of the CVS changes in the relevant period seems to have any
relation to the errors, so I suspect a local problem.

red_bat is due to build in a couple of hours, so we will soon see if it
reproduces the error.

cheers

andrew

#6Dave Page
dpage@postgresql.org
In reply to: Magnus Hagander (#4)
Re: Oops - BF:Mastodon just died

On Jan 30, 2008 9:21 PM, Magnus Hagander <magnus@hagander.net> wrote:

I won't have access to my MSVC box until tomorrow, but unless beaten to
it I can dig into it a bit more. I don't see anything obvious in the
latest patches though (but again, that could be the beer :-P).

Any chance you could just do a forced run on it now to show if it was
some kind of transient stuff?

Not from here. :-(

/D

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: Oops - BF:Mastodon just died

Andrew Dunstan <andrew@dunslane.net> writes:

None of the CVS changes in the relevant period seems to have any
relation to the errors, so I suspect a local problem.

skylark and baiji are now red too, so I guess that theory is dead in the
water. Something in today's changes broke the MSVC build, but what?

I diffed yesterday's and today's make logs from skylark, and found
nothing interesting except this:

***************
*** 605,611 ****
          Generate DEF file^M
          Generating POSTGRES.DEF from directory Release\postgres^M
          ............................................................................................................................................................\
......................................................................................................................................................................\
.........................................................................................................................................^M
!         Generated 5208 symbols^M
          Linking...^M
             Creating library Release\postgres\postgres.lib and object Release\postgres\postgres.exp^M
          Embedding manifest...^M
--- 605,611 ----
          Generate DEF file^M
          Generating POSTGRES.DEF from directory Release\postgres^M
          ............................................................................................................................................................\
......................................................................................................................................................................\
.........................................................................................................................................^M
!         Generated 5205 symbols^M
          Linking...^M
             Creating library Release\postgres\postgres.lib and object Release\postgres\postgres.exp^M
          Embedding manifest...^M
***************

Presumably the three missing symbols include the two that are being
complained of later, but what the heck?

(Hmm, actually today's commits should have added two global symbols to
the backend, so it seems there are five not three symbols to be
accounted for.)

It is probably significant that both of the known missing symbols come
from guc.c, which we added another variable to today. I have a
sickening feeling that we have hit some kind of undocumented internal
limit in MSVC as to the number of symbols imported/exported by one
source file...

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Oops - BF:Mastodon just died

I wrote:

I diffed yesterday's and today's make logs from skylark, and found
nothing interesting except this:

***************
*** 605,611 ****
Generating POSTGRES.DEF from directory Release\postgres^M
!         Generated 5208 symbols^M
Linking...^M
--- 605,611 ----
Generating POSTGRES.DEF from directory Release\postgres^M
!         Generated 5205 symbols^M
Linking...^M
***************

Looking at this a bit closer, I realize that it's coming from
gendef.pl's dumpbin usage of recent infamy. So there are a couple
of ideas that come to mind:

* Has the buildfarm script changed recently in a way that might change
the execution PATH and thereby suck in a different version of dumpbin?
(Or even a different version of Perl?)

* Is it conceivable that dumpbin's output format has changed in a way
that confuses the bit of Perl code that's parsing it? One idea that
comes to mind is that it contains a timestamp that just got wider ---
I remember seeing some bugs like that when the value of Unix time_t
reached 1 billion and became 10 instead of 9 digits.
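
As a rough illustration of the class of bug being recalled here (and not of anything dumpbin is known to do), the Python sketch below shows how a parser that assumes a fixed-width timestamp column goes wrong once the value gains a digit. The `parse_fixed_width` helper and the sample line are hypothetical.

```python
from datetime import datetime, timezone

BILLION = 1_000_000_000  # the time_t value mentioned in the message

before = str(BILLION - 1)  # last value before the rollover
after = str(BILLION)       # first value after the rollover

# Width of the decimal representation on either side of the rollover.
width_before = len(before)
width_after = len(after)

# Hypothetical parser that was written when the column was still narrower.
FIELD_WIDTH = width_before


def parse_fixed_width(line: str) -> int:
    """Read a timestamp from the first FIELD_WIDTH characters of a line."""
    return int(line[:FIELD_WIDTH])


# The parser silently truncates the wider value instead of failing loudly.
truncated = parse_fixed_width(after + " some other column")

# The moment at which Unix time reached one billion seconds.
rollover = datetime.fromtimestamp(BILLION, tz=timezone.utc)
print(width_before, width_after, truncated, rollover.isoformat())
```

The point of the sketch is only that such failures are silent: nothing raises an error, the parsed value is simply wrong.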

Neither of these sound very plausible, but it seems the next step for
investigation is to look closely at what's happening in gendef.pl.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dave Page (#3)
Re: Oops - BF:Mastodon just died

"Dave Page" <dpage@postgresql.org> writes:

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update,

Re-reading the thread ... could that last point be significant? Are
all four of these boxen set to auto-accept updates from Redmond?

regards, tom lane

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: Oops - BF:Mastodon just died

Tom Lane wrote:

* Has the buildfarm script changed recently in a way that might change
the execution PATH and thereby suck in a different version of dumpbin?
(Or even a different version of Perl?)

No. In at least the case of red_bat nothing has changed for months.

* Is it conceivable that dumpbin's output format has changed in a way
that confuses the bit of Perl code that's parsing it? One idea that
comes to mind is that it contains a timestamp that just got wider ---
I remember seeing some bugs like that when the value of Unix time_t
reached 1 billion and became 10 instead of 9 digits.

Neither of these sound very plausible, but it seems the next step for
investigation is to look closely at what's happening in gendef.pl.

Right. I agree that your diff makes gendef.pl the prime suspect.

You also just said:

"Dave Page" <dpage@postgresql.org> writes:

I can't remember the last time I logged into that box so if it's
something in the buildenv, it's either caused by a Windows update,

Re-reading the thread ... could that last point be significant? Are
all four of these boxen set to auto-accept updates from Redmond?

No. red_bat does not auto-accept anything.

cheers

andrew

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: Oops - BF:Mastodon just died

Tom Lane wrote:

Neither of these sound very plausible, but it seems the next step for
investigation is to look closely at what's happening in gendef.pl.

Yes, I have found the problem. It is this line, which I am amazed hasn't
bitten us before:

next unless /^\d/;

The first field in the dumpbin output looks like a 3 digit hex number.
The line on my system for GetConfigOptionByName starts with 'A02' which
of course fails the test above.

For now I'm going to try to fix it by changing it to:

next unless $pieces[0] =~/^[A-F0-9]{3}$/;
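
The difference between the two checks can be tried out in isolation. The following is a Python transliteration of the two Perl patterns, offered as a sketch only: the sample index values other than 'A02' are made up, and both patterns are applied to the first field alone, which is an assumption about how the script splits each line.

```python
import re

OLD_CHECK = re.compile(r"^\d")            # counterpart of Perl's /^\d/
NEW_CHECK = re.compile(r"^[A-F0-9]{3}$")  # counterpart of the proposed fix

# Hypothetical first-field values; only 'A02' is quoted in the message.
indexes = ["00A", "9FF", "A02", "A03"]

# Entries that each check would let through (the rest are skipped by `next`).
old_kept = [i for i in indexes if OLD_CHECK.search(i)]
new_kept = [i for i in indexes if NEW_CHECK.search(i)]
dropped_by_old = [i for i in indexes if i not in old_kept]

print("old check keeps:", old_kept)
print("new check keeps:", new_kept)
print("silently dropped by old check:", dropped_by_old)
```

Every index below 0xA00 begins with a decimal digit, which would explain why the old test appeared to work until the symbol table grew past that point.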

I also propose to have the gendef.pl script save the dumpbin output so
this sort of problem will be easier to debug.

cheers

andrew

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#11)
Re: Oops - BF:Mastodon just died

Andrew Dunstan <andrew@dunslane.net> writes:

Yes, I have found the problem. It is this line, which I am amazed hasn't
bitten us before:
next unless /^\d/;
The first field in the dumpbin output looks like a 3 digit hex number.

Argh, so it was crossing a power-of-2 boundary that got us. Good catch.

For now I'm going to try to fix it by changing it to:
next unless $pieces[0] =~/^[A-F0-9]{3}$/;

Check.

I also propose to have the gendef.pl script save the dumpbin output so
this sort of problem will be easier to debug.

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition. We freeze for 8.3.0 in less than 24 hours.

regards, tom lane

#13Dave Page
dpage@postgresql.org
In reply to: Andrew Dunstan (#10)
Re: Oops - BF:Mastodon just died

On Jan 31, 2008 1:33 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

Re-reading the thread ... could that last point be significant? Are
all four of these boxen set to auto-accept updates from Redmond?

No. red_bat does not auto-accept anything.

For future reference, my BF members do auto-accept updates (though
they only reboot if I tell them to). It seems like having red_bat do
the opposite provides a useful baseline for tracking down future
issues.

I wonder if it would be worth adding a notes field to the BF so we can
record this sort of detail...

/D

#14Magnus Hagander
magnus@hagander.net
In reply to: Dave Page (#13)
Re: Oops - BF:Mastodon just died

On Thu, Jan 31, 2008 at 08:28:21AM +0000, Dave Page wrote:

On Jan 31, 2008 1:33 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

Re-reading the thread ... could that last point be significant? Are
all four of these boxen set to auto-accept updates from Redmond?

No. red_bat does not auto-accept anything.

For future reference, my BF members do auto-accept updates (though
they only reboot if I tell them to). It seems like having red_bat do
the opposite provides a useful baseline for tracking down future
issues.

I wonder if it would be worth adding a notes field to the BF so we can
record this sort of detail...

+1. That should be interesting for non-win32 platforms as well... Assuming
it's not too much work, of course ;)

I have yet to see the first case where a windows update breaks PostgreSQL
in any way though, but once it happens it would be nice to have the info.

//Magnus

#15Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#12)
Re: Oops - BF:Mastodon just died

On Thu, Jan 31, 2008 at 12:45:40AM -0500, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Yes, I have found the problem. It is this line, which I am amazed hasn't
bitten us before:
next unless /^\d/;
The first field in the dumpbin output looks like a 3 digit hex number.

Argh, so it was crossing a power-of-2 boundary that got us. Good catch.

For now I'm going to try to fix it by changing it to:
next unless $pieces[0] =~/^[A-F0-9]{3}$/;

Check.

Yeah, nice catch. Wouldn't surprise me if we actually had this problem
before, just that the dropped symbols were not actually used by our own
modules. I notice the export count jumped to 5226...

I also propose to have the gendef.pl script save the dumpbin output so
this sort of problem will be easier to debug.

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition. We freeze for 8.3.0 in less than 24 hours.

+1

//Magnus

#16Andrew Dunstan
andrew@dunslane.net
In reply to: Magnus Hagander (#15)
Re: Oops - BF:Mastodon just died

Magnus Hagander wrote:

I also propose to have the gendef.pl script save the dumpbin output so
this sort of problem will be easier to debug.

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition. We freeze for 8.3.0 in less than 24 hours.

+1

I am pretty damn sure it's OK. It's pretty low risk (change an unlink
call to a rename call) and even if it's broken as my first version was,
it doesn't appear to break the build. It's working on the buildfarm. I
want it in so if we have problems with 8.3 we don't have to go through
the handstands I had to to find out what was broken.
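
The shape of that change, keeping an intermediate file by renaming it rather than deleting it, can be sketched as below. This is a Python illustration and not the actual patch; the file name, directory layout, and the `keep_for_debugging` helper are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def keep_for_debugging(tmp_file: Path, build_dir: Path) -> Path:
    """Move an intermediate file into the build tree instead of deleting it.

    The previous behaviour would correspond to a plain tmp_file.unlink().
    """
    build_dir.mkdir(parents=True, exist_ok=True)
    kept = build_dir / tmp_file.name
    os.replace(tmp_file, kept)  # rename; overwrites a stale copy if present
    return kept


with tempfile.TemporaryDirectory() as root:
    scratch = Path(root) / "symbols.out"  # hypothetical scratch file name
    scratch.write_text("A02 example line\n")

    kept = keep_for_debugging(scratch, Path(root) / "Release" / "postgres")

    scratch_gone = not scratch.exists()  # the original path no longer exists
    kept_text = kept.read_text()         # but the content is still available

print(scratch_gone, kept_text.strip())
```

Because the kept file lives under the build output directory in this sketch, whatever removes that directory also removes the file, so no separate cleanup step is needed.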

cheers

andrew

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#15)
Re: Oops - BF:Mastodon just died

Magnus Hagander <magnus@hagander.net> writes:

Andrew Dunstan <andrew@dunslane.net> writes:

For now I'm going to try to fix it by changing it to:
next unless $pieces[0] =~/^[A-F0-9]{3}$/;

Yeah, nice catch. Wouldn't surprise me if we actually had this problem
before, just that the dropped symbols were not actually used by our own
modules. I notice the export count jumped to 5226...

I was wondering where the count would go.

It strikes me that the pattern needs to be {3,} or maybe just +.
I dunno what this column is measuring, but if we are past 0xA00
then surely 0x1000 is not far away.
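
A quick way to see the concern is to compare the fixed-width quantifier with the open-ended one on values either side of 0x1000. This is a Python sketch of the two Perl patterns; the index values are illustrative and not taken from real dumpbin output.

```python
import re

FIXED = re.compile(r"^[A-F0-9]{3}$")        # exactly three hex digits
OPEN_ENDED = re.compile(r"^[A-F0-9]{3,}$")  # three or more hex digits

last_three_digit = "FFF"
# One past FFF, formatted as upper-case hex.
first_four_digit = format(int(last_three_digit, 16) + 1, "X")

# Map each sample index to (matches FIXED, matches OPEN_ENDED).
results = {
    idx: (bool(FIXED.search(idx)), bool(OPEN_ENDED.search(idx)))
    for idx in (last_three_digit, first_four_digit)
}

for idx, (fixed_ok, open_ok) in results.items():
    print(f"{idx}: {{3}} matches={fixed_ok}, {{3,}} matches={open_ok}")
```

If the column does widen beyond three digits, the `{3}` form would drop every later symbol in the same silent way as the original bug.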

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#16)
Re: Oops - BF:Mastodon just died

Andrew Dunstan <andrew@dunslane.net> writes:

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition. We freeze for 8.3.0 in less than 24 hours.

I am pretty damn sure it's OK. It's pretty low risk (change an unlink
call to a rename call) and even if it's broken as my first version was,
it doesn't appear to break the build. It's working on the buildfarm. I
want it in so if we have problems with 8.3 we don't have to go through
the handstands I had to to find out what was broken.

After looking at the patch, my only question is how all those junk files
get cleaned up at "make clean".

regards, tom lane

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#18)
Re: Oops - BF:Mastodon just died

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Agreed, but I suggest waiting till 8.4 is branched unless you are really
sure about this addition. We freeze for 8.3.0 in less than 24 hours.

I am pretty damn sure it's OK. It's pretty low risk (change an unlink
call to a rename call) and even if it's broken as my first version was,
it doesn't appear to break the build. It's working on the buildfarm. I
want it in so if we have problems with 8.3 we don't have to go through
the handstands I had to to find out what was broken.

After looking at the patch, my only question is how all those junk files
get cleaned up at "make clean".

The symbols files we are keeping as a result of the patch are renamed
into the release or debug hierarchy (depending on what we're
building). Those entire trees are removed by src/tools/msvc/clean.bat.

cheers

andrew

#20Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#17)
Re: Oops - BF:Mastodon just died

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

Andrew Dunstan <andrew@dunslane.net> writes:

For now I'm going to try to fix it by changing it to:
next unless $pieces[0] =~/^[A-F0-9]{3}$/;

Yeah, nice catch. Wouldn't surprise me if we actually had this problem
before, just that the dropped symbols were not actually used by our own
modules. I notice the export count jumped to 5226...

I was wondering where the count would go.

It strikes me that the pattern needs to be {3,} or maybe just +.
I dunno what this column is measuring, but if we are past 0xA00
then surely 0x1000 is not far away.

http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
suggest that the size of the field is fixed.

But who knows?

cheers

andrew

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#20)
Re: Oops - BF:Mastodon just died

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

It strikes me that the pattern needs to be {3,} or maybe just +.
I dunno what this column is measuring, but if we are past 0xA00
then surely 0x1000 is not far away.

http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
suggest that the size of the field is fixed.

That would imply that dumpbin fails at 4096 symbols per file. While I
surely wouldn't put it past M$ to have put in such a limitation, I think
it's more likely that the documentation is badly written.

In any case it would be easy enough to make up a quick test to see what
happens with say

void func1() {}
void func2() {}
...
void func5000() {}
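
Writing five thousand such functions by hand is impractical, so the test file would presumably be generated. A minimal Python sketch of such a generator follows; the `make_test_source` name and the suggested output file name are assumptions, and compiling the result and running dumpbin on it is left to whoever has an MSVC box.

```python
def make_test_source(count: int) -> str:
    """Return C source defining empty functions func1 .. func<count>."""
    return "".join(f"void func{i}() {{}}\n" for i in range(1, count + 1))


source = make_test_source(5000)
lines = source.splitlines()

# Show the first and last generated definitions as a sanity check.
print(lines[0])
print(lines[-1])
print(len(lines), "functions generated")
```

Saving `source` to a file such as `manysyms.c` (a made-up name), compiling it, and inspecting the symbol dump would show directly what the index column does past FFF.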

regards, tom lane

#22Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Tom Lane (#21)
Re: Oops - BF:Mastodon just died

http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
suggest that the size of the field is fixed.

That would imply that dumpbin fails at 4096 symbols per file. While I
surely wouldn't put it past M$ to have put in such a limitation, I think
it's more likely that the documentation is badly written.

Yes, it starts with 3 and goes to 4 digits above FFF

Andreas

#23Andrew Dunstan
andrew@dunslane.net
In reply to: Zeugswetter Andreas ADI SD (#22)
Re: Oops - BF:Mastodon just died

Zeugswetter Andreas ADI SD wrote:

http://msdn2.microsoft.com/en-us/library/b842y285(VS.71).aspx appears to
suggest that the size of the field is fixed.

That would imply that dumpbin fails at 4096 symbols per file. While I
surely wouldn't put it past M$ to have put in such a limitation, I think
it's more likely that the documentation is badly written.

Yes, it starts with 3 and goes to 4 digits above FFF

OK, then {3,} is the right quantification. Will fix.

cheers

andrew