join regression failure on cygwin

Started by Andrew Dunstanover 16 years ago13 messages
#1Andrew Dunstan
andrew@dunslane.net

My Cygwin buildfarm member started failing (hanging, in fact) recently.
It seems to hang consistently in join.sql and the only way I can get it
to complete is to kill the backend fairly violently. See
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&amp;dt=2009-07-22%2023:10:21&gt;
It looks like this is the result of the semi-join ordering bug fix Tom
applied a few days ago, but it is building and running 8.4 quite
happily, and I think that branch got the same changes, so I'm not quite
sure about it.

I'm going to try reversing that change locally to see if it fixes the
problem. Unfortunately, this isn't an easy platform to debug.

cheers

andrew

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#1)
Re: join regression failure on cygwin

I wrote:

My Cygwin buildfarm member started failing (hanging, in fact)
recently. It seems to hang consistently in join.sql and the only way I
can get it to complete is to kill the backend fairly violently. See
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&amp;dt=2009-07-22%2023:10:21&gt;
It looks like this is the result of the semi-join ordering bug fix Tom
applied a few days ago, but it is building and running 8.4 quite
happily, and I think that branch got the same changes, so I'm not
quite sure about it.

I'm going to try reversing that change locally to see if it fixes the
problem. Unfortunately, this isn't an easy platform to debug.

Nope, it wasn't that. So my next candidate is the prior
join_is_legal/GEQO changes. But reverting that would get rid of the new
regression test where we seem to be failing, so I'm not quite sure where
to go on it.

cheers

andrew

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: join regression failure on cygwin

Andrew Dunstan <andrew@dunslane.net> writes:

My Cygwin buildfarm member started failing (hanging, in fact)
recently. It seems to hang consistently in join.sql and the only way I
can get it to complete is to kill the backend fairly violently. See
<http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=brown_bat&amp;dt=2009-07-22%2023:10:21&gt;

Nope, it wasn't that. So my next candidate is the prior
join_is_legal/GEQO changes. But reverting that would get rid of the new
regression test where we seem to be failing, so I'm not quite sure where
to go on it.

I see it claims to have working erand48, but maybe not so much in
reality?

regards, tom lane

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: join regression failure on cygwin

Tom Lane wrote:

I see it claims to have working erand48, but maybe not so much in
reality?

Good guess. I removed erand48 from the configure file and added
erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.

cheers

andrew

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#4)
Re: join regression failure on cygwin

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I see it claims to have working erand48, but maybe not so much in
reality?

Good guess. I removed erand48 from the configure file and added
erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.

Hmm. So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...

regards, tom lane

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: join regression failure on cygwin

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I see it claims to have working erand48, but maybe not so much in
reality?

Good guess. I removed erand48 from the configure file and added
erand48.o to OBJS in src/port/Makefile and it suddenly sailed through.

Hmm. So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...

Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
and then find out later it's broken on some other abstruse system.

cheers

andrew

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#6)
Re: join regression failure on cygwin

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Hmm. So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...

Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
and then find out later it's broken on some other abstruse system.

Seems like it would be useful to figure out exactly why it's failing.

I don't personally have a problem with just forcing use of our own
erand48 on Cygwin; it's not a lot of code and it would make the behavior
of that build more like the MSVC build. But it's curious that such a
simple library function is seemingly broken on Cygwin ... especially
when their random() and srandom() evidently work.

regards, tom lane

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#7)
Re: join regression failure on cygwin

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Hmm. So we need to figure out how to improve configure's check so that
it rejects whatever broken version you've got ...

Yeah. Any ideas? I'd hate just to exclude the system erand48 on Cygwin
and then find out later it's broken on some other abstruse system.

Seems like it would be useful to figure out exactly why it's failing.

I don't personally have a problem with just forcing use of our own
erand48 on Cygwin; it's not a lot of code and it would make the behavior
of that build more like the MSVC build. But it's curious that such a
simple library function is seemingly broken on Cygwin ... especially
when their random() and srandom() evidently work.

I'll work on it, but for now I propose to make the following change to
configure.in and the corresponding change in configure:

diff -u -r1.605 configure.in
--- configure.in        16 Jul 2009 17:43:52 -0000      1.605
+++ configure.in        23 Jul 2009 22:39:19 -0000
@@ -1249,7 +1249,7 @@
 pgac_save_LIBS="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
-AC_REPLACE_FUNCS([crypt erand48 getopt getrusage inet_aton random rint 
srandom strdup strerror strlcat strlcpy strtol strtoul])
+AC_REPLACE_FUNCS([crypt getopt getrusage inet_aton random rint srandom 
strdup strerror strlcat strlcpy strtol strtoul])

case $host_os in

@@ -1262,6 +1262,12 @@
;;
esac

+# Cygwin's erand48 sometimes hangs, so force use of ours
+if test "$PORTNAME" = "cygwin"; then
+  AC_LIBOBJ(erand48)
+else
+  AC_REPLACE_FUNCS([erand48])
+fi

LIBS="$pgac_save_LIBS"

Show quoted text

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#8)
Re: join regression failure on cygwin

Andrew Dunstan <andrew@dunslane.net> writes:

I'll work on it, but for now I propose to make the following change to
configure.in and the corresponding change in configure:

I believe you can just add AC_LIBOBJ(erand48) in the Cygwin-specific
section without touching the other part; that's supposed to be a no-op
if the filename has already been added to LIBOBJS.

regards, tom lane

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#7)
Re: join regression failure on cygwin

Tom Lane wrote:

I don't personally have a problem with just forcing use of our own
erand48 on Cygwin; it's not a lot of code and it would make the behavior
of that build more like the MSVC build. But it's curious that such a
simple library function is seemingly broken on Cygwin ... especially
when their random() and srandom() evidently work.

It appears on Googling a bit that the erand48() is buggy in that it
requires the seed to have been initialized with srand48() or it will
constantly return 0.0.

So I think just forcing use of ours is the safe way to go. It might have
been fixed since I installed Cygwin, although I can't find a reference
to that, and I don't feel like triangulating it anyway.

cheers

andrew

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#10)
Re: join regression failure on cygwin

Andrew Dunstan <andrew@dunslane.net> writes:

It appears on Googling a bit that the erand48() is buggy in that it
requires the seed to have been initialized with srand48() or it will
constantly return 0.0.

Huh, and that sends us into an infinite loop? I'll take a look at that.
Even though it's surely nonrandom, it doesn't seem like pathological
behavior of the RNG should wedge us completely.

So I think just forcing use of ours is the safe way to go.

Agreed, but I'm curious ...

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#11)
Re: join regression failure on cygwin

I wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

It appears on Googling a bit that the erand48() is buggy in that it
requires the seed to have been initialized with srand48() or it will
constantly return 0.0.

Huh, and that sends us into an infinite loop? I'll take a look at that.
Even though it's surely nonrandom, it doesn't seem like pathological
behavior of the RNG should wedge us completely.

The answer is that a constant RNG result sends this bit of
geqo_selection() into a tight loop:

int first,
second;

first = linear(root, pool->size, bias);
second = linear(root, pool->size, bias);

if (pool->size > 1)
{
while (first == second)
second = linear(root, pool->size, bias);
}

Not sure if it's worth trying to do something about that, or exactly
what we'd do anyway. Even if we hacked this up somehow, a constant RNG
result would pretty much break GEQO for any useful purpose. So it could
be argued that having the regression tests fail here is a good thing...

regards, tom lane

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#12)
Re: join regression failure on cygwin

Tom Lane wrote:

I wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

It appears on Googling a bit that the erand48() is buggy in that it
requires the seed to have been initialized with srand48() or it will
constantly return 0.0.

Huh, and that sends us into an infinite loop? I'll take a look at that.
Even though it's surely nonrandom, it doesn't seem like pathological
behavior of the RNG should wedge us completely.

The answer is that a constant RNG result sends this bit of
geqo_selection() into a tight loop:

int first,
second;

first = linear(root, pool->size, bias);
second = linear(root, pool->size, bias);

if (pool->size > 1)
{
while (first == second)
second = linear(root, pool->size, bias);
}

Not sure if it's worth trying to do something about that, or exactly
what we'd do anyway. Even if we hacked this up somehow, a constant RNG
result would pretty much break GEQO for any useful purpose. So it could
be argued that having the regression tests fail here is a good thing...

Right. Let's let sleeping dogs lie. I think at most a code comment is
the only action called for.

cheers

andrew