regresssion script hole
While investigating some problems with buildfarm member spoonbill I came
across this piece of code in pg_regress.sh, which seems less than robust:
# Wait till postmaster is able to accept connections (normally only
# a second or so, but Cygwin is reportedly *much* slower). Don't
# wait forever, however.
i=0
max=60
until "$bindir/psql" -X $psql_options postgres </dev/null 2>/dev/null
do
i=`expr $i + 1`
if [ $i -ge $max ]
then
break
fi
if kill -0 $postmaster_pid >/dev/null 2>&1
then
: still starting up
else
break
fi
sleep 1
done
if kill -0 $postmaster_pid >/dev/null 2>&1
then
echo "running on port $PGPORT with pid $postmaster_pid"
else
echo
echo "$me: postmaster did not start"
echo "Examine $LOGDIR/postmaster.log for the reason."
echo
(exit 2); exit
fi
The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.
I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point. I wouldn't bother
but it did confuse the heck out of both Stefan and me when createlang
failed.
Thoughts?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.
If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.
I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point.
How does that differ from just iterating the loop one more time?
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.
We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.
I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point.How does that differ from just iterating the loop one more time?
There is no guarantee that at the end of the loop we have connected
successfully to postgres.
I will post a patch that shows what I suggest.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.
One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?
There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt
Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?
There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt
Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.
Yeah... I think we fixed that in 8.1. Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:
24583 postgres CALL socket(0x2,0x2,0)
24583 postgres RET socket 3
24583 postgres CALL sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)
24583 postgres GIO fd 3 wrote 37 bytes
"\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET sendto 37/0x25
24583 postgres CALL getpid()
24583 postgres RET getpid 24583/0x6007
24583 postgres CALL select(0x4,0x40739600,0,0,0xfffffffffffe4d90)
24583 postgres RET select 1
24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
24583 postgres GIO fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET recvfrom 37/0x25
24583 postgres CALL close(0x3)
24583 postgres RET close 0
I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.
regards, tom lane
Tom Lane wrote:
Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.
Why are we actually looking up anything? Just so we can bind to a
listening socket?
Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like
lookup file bind
Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.
cheers
andrew
On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote:
24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
24583 postgres GIO fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET recvfrom 37/0x25
24583 postgres CALL close(0x3)
24583 postgres RET close 0I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.
Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
and the DNS server is responding with SRVFAIL.
--
Michael Fuhr
On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
Correction: ''.kaltenbrunner.cc
--
Michael Fuhr
Michael Fuhr wrote:
On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)Correction: ''.kaltenbrunner.cc
yes that is exactly the issue - the postmaster tries to resolve
''.kaltenbrunner.cc multiple times during startup and getting ServFail
as a response from the upstream resolver.
Stefan
Andrew Dunstan wrote:
Tom Lane wrote:
Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.Why are we actually looking up anything? Just so we can bind to a
listening socket?Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something likelookup file bind
Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.
no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)
Stefan
Stefan Kaltenbrunner wrote:
Andrew Dunstan wrote:
Why are we actually looking up anything? Just so we can bind to a
listening socket?Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something likelookup file bind
Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)
Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can
we just try with that resolver line?
The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.
The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?
Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.
regards, tom lane
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.
The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?
I wonder it it'd be worth adding some conditional code around
gethostbyname() calls that warn if the call took longer than say 10
seconds. By printing out that and the string it's looking up you could
save a lot of time confirming if the delay is there or elsewhere...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?
I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.
And from what it looks like it might be getting somehow rate limited by
my ISPs recursive resolvers after doing the same query a dozens of times
and getting a "servfail" every time.
At least the timestamps seem to indicate that the responses are getting
delayed up to 10 seconds after a number of queries ...
It might be a complete shot in the dark but spoonbill worked fine on
REL_8_0_STABLE until i disabled reporting 3 month ago.
During this time the large escaping security fix/standard_strings patch
went in - could this be related in any way ?
Stefan
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.
hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?
I will try to verify what the later branches are doing exactly ...
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.
I just strace'd postmaster start on a Fedora box and can see nothing
corresponding. Since this is a "make check" we know that the PG
configuration files it's using are stock ... so it must be something
about the system config that's sending it round the bend. What do you
have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf?
regards, tom lane
Oh, I think I see the problem:
8.0 pg_regress:
if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses=''"
fi
8.1 pg_regress:
if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses="
fi
regards, tom lane
Stefan Kaltenbrunner wrote:
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?I will try to verify what the later branches are doing exactly ...
Yes, we're on the margin here. The successful runs I saw got through the
timeout in 5 or 10 seconds over the 60 that we currently allow.
What interests me is where it even gets the string 'kaltenbrunner.cc'
from. It looks to me like the most likely place is the search line in
/etc/resolv.conf. Ir would be nice to know exactly what it is trying to
resolve.
cheers
andrew
cheers
andrew
Tom Lane wrote:
Oh, I think I see the problem:
8.0 pg_regress:
if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses=''"
fi8.1 pg_regress:
if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses="
fi
Good catch! I'm impressed! This is surely the heart of the problem.
That change (from rev 1.56) clearly needs to be backported to 8.0.
cheers
andrew
I wrote:
8.0 pg_regress:
postmaster_options="$postmaster_options -c listen_addresses=''"
8.1 pg_regress:
postmaster_options="$postmaster_options -c listen_addresses="
and in fact here's the commit that changed that:
2005-06-19 22:26 tgl
* src/test/regress/pg_regress.sh: Change shell syntax that seems
not to work right on FreeBSD 6-CURRENT buildfarm machines.
So apparently it's some generic disease of the BSD shell. I should have
back-patched at the time but did not. Will take care of it.
On the timezone search business, it's still the case that HEAD will
search through all the timezones if it's not given an explicit setting
(eg an explicit environment TZ value). We could suppress that by having
pg_regress set TZ, but then the regression tests wouldn't exercise the
search code at all, which is probably not a great idea.
regards, tom lane