regresssion script hole

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Andrew Dunstan (#1)

Re: regresssion script hole

Andrew Dunstan <andrew@dunslane.net> writes:

The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.

If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.

I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point.

How does that differ from just iterating the loop one more time?

regards, tom lane

andrew@dunslane.net

over 19 years ago

In reply to: Tom Lane (#2)

Re: regresssion script hole

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

The problem is that if the postmaster takes more than 60 seconds to
start listening (as is apparently happening on spoonbill - don't yet
know why) this code falls through.

If the postmaster takes that long to start listening, I'd say we need to
fix the postmaster not pg_regress.

We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.

I'm inclined to run the psql test one more time to make sure we can
actually connect, and if not then fail at this point.

How does that differ from just iterating the loop one more time?

There is no guarantee that at the end of the loop we have connected
successfully to postgres.

I will post a patch that shows what I suggest.

cheers

andrew

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Andrew Dunstan (#3)

Re: regresssion script hole

Andrew Dunstan <andrew@dunslane.net> writes:

We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.

One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?

regards, tom lane

andrew@dunslane.net

over 19 years ago

In reply to: Tom Lane (#4)

Re: regresssion script hole

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

We need both, I think. I am still trying to find out why it's taking so
long. This is on the 8.0 branch, though. Later branches seem to be working.

One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?

There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt

Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.

cheers

andrew

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Andrew Dunstan (#5)

Re: regresssion script hole

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

One idea that comes to mind is a DNS lookup timeout. Can you strace the
postmaster to see what it's doing?

There is ktrace output I managed to capture at
http://developer.postgresql.org/~adunstan/ktrace.txt
Not sure what it tells us. I do see it reading in the whole timezone db,
which isn't pretty.

Yeah... I think we fixed that in 8.1. Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:

24583 postgres CALL socket(0x2,0x2,0)
24583 postgres RET socket 3
24583 postgres CALL sendto(0x3,0x43e1e000,0x25,0,0x493a6338,0x10)
24583 postgres GIO fd 3 wrote 37 bytes
"\M-Sr\^A\0\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET sendto 37/0x25
24583 postgres CALL getpid()
24583 postgres RET getpid 24583/0x6007
24583 postgres CALL select(0x4,0x40739600,0,0,0xfffffffffffe4d90)
24583 postgres RET select 1
24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
24583 postgres GIO fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET recvfrom 37/0x25
24583 postgres CALL close(0x3)
24583 postgres RET close 0

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.

regards, tom lane

andrew@dunslane.net

over 19 years ago

In reply to: Tom Lane (#6)

Re: regresssion script hole

Tom Lane wrote:

Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.

Why are we actually looking up anything? Just so we can bind to a
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like

lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.

cheers

andrew

Michael Fuhr

mike@fuhr.org

over 19 years ago

In reply to: Tom Lane (#6)

Re: regresssion script hole

On Sun, Jun 18, 2006 at 07:50:04PM -0400, Tom Lane wrote:

24583 postgres CALL recvfrom(0x3,0x477e4000,0x10000,0,0xfffffffffffe4da0,0xfffffffffffe4d5c)
24583 postgres GIO fd 3 read 37 bytes
"\M-Sr\M^A\M^B\0\^A\0\0\0\0\0\0\^B''\rkaltenbrunner\^Bcc\0\0\^A\0\^A"
24583 postgres RET recvfrom 37/0x25
24583 postgres CALL close(0x3)
24583 postgres RET close 0

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
and the DNS server is responding with SRVFAIL.

--
Michael Fuhr

Michael Fuhr

mike@fuhr.org

over 19 years ago

In reply to: Michael Fuhr (#8)

Re: regresssion script hole

On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)

Correction: ''.kaltenbrunner.cc

--
Michael Fuhr

#10

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Michael Fuhr (#9)

Re: regresssion script hole

Michael Fuhr wrote:

On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:

Maybe I'm misreading the packet, but I think the query is for
''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)

Correction: ''.kaltenbrunner.cc

yes that is exactly the issue - the postmaster tries to resolve
''.kaltenbrunner.cc multiple times during startup and getting ServFail
as a response from the upstream resolver.

Stefan

#11

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Andrew Dunstan (#7)

Re: regresssion script hole

Andrew Dunstan wrote:

Tom Lane wrote:

Anyway, the tail end of the trace
shows it repeatedly sending off a UDP packet and getting practically the
same data back:

I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
bet this is it. I think it's trying to look up "kaltenbrunner.cc" and
failing.

Why are we actually looking up anything? Just so we can bind to a
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like

lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.

no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)

Stefan

#12

andrew@dunslane.net

over 19 years ago

In reply to: Stefan Kaltenbrunner (#11)

Re: regresssion script hole

Stefan Kaltenbrunner wrote:

Andrew Dunstan wrote:

Why are we actually looking up anything? Just so we can bind to a
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like

lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.

no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I infact verified that when we initially started
to investigate that issue a while ago :-)

Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can
we just try with that resolver line?

The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.

cheers

andrew

#13

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Andrew Dunstan (#12)

Re: regresssion script hole

Andrew Dunstan <andrew@dunslane.net> writes:

The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

regards, tom lane

#14

Martijn van Oosterhout

kleptog@svana.org

over 19 years ago

In reply to: Tom Lane (#13)

Re: regresssion script hole

On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?

I wonder it it'd be worth adding some conditional code around
gethostbyname() calls that warn if the call took longer than say 10
seconds. By printing out that and the string it's looking up you could
save a lot of time confirming if the delay is there or elsewhere...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#15

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Martijn van Oosterhout (#14)

Re: regresssion script hole

Martijn van Oosterhout wrote:

On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

The formatting of the message suggests it is a gethostbyname("''")
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?

I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.
And from what it looks like it might be getting somehow rate limited by
my ISPs recursive resolvers after doing the same query a dozens of times
and getting a "servfail" every time.
At least the timestamps seem to indicate that the responses are getting
delayed up to 10 seconds after a number of queries ...
It might be a complete shot in the dark but spoonbill worked fine on
REL_8_0_STABLE until i disabled reporting 3 month ago.
During this time the large escaping security fix/standard_strings patch
went in - could this be related in any way ?

Stefan

#16

stefan@kaltenbrunner.cc

over 19 years ago

In reply to: Tom Lane (#13)

Re: regresssion script hole

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?

I will try to verify what the later branches are doing exactly ...

Stefan

#17

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Stefan Kaltenbrunner (#15)

Re: regresssion script hole

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitly trying to look up ''.kaltenbrunner.cc a lot of times.

I just strace'd postmaster start on a Fedora box and can see nothing
corresponding. Since this is a "make check" we know that the PG
configuration files it's using are stock ... so it must be something
about the system config that's sending it round the bend. What do you
have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf?

regards, tom lane

#18

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Stefan Kaltenbrunner (#15)

Re: regresssion script hole

Oh, I think I see the problem:

8.0 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses=''"
fi

8.1 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses="
fi

regards, tom lane

#19

andrew@dunslane.net

over 19 years ago

In reply to: Stefan Kaltenbrunner (#16)

Re: regresssion script hole

Stefan Kaltenbrunner wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

The question isn't whether is succeeds, it's how long it takes to
succeed. When I increased the pg_regress timeout it actually went
through the whole regression test happily. I suspect we have 2 things
eating up the 60s timeout here: loading the timezone db and resolving
whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds. So the failed DNS responses must be the problem. Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60 second limit on this (quite slow) box ?

I will try to verify what the later branches are doing exactly ...

Yes, we're on the margin here. The successful runs I saw got through the
timeout in 5 or 10 seconds over the 60 that we currently allow.

What interests me is where it even gets the string 'kaltenbrunner.cc'
from. It looks to me like the most likely place is the search line in
/etc/resolv.conf. Ir would be nice to know exactly what it is trying to
resolve.

cheers

andrew

cheers

andrew

#20

andrew@dunslane.net

over 19 years ago

In reply to: Tom Lane (#18)

Re: regresssion script hole

Tom Lane wrote:

Oh, I think I see the problem:

8.0 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses=''"
fi

8.1 pg_regress:

if [ "$unix_sockets" = no ]; then
postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
postmaster_options="$postmaster_options -c listen_addresses="
fi

Good catch! I'm impressed! This is surely the heart of the problem.

That change (from rev 1.56) clearly needs to be backported to 8.0.

cheers

andrew

#21