cuckoo is hung during regression test
The 8.1 build for cuckoo is currently hung, with the *postmaster* taking
all the CPU it can get. The build started almost 5 hours ago.
The postmaster is stuck in the following loop, according to
ktrace/kdump:
2023 postgres RET write 59/0x3b
2023 postgres CALL close(0xffffffff)
2023 postgres RET close -1 errno 9 Bad file descriptor
2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL accept(0x7,0x200148c,0x200150c)
2023 postgres RET accept -1 errno 24 Too many open files
2023 postgres CALL write(0x2,0x2003928,0x3b)
2023 postgres GIO fd 2 wrote 59 bytes
"LOG: could not accept new connection: Too many open files
"
2023 postgres RET write 59/0x3b
2023 postgres CALL close(0xffffffff)
2023 postgres RET close -1 errno 9 Bad file descriptor
2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL accept(0x7,0x200148c,0x200150c)
2023 postgres RET accept -1 errno 24 Too many open files
2023 postgres CALL write(0x2,0x200381c,0x3b)
2023 postgres GIO fd 2 wrote 59 bytes
"LOG: could not accept new connection: Too many open files
"
2023 postgres RET write 59/0x3b
ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)
Is there other useful info to be had about this process, or should I just kill
it?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes:
The postmaster is stuck in the following loop, according to
ktrace/kdump:
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL accept(0x7,0x200148c,0x200150c)
2023 postgres RET accept -1 errno 24 Too many open files
2023 postgres CALL write(0x2,0x2003928,0x3b)
2023 postgres GIO fd 2 wrote 59 bytes
"LOG: could not accept new connection: Too many open files
"
2023 postgres RET write 59/0x3b
2023 postgres CALL close(0xffffffff)
2023 postgres RET close -1 errno 9 Bad file descriptor
2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
Interesting. So accept() fails because it can't allocate an FD, which
means that the select condition isn't cleared, so we keep retrying
forever. I don't see what else we could do though. Having the
postmaster abort on what might well be a transient condition doesn't
sound like a hot idea. We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...
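For illustration, here is a toy sketch of that loop (not the actual postmaster code; server_loop and its error handling are simplified stand-ins). Because the failed accept() leaves the pending connection queued in the kernel, select() reports the listen socket readable again immediately, which is the 100%-CPU spin seen in the ktrace above:

/*
 * Toy sketch of a select()/accept() server loop.  If accept() fails with
 * EMFILE ("Too many open files"), the connection stays queued, so the
 * next select() returns at once and the loop repeats forever.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

static void
server_loop(int listen_fd)
{
    for (;;)
    {
        fd_set  rfds;

        FD_ZERO(&rfds);
        FD_SET(listen_fd, &rfds);

        /* blocks until a connection is pending on listen_fd */
        if (select(listen_fd + 1, &rfds, NULL, NULL, NULL) <= 0)
            continue;           /* signal or transient error: retry */

        if (FD_ISSET(listen_fd, &rfds))
        {
            int     newsock = accept(listen_fd, NULL, NULL);

            if (newsock < 0)
            {
                /*
                 * EMFILE: the connection is still queued, so select()
                 * fires again immediately -- with no delay here this is
                 * the CPU-burning, log-flooding loop from the ktrace.
                 */
                fprintf(stderr,
                        "LOG:  could not accept new connection: %s\n",
                        strerror(errno));
                continue;
            }

            /* normally we'd fork a backend to handle newsock here */
            close(newsock);
        }
    }
}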
I've been meaning to bug you about increasing cuckoo's FD limit anyway;
it keeps failing in the regression tests.
ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)
On my OS X machine, "ulimit -n unlimited" seems to set the limit to
10240 (or so a subsequent ulimit -a reports). But you could probably
fix it using the buildfarm parameter that cuts the number of concurrent
regression test runs.
regards, tom lane
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
Interesting. So accept() fails because it can't allocate an FD, which
means that the select condition isn't cleared, so we keep retrying
forever. I don't see what else we could do though. Having the
postmaster abort on what might well be a transient condition doesn't
sound like a hot idea. We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...
Well, not only that, but the machine is currently writing to the
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
(perhaps growing exponentially to some limit) would be a good idea.
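A sketch of that kind of growing sleep, purely for illustration (the function names and constants here are invented, not proposed code):

#include <unistd.h>

#define ACCEPT_RETRY_MIN_USEC   10000L          /* start at 10 msec */
#define ACCEPT_RETRY_MAX_USEC   1000000L        /* cap at 1 sec */

static long accept_retry_delay = ACCEPT_RETRY_MIN_USEC;

/* call after each failed accept(): sleep, then double the delay up to the cap */
static void
accept_failed_backoff(void)
{
    usleep(accept_retry_delay);
    accept_retry_delay *= 2;
    if (accept_retry_delay > ACCEPT_RETRY_MAX_USEC)
        accept_retry_delay = ACCEPT_RETRY_MAX_USEC;
}

/* call after a successful accept(): reset to the minimum delay */
static void
accept_succeeded(void)
{
    accept_retry_delay = ACCEPT_RETRY_MIN_USEC;
}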
I've been meaning to bug you about increasing cuckoo's FD limit anyway;
it keeps failing in the regression tests.
ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)
On my OS X machine, "ulimit -n unlimited" seems to set the limit to
10240 (or so a subsequent ulimit -a reports). But you could probably
fix it using the buildfarm parameter that cuts the number of concurrent
regression test runs.
Odd... that works on my MBP (sudo bash; ulimit -n unlimited) and I
get 12288. But the same thing doesn't work on cuckoo, which is a G4;
the limit stays at 1224 no matter what. Perhaps because I'm setting
maxfiles in launchd.conf.
In any case, I've upped it to a bit over 2k; we'll see what that
does. I find it interesting that aubrac isn't affected by this, since
it's still running with the default of only 256 open files.
I'm thinking we might want to change the default value for
max_files_per_process on OS X, or have initdb test it like it does
for other things.
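For what it's worth, a standalone sketch of the sort of probe initdb (or the server itself) could run, counting how many descriptors the process can actually open by dup()ing stdin until the kernel refuses (the program and its MAX_PROBE cap are made up for illustration):

#include <stdio.h>
#include <unistd.h>

#define MAX_PROBE 4096              /* don't bother probing past this */

int
main(void)
{
    int     fds[MAX_PROBE];
    int     used = 0;
    int     probed;

    /* open duplicates of stdin until the per-process limit bites */
    while (used < MAX_PROBE)
    {
        int     fd = dup(0);

        if (fd < 0)
            break;
        fds[used++] = fd;
    }
    probed = used;

    /* give them all back */
    while (used > 0)
        close(fds[--used]);

    printf("this process could open about %d additional file descriptors\n",
           probed);
    return 0;
}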
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim Nasby <jim@nasby.net> writes:
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...
Well, not only that, but the machine is currently writing to the
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
(perhaps growing exponentially to some limit) would be a good idea.
Well, since the code has always behaved that way and no one noticed
before, I don't think it's worth anything as complicated as a variable
delay. I just stuck a fixed 100msec delay into the accept-failed code
path.
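In terms of the toy loop sketched earlier, the change amounts to a short pause in the failure branch (paraphrased; the actual fix is in the postmaster's own accept path and may read differently):

            if (newsock < 0)
            {
                fprintf(stderr,
                        "LOG:  could not accept new connection: %s\n",
                        strerror(errno));
                usleep(100000);     /* fixed 100 msec pause before retrying */
                continue;
            }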
regards, tom lane
Tom Lane wrote:
Jim Nasby <jim@nasby.net> writes:
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...
Well, not only that, but the machine is currently writing to the
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
(perhaps growing exponentially to some limit) would be a good idea.
Well, since the code has always behaved that way and no one noticed
before, I don't think it's worth anything as complicated as a variable
delay. I just stuck a fixed 100msec delay into the accept-failed code
path.
Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
(And so does the autovac code I'm currently looking at).
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
Tom Lane wrote:
Jim Nasby <jim@nasby.net> writes:
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...
Well, not only that, but the machine is currently writing to the
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
(perhaps growing exponentially to some limit) would be a good idea.
Well, since the code has always behaved that way and no one noticed
before, I don't think it's worth anything as complicated as a variable
delay. I just stuck a fixed 100msec delay into the accept-failed code
path.
Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
(And so does the autovac code I'm currently looking at).
There is probably a good case for a shorter delay in postmaster, though.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Alvaro Herrera wrote:
Tom Lane wrote:
delay. I just stuck a fixed 100msec delay into the accept-failed code
path.
Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
(And so does the autovac code I'm currently looking at).
There is probably a good case for a shorter delay in postmaster, though.
Yeah, that's what I thought. We don't really care if either bgwriter or
autovac goes AWOL for a little while, but if the postmaster's asleep
then nobody can connect.
regards, tom lane