Maybe BF "timedout" failures are the client script's fault?
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05[2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07[3]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37.
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05 says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.
Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.
I propose the attached patch to the BF client to try to make this
more robust.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37
Attachments:
make-bf-timeout-more-robust.patchtext/x-diff; charset=us-ascii; name=make-bf-timeout-more-robust.patchDownload+7-1
Hi,
On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.
That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.
I've ran that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public builfarm logs (at least I can't find it on a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.
I propose the attached patch to the BF client to try to make this
more robust.
Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?
Michael
Hi,
On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.
That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.
I've ran that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public builfarm logs (at least I can't find it on a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.
I propose the attached patch to the BF client to try to make this
more robust.
Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?
Michael
On 2026-01-09 Fr 3:41 PM, Tom Lane wrote:
We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.I propose the attached patch to the BF client to try to make this
more robust.regards, tom lane
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37
The patch seems reasonable on its face, but I doubt it's the issue.
Rather I think what's happening here is that a test is hanging silently
and lastcommand.log's mtime doesn't get updated, causing a misreporting
of the run duration. So in addition to the above I have added some code
to update that timestamp if the file exists (which should only be the
case with a timeout).
See
https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
The patch seems reasonable on its face, but I doubt it's the issue.
Rather I think what's happening here is that a test is hanging silently
and lastcommand.log's mtime doesn't get updated, causing a misreporting
of the run duration. So in addition to the above I have added some code
to update that timestamp if the file exists (which should only be the
case with a timeout).
Sounds reasonable. Thanks!
regards, tom lane
Hi
On Sat, Jan 10, 2026 at 09:34:16AM -0500, Andrew Dunstan wrote:
So in addition to the above I have added some code to update that
timestamp if the file exists (which should only be the case with a
timeout).See https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c
I've applied this to fruitcrow and it now shows correct runtimes for a
timed-out run, see
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master
and
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fruitcrow&dt=2026-01-12%2008%3A10%3A04
Thanks!
Michael
Hi
On Sat, Jan 10, 2026 at 09:34:16AM -0500, Andrew Dunstan wrote:
So in addition to the above I have added some code to update that
timestamp if the file exists (which should only be the case with a
timeout).See https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c
I've applied this to fruitcrow and it now shows correct runtimes for a
timed-out run, see
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master
and
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fruitcrow&dt=2026-01-12%2008%3A10%3A04
Thanks!
Michael