Maybe BF "timedout" failures are the client script's fault?

Started by Tom Lane3 months ago7 messageshackers

tgl@sss.pgh.pa.us

3 months ago

We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05[2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07[3]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37.
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05 says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.

I propose the attached patch to the BF client to try to make this
more robust.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37

michael.banck@credativ.de

3 months ago

In reply to: Tom Lane (#1)

Re: Maybe BF "timedout" failures are the client script's fault?

Hi,

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:

We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.

I've ran that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public builfarm logs (at least I can't find it on a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.

I propose the attached patch to the BF client to try to make this
more robust.

Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?

Michael

michael.banck@credativ.de

3 months ago

In reply to: Tom Lane (#1)

Re: Maybe BF "timedout" failures are the client script's fault?

Hi,

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:

We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.

I've ran that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public builfarm logs (at least I can't find it on a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.

I propose the attached patch to the BF client to try to make this
more robust.

Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?

Michael

andrew@dunslane.net

3 months ago

In reply to: Tom Lane (#1)

Re: Maybe BF "timedout" failures are the client script's fault?

On 2026-01-09 Fr 3:41 PM, Tom Lane wrote:

We've been assuming that all the "timedout" failures on BF member
fruitcrow were due to some wonkiness in the GNU/Hurd platform.
I got suspicious about that though after noticing that there are
a small number of such failures on other animals, eg [1][2][3].
In each case, the failure message claims it waited a good long
time, which is at variance with the actually observed runtime.
For instance [1] says "timed out after 14400 secs", but the
actual total test runtime is only 01:24:28 according to the
summary at the top of the page.

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.
GNU/Hurd might just be more prone to that than other platforms.

I propose the attached patch to the BF client to try to make this
more robust.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=ovenbird&dt=2025-11-14%2009%3A21%3A05
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2025-10-17%2018%3A32%3A07
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2026-01-08%2023%3A07%3A37

The patch seems reasonable on its face, but I doubt it's the issue.
Rather I think what's happening here is that a test is hanging silently
and lastcommand.log's mtime doesn't get updated, causing a misreporting
of the run duration. So in addition to the above I have added some code
to update that timestamp if the file exists (which should only be the
case with a timeout).

See
https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

tgl@sss.pgh.pa.us

3 months ago

In reply to: Andrew Dunstan (#4)

Re: Maybe BF "timedout" failures are the client script's fault?

Andrew Dunstan <andrew@dunslane.net> writes:

The patch seems reasonable on its face, but I doubt it's the issue.
Rather I think what's happening here is that a test is hanging silently
and lastcommand.log's mtime doesn't get updated, causing a misreporting
of the run duration. So in addition to the above I have added some code
to update that timestamp if the file exists (which should only be the
case with a timeout).

Sounds reasonable. Thanks!

regards, tom lane

michael.banck@credativ.de

3 months ago

In reply to: Andrew Dunstan (#4)

Re: Maybe BF "timedout" failures are the client script's fault?

Hi

On Sat, Jan 10, 2026 at 09:34:16AM -0500, Andrew Dunstan wrote:

So in addition to the above I have added some code to update that
timestamp if the file exists (which should only be the case with a
timeout).

See https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c

I've applied this to fruitcrow and it now shows correct runtimes for a
timed-out run, see

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master
and
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fruitcrow&dt=2026-01-12%2008%3A10%3A04

Thanks!

Michael

michael.banck@credativ.de

3 months ago

In reply to: Andrew Dunstan (#4)

Re: Maybe BF "timedout" failures are the client script's fault?

Hi

On Sat, Jan 10, 2026 at 09:34:16AM -0500, Andrew Dunstan wrote:

So in addition to the above I have added some code to update that
timestamp if the file exists (which should only be the case with a
timeout).

See https://github.com/PGBuildFarm/client-code/commit/e5d67a35a0136a53e441fccf0ecc9b1b6322526c

I've applied this to fruitcrow and it now shows correct runtimes for a
timed-out run, see

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master
and
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fruitcrow&dt=2026-01-12%2008%3A10%3A04

Thanks!

Michael