Re: Maybe BF "timedout" failures are the client script's fault?

Started by Tom Lane · 3 months ago · 3 messages · hackers

tgl@sss.pgh.pa.us

3 months ago

Michael Banck <mbanck@gmx.net> writes:

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:

Looking into the buildfarm client, I realized that it's assuming that
"sleep($wait_time)" is sufficient to wait for $wait_time seconds.
However, the Perl docs point out that sleep() can be interrupted by a
signal. So now I'm suspicious that many of these failures are caused
by a stray signal waking up the wait_timeout thread prematurely.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.

If it's always the same test, then yeah that's evidence against
my theory (at least for fruitcrow's failures).
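
The sleep() concern quoted above can be illustrated with a small sketch. This is not code from the buildfarm client: `sleep_until_elapsed` is a hypothetical helper name, and the only thing taken from the thread is that the client calls `sleep($wait_time)` once. Perl's `sleep` may return early when a signal arrives, so one possible guard is to loop until a deadline has passed:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the buildfarm client): keep sleeping until
# the full interval has elapsed, even if a signal cuts a sleep() call short.
sub sleep_until_elapsed
{
    my ($wait_time) = @_;

    # time() has one-second resolution, which should be ample for a
    # timeout measured in minutes or hours.
    my $deadline = time() + $wait_time;
    my $calls    = 0;

    while ((my $remaining = $deadline - time()) > 0)
    {
        sleep($remaining);    # may return early if a signal is delivered
        $calls++;
    }

    return $calls;            # how many sleep() calls were needed
}
```

Under these assumptions, a caller would use `sleep_until_elapsed($wait_time)` where it previously used a bare `sleep($wait_time)`; a return value greater than 1 would indicate that at least one sleep was cut short.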

I've run that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public buildfarm logs (at least I can't find it at a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.

Hmm. Then why is the BF report showing that the total runtime is
nowhere near that? I wonder how those times are gathered ...

regards, tom lane

Michael Banck

michael.banck@credativ.de

3 months ago

In reply to: Tom Lane (#1)

Hi,

On Fri, Jan 09, 2026 at 04:42:22PM -0500, Tom Lane wrote:

Michael Banck <mbanck@gmx.net> writes:

I've run that test manually quite a lot and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public buildfarm logs (at least I can't find it at a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.

Hmm. Then why is the BF report showing that the total runtime is
nowhere near that? I wonder how those times are gathered ...

Looks like the server takes the timestamp of the logfile as an educated
guess for when the particular stage finished:

https://github.com/PGBuildFarm/server-code/blob/main/cgi-bin/pgstatus.pl#L496

But this does not work when the stage hangs and gets terminated
externally. It seems that in this case no output is appended to the
stage log, neither by the buildfarm client nor (at least for fruitcrow)
by the stage itself, so the server thinks the stage duration was
whatever time it took until the last piece of output before the hang.
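
To make the effect concrete, here is a rough sketch of that kind of estimate. It is not the actual pgstatus.pl logic: the function name, the data layout, and the assumption that each stage's duration is the gap between consecutive log timestamps are all hypothetical, and the numbers are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the real server code: treat each stage log's
# last-modified time as the moment that stage finished, and report the gap
# to the previous timestamp as the stage's duration.
sub estimate_stage_durations
{
    my ($run_start, @stages) = @_;    # each stage is [ name, log_mtime ]
    my $prev = $run_start;
    my @durations;

    foreach my $stage (@stages)
    {
        my ($name, $mtime) = @$stage;
        push @durations, [ $name, $mtime - $prev ];
        $prev = $mtime;
    }

    return @durations;
}

# Invented numbers: the second stage wrote its last output 12 seconds after
# it started, then hung and was killed much later without writing anything.
my @estimates = estimate_stage_durations(
    0,
    [ 'earlier_stage', 300 ],
    [ 'hung_stage',    312 ],
);

printf "%-14s %5d s\n", @$_ for @estimates;
```

With these inputs the hung stage is reported as taking 12 seconds, because nothing touched its log between the last output and the kill, however long the hang actually lasted.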

Michael
