buildfarm animals and 'snapshot too old'
Hi all,
Today I got a few errors like these (this one is from last week, though):
Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT
Content:
snapshot to old: Wed May 7 04:36:57 2014 GMT
on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.
If I understand the Perl code correctly, it does this:
(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?
regards
Tomas
On 05/15/2014 12:43 PM, Tomas Vondra wrote:
Hi all,
Today I got a few errors like these (this one is from last week, though):
Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT
Content:
snapshot to old: Wed May 7 04:36:57 2014 GMT
on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.
If I understand the Perl code correctly, it does this:
(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?
The 24 hours runs from the time of the latest commit on the branch in
question, not the current time, but basically yes.
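(Conceptually the check amounts to something like the sketch below; the names
are illustrative only, not the actual pgstatus.pl code.)

use strict;
use warnings;

# Illustrative sketch only: the reported snapshot is compared against the
# newest commit on the branch being reported, not against the current time.
my $max_lag = 24 * 60 * 60;    # 24 hours, in seconds

sub snapshot_too_old
{
    my ($snapshot_epoch, $latest_commit_epoch) = @_;    # Unix timestamps
    return $snapshot_epoch < $latest_commit_epoch - $max_lag;
}

# Example: a snapshot taken 30 hours before the branch tip would be rejected.
print "493 snapshot too old\n"
    if snapshot_too_old(time() - 30 * 3600, time());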
We've never had machines with runs that long. The longest in recent
times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
around 4.5 hours. But we have had misconfigured machines reporting
unbelievable snapshot times. I'll take a look and see if we can tighten
up the sanity check. It's worth noting that one thing friarbird does is
skip the install-check stage - it's almost certainly not going to have
terribly much interesting to tell us from that, given it has already run
a plain "make check".
How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems
kinda nuts.
cheers
andrew
On 15 May 2014, 19:46, Andrew Dunstan wrote:
On 05/15/2014 12:43 PM, Tomas Vondra wrote:
Hi all,
Today I got a few errors like these (this one is from last week, though):
Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT
Content:
snapshot to old: Wed May 7 04:36:57 2014 GMT
on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.
If I understand the Perl code correctly, it does this:
(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?
The 24 hours runs from the time of the latest commit on the branch in
question, not the current time, but basically yes.
We've never had machines with runs that long. The longest in recent
times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
around 4.5 hours. But we have had misconfigured machines reporting
unbelievable snapshot times. I'll take a look and see if we can tighten
up the sanity check. It's worth noting that one thing friarbird does is
skip the install-check stage - it's almost certainly not going to have
terribly much interesting to tell us from that, given it has already run
a plain "make check".How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems
kinda nuts.
I don't know. According to this comment from cache/inval.c, it's expected
to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS.
/*
* Test code to force cache flushes anytime a flush could happen.
*
* If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a
* fairly thorough test that the system contains no cache-flush hazards.
* However, it also makes the system unbelievably slow --- the regression
* tests take about 100 times longer than normal.
*
* If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This
* slows things by at least a factor of 10000, so I wouldn't suggest
* trying to run the entire regression tests that way. It's useful to try
* a few simple tests, to make sure that cache reload isn't subject to
* internal cache-flush hazards, but after you've done a few thousand
* recursive reloads it's unlikely you'll learn more.
*/
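(For completeness: as far as I know, enabling this on an animal just means
adding the define to CPPFLAGS in the client's build-farm.conf; a rough sketch,
the exact key names may differ from the real config file:)

# Hypothetical excerpt from build-farm.conf; key names assumed, not verified.
config_env => {
    CPPFLAGS => '-DCLOBBER_CACHE_RECURSIVELY',
},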
regards
Tomas
On 05/15/2014 03:57 PM, Tomas Vondra wrote:
How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems
kinda nuts.
I don't know. According to this comment from cache/inval.c, it's expected
to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS.
/*
* Test code to force cache flushes anytime a flush could happen.
*
* If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a
* fairly thorough test that the system contains no cache-flush hazards.
* However, it also makes the system unbelievably slow --- the regression
* tests take about 100 times longer than normal.
*
* If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This
* slows things by at least a factor of 10000, so I wouldn't suggest
* trying to run the entire regression tests that way. It's useful to try
* a few simple tests, to make sure that cache reload isn't subject to
* internal cache-flush hazards, but after you've done a few thousand
* recursive reloads it's unlikely you'll learn more.
*/
Yes, I've seen that. Frankly, a test that takes something like 500 hours
is a bit crazy.
If we really want to run this in the buildfarm we should probably try to
create a massively cut down test schedule for use in this case.
Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining
CLOBBER_FREED_MEMORY?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining
CLOBBER_FREED_MEMORY?
The former does need the latter or it's not very thorough. However,
CLOBBER_FREED_MEMORY is defined automatically by --enable-cassert,
so you shouldn't need to use a -D switch for it.
regards, tom lane
On 05/15/2014 07:46 PM, Andrew Dunstan wrote:
On 05/15/2014 12:43 PM, Tomas Vondra wrote:
Hi all,
Today I got a few errors like these (this one is from last week, though):
Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT
Content:
snapshot to old: Wed May 7 04:36:57 2014 GMT
on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.
If I understand the Perl code correctly, it does this:
(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?
The 24 hours runs from the time of the latest commit on the branch in
question, not the current time, but basically yes.
We've never had machines with runs that long. The longest in recent
times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
around 4.5 hours. But we have had misconfigured machines reporting
unbelievable snapshot times. I'll take a look and see if we can tighten
up the sanity check. It's worth noting that one thing friarbird does is
skip the install-check stage - it's almost certainly not going to have
terribly much interesting to tell us from that, given it has already run
a plain "make check".
Well, I'm not sure about "misconfigured", but both my personal
buildfarm members and the pginfra-run ones (like gaibasaurus) got errors
complaining about "snapshot too old" in the past for long-running tests,
so I'm not sure it is really a case of "we never had machines with runs
that long". So maybe we should not reject those submissions at submission
time, but rather mark them clearly on the dashboard and leave the final
interpretation to a human...
Stefan
On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote:
On 05/15/2014 07:46 PM, Andrew Dunstan wrote:
On 05/15/2014 12:43 PM, Tomas Vondra wrote:
Hi all,
Today I got a few errors like these (this one is from last week, though):
Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT
Content:
snapshot to old: Wed May 7 04:36:57 2014 GMT
on the new buildfarm animals. I believe it was my mistake (incorrectly
configured local git mirror), but it got me thinking about how this will
behave with the animals running CLOBBER_CACHE_RECURSIVELY.
If I understand the Perl code correctly, it does this:
(1) update the repository
(2) run the tests
(3) check that the snapshot is not older than 24 hours (pgstatus.pl:188)
(4) fail if older
Now, imagine that the test runs for days/weeks. This pretty much means
it's wasted, because the results will be thrown away anyway, no?
The 24 hours runs from the time of the latest commit on the branch in
question, not the current time, but basically yes.
We've never had machines with runs that long. The longest in recent
times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes
around 4.5 hours. But we have had misconfigured machines reporting
unbelievable snapshot times. I'll take a look and see if we can tighten
up the sanity check. It's worth noting that one thing friarbird does is
skip the install-check stage - it's almost certainly not going to have
terribly much interesting to tell us from that, given it has already run
a plain "make check".well I'm not sure about about "misconfigured" but both my personal
buildfarm members and pginfra run ones (like gaibasaurus) got errors
complaining about "snapshot too old" in the past for long running tests
so I'm not sure it is really a "we never had machine with runs that
long". So maybe we should not reject those submissions at submission
time but rather mark them clearly on the dashboard and leave the final
interpretation to a human...
That's a LOT harder and more work to arrange. Frankly, there are more
important things to do.
I would like to know the circumstances of these very long runs. I drive
some of my VMs pretty hard on pretty modest hardware, and they don't
come close to running 24 hours.
The current behaviour goes back to this commit from December 2011:
commit a8b5049e64f9cb08f8e165d0737139dab74e3bce
Author: Andrew Dunstan <andrew@dunslane.net>
Date: Wed Dec 14 14:38:44 2011 -0800
Use git snapshot instead of fixed 10 day timeout.
The sanity checks made sure that an animal wasn't submitting a
snapshot that was too old. But sometimes an old branch doesn't
get any changes for more than 10 days. So accept a snapshot that
is not more than 1 day older than the last known snapshot. Per
complaint from Stefan.
I'm prepared to increase the sanity check time if there is a serious
demand for it, but I'd like to know what to increase it to.
cheers
andrew
On 15.5.2014 22:56, Andrew Dunstan wrote:
On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote:
Well, I'm not sure about "misconfigured", but both my personal
buildfarm members and the pginfra-run ones (like gaibasaurus) got errors
complaining about "snapshot too old" in the past for long-running tests,
so I'm not sure it is really a case of "we never had machines with runs
that long". So maybe we should not reject those submissions at submission
time, but rather mark them clearly on the dashboard and leave the final
interpretation to a human...
That's a LOT harder and more work to arrange. Frankly, there are more
important things to do.
I would like to know the circumstances of these very long runs. I drive
some of my VMs pretty hard on pretty modest hardware, and they don't
come close to running 24 hours.
The current behaviour goes back to this commit from December 2011:
commit a8b5049e64f9cb08f8e165d0737139dab74e3bce
Author: Andrew Dunstan <andrew@dunslane.net>
Date: Wed Dec 14 14:38:44 2011 -0800
Use git snapshot instead of fixed 10 day timeout.
The sanity checks made sure that an animal wasn't submitting a
snapshot that was too old. But sometimes an old branch doesn't
get any changes for more than 10 days. So accept a snapshot that
is not more than 1 day older than the last known snapshot. Per
complaint from Stefan.
I'm prepared to increase the sanity check time if there is a serious
demand for it, but I'd like to know what to increase it to.
I doubt there's a "one size fits all" limit. If the machines running
"recursive clobber" tests need tens of days to complete the tests, then
that limit is pretty useless to most regular animals.
So what about keeping the current value for most animals, but allowing
an override for some selected ones? I'd expect this to be much simpler
to implement, and it shouldn't require any human intervention.
Tomas
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
If we really want to run this in the buildfarm we should probably
try to create a massively cut down test schedule for use in this
case.
If we can cut this down in a meaningful way (i.e. without
sacrificing most of the benefits) then sure - let's do that. But I think
that's what CLOBBER_CACHE_ALWAYS is about.
regards
Tomas
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
I have made a change that omits the snapshot sanity check for
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
cheers
andrew
On 17.5.2014 19:58, Andrew Dunstan wrote:
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
I have made a change that omits the snapshot sanity check for
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
OK, thanks. Seems reasonable.
Tomas
On 17.5.2014 22:35, Tomas Vondra wrote:
On 17.5.2014 19:58, Andrew Dunstan wrote:
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
I have made a change that omits the snapshot sanity check for
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>OK, thanks. Seems reasonable.
Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
The problem is that the git mirror is refreshed only at the very
beginning, and while a single branch does not exceed the limit, all the
branches do.
Could this be solved by keeping a local mirror, without a mirror in the
build root? I mean, something like this:
git_keep_mirror => 0,
scmrepo => '/path/to/local/mirror'
And of course a cron script updating the mirror every hour or so.
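(A rough sketch of what I have in mind, with a made-up path; the two config
keys are the ones mentioned above, and the hourly refresh would live in cron:)

# Hypothetical excerpt from the animal's build-farm.conf (path made up):
git_keep_mirror => 0,
scmrepo         => '/var/lib/pgbuildfarm/postgresql.git',

# plus a crontab entry refreshing that mirror every hour, e.g.:
# 0 * * * *  cd /var/lib/pgbuildfarm/postgresql.git && git fetch --quiet origin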
regards
Tomas
On 05/19/2014 03:40 PM, Tomas Vondra wrote:
On 17.5.2014 22:35, Tomas Vondra wrote:
On 17.5.2014 19:58, Andrew Dunstan wrote:
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
I have made a change that omits the snapshot sanity check for
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>OK, thanks. Seems reasonable.
Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
The problem is that the git mirror is refreshed only at the very
beginning, and while a single branch does not exceed the limit, all the
branches do.
Could this be solved by keeping a local mirror, without a mirror in the
build root? I mean, something like this:
git_keep_mirror => 0,
scmrepo => '/path/to/local/mirror'
And of course a cron script updating the mirror every hour or so.
No, the git mirror should be refreshed at the start of each branch
build. It's not done at all by run_branches.pl. So your premise is
false. This should only happen if the actual run takes more than 24 hours.
cheers
andrew
On 19.5.2014 23:04, Andrew Dunstan wrote:
On 05/19/2014 03:40 PM, Tomas Vondra wrote:
On 17.5.2014 22:35, Tomas Vondra wrote:
On 17.5.2014 19:58, Andrew Dunstan wrote:
On 05/15/2014 07:47 PM, Tomas Vondra wrote:
On 15.5.2014 22:07, Andrew Dunstan wrote:
Yes, I've seen that. Frankly, a test that takes something like 500
hours is a bit crazy.
Maybe. It certainly is not a test people will use during development.
But if it can detect some hard-to-find errors in the code, that might
possibly lead to serious problems, then +1 from me to run them at
least
on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
Also, once you know where it fails the developer can run just that
single test (which might take minutes/hours, but not days).
I have made a change that omits the snapshot sanity check for
CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
<https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>OK, thanks. Seems reasonable.
Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
The problem is that the git mirror is refreshed only at the very
beginning, and while a single branch does not exceed the limit, all the
branches do.
Could this be solved by keeping a local mirror, without a mirror in the
build root? I mean, something like this:
git_keep_mirror => 0,
scmrepo => '/path/to/local/mirror'
And of course a cron script updating the mirror every hour or so.
No, the git mirror should be refreshed at the start of each branch
build. It's not done at all by run_branches.pl. So your premise is
false. This should only happen if the actual run takes more than 24 hours.
OK. I think I understand what's wrong. This is a summary of the log from
'leech' (attached is the full log):
========================================================================
Sat May 17 23:43:24 2014: buildfarm run for leech:REL8_4_STABLE starting
[23:43:24] checking out source ...
...
[04:32:47] OK
Sun May 18 04:32:51 2014: buildfarm run for leech:REL9_0_STABLE starting
[04:32:51] checking out source ...
...
[08:58:57] OK
Sun May 18 08:59:01 2014: buildfarm run for leech:REL9_1_STABLE starting
[08:59:01] checking out source ...
...
[14:09:08] OK
Sun May 18 14:09:12 2014: buildfarm run for leech:REL9_2_STABLE starting
[14:09:12] checking out source ...
...
[00:13:59] OK
Mon May 19 00:14:04 2014: buildfarm run for leech:REL9_3_STABLE starting
[00:14:04] checking out source ...
...
[14:26:29] OK
Query for: stage=OK&animal=leech&ts=1400451244
Target:
http://www.pgbuildfarm.org/cgi-bin/pgstatus.pl/28116b975a4186275d83f7e0f5c3fc92b1a75e85
Status Line: 493 snapshot too old: Fri May 16 07:05:50 2014 GMT
Content:
snapshot to old: Fri May 16 07:05:50 2014 GMT
Web txn failed with status: 1
Mon May 19 14:26:31 2014: buildfarm run for leech:HEAD starting
[14:26:31] checking out source ...
...
[20:07:23] checking test-decoding
Buildfarm member leech failed on HEAD stage test-decoding-check
========================================================================
So the REL9_3_STABLE starts at 00:14, completes at 14:26 (i.e. 14h
runtime), but fails with 'snapshot too old'.
These are the commits for each branch:
REL8_4_STABLE => 1dd0b3eeccaffd33b9c970a91c53fe42692ce8c2 (May 15, 2014)
REL9_0_STABLE => 0fc94340753f19da8acca5fc53039adbf2fa3632 (May 16, 2014)
REL9_1_STABLE => 39b3739c05688b5cd5d5da8c52fa5476304eff11 (May 16, 2014)
REL9_2_STABLE => 0d4c75f4de68012bb6f3bc52ebb58234334259d2 (May 16, 2014)
REL9_3_STABLE => d6a9767404cfee7f037a58e445b601af5837e4a5 (May 16, 2014)
HEAD => f097d70b7227c1f9aa2ed0af1d6deb05367ef657 (May 19, 2014)
IMHO the problem is that d6a97674 was the last revision in the
REL9_3_STABLE branch when the test started (00:14), but at 06:06
777d07d7 got committed. So the check at the end failed, because the
tested revision was suddenly ~2 days over the limit.
This seems wrong to me, because even a very fast run that checked out
just before that commit (e.g. starting at 06:00, finishing at 06:10)
would fail exactly like this.
This is more probable on the old stable branches, because the commits
are not that frequent (on HEAD the commits are usually less than a few
hours apart, so the new one won't obsolete the previous one). It's also
made more likely to hit by the long runtime, because it increases the
probability something will be committed into the branch. And it also
makes it more "expensive" because it effectively throws all the cpu time
to /dev/null.
regards
Tomas
On 05/19/2014 05:37 PM, Tomas Vondra wrote:
IMHO the problem is that d6a97674 was the last revision in the
REL9_3_STABLE branch when the test started (00:14), but at 06:06
777d07d7 got committed. So the check at the end failed, because the
tested revision was suddenly ~2 days over the limit.
This seems wrong to me, because even a very fast run that checked out
just before that commit (e.g. starting at 06:00, finishing at 06:10)
would fail exactly like this.
This is more probable on the old stable branches, because the commits
are not that frequent (on HEAD the commits are usually less than a few
hours apart, so the new one won't obsolete the previous one). It's also
made more likely to hit by the long runtime, because it increases the
probability something will be committed into the branch. And it also
makes it more "expensive" because it effectively throws all the cpu time
to /dev/null.
Well, the original code was put in for a reason, presumably that we were
getting some stale data and wanted to exclude it. So I'm unwilling to
throw it out altogether. If someone can propose a reasonable sanity
check then I'm prepared to implement it.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Well, the original code was put in for a reason, presumably that we were
getting some stale data and wanted to exclude it. So I'm unwilling to
throw it out altogether. If someone can propose a reasonable sanity
check then I'm prepared to implement it.
Would it be practical to make the timeout configurable on a per-animal
basis? A small number of hours is certainly sufficient for most of the
critters, but for ones that are particularly slow or are doing CLOBBER
type tests, we could raise the limit.
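(Something along these lines on the server side, perhaps; a minimal sketch
with invented names and made-up numbers, not actual pgstatus.pl code.)

use strict;
use warnings;

# Illustrative per-animal override: fall back to a global default
# unless the animal has its own, larger limit configured.
my %max_snapshot_age_hours = (
    'friarbird' => 24,     # CLOBBER_CACHE_ALWAYS, ~4.5 hour runs
    'leech'     => 120,    # slower CLOBBER_CACHE_ALWAYS machine
);

sub allowed_age_hours
{
    my ($animal) = @_;
    return $max_snapshot_age_hours{$animal} // 24;    # default: 24 hours
}

print allowed_age_hours('leech'), "\n";    # prints 120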
regards, tom lane
On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Well, the original code was put in for a reason, presumably that we were
getting some stale data and wanted to exclude it. So I'm unwilling to throw
it out altogether. If someone can propose a reasonable sanity check then I'm
prepared to implement it.
While I generally agree that long-established code shouldn't be
changed for light or transient causes, I have to admit I'm pretty
skeptical about this particular instance. I can't think of any
particularly compelling reason why it's BAD for an old result to show
up. We now show the commit ID on the main page, so if you see 512abc4
in the middle of a bunch of ef9ab5f's, you'll notice. And if you
don't notice, so what?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Well, the original code was put in for a reason, presumably that we were
getting some stale data and wanted to exclude it. So I'm unwilling to throw
it out altogether. If someone can propose a reasonable sanity check then I'm
prepared to implement it.
While I generally agree that long-established code shouldn't be
changed for light or transient causes, I have to admit I'm pretty
skeptical about this particular instance. I can't think of any
particularly compelling reason why it's BAD for an old result to show
up. We now show the commit ID on the main page, so if you see 512abc4
in the middle of a bunch of ef9ab5f's, you'll notice. And if you
don't notice, so what?
Robert's got a point here. In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's"). That is an issue of incorrect system clock, not of how long
it takes to do the run. I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere? Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
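(The latter could be as simple as something like this in the client, assuming
it runs inside the checked-out repo; a sketch only, with an arbitrary slack.)

use strict;
use warnings;

# Sketch: warn if the system clock is behind the newest commit in the
# local repository, which should never happen with a sane clock.
my $newest_commit = `git log -1 --format=%ct`;    # committer time, Unix epoch
chomp $newest_commit;
my $slack = 5 * 60;                               # 5 minutes, arbitrary
if ($newest_commit =~ /^\d+$/ && time() < $newest_commit - $slack)
{
    warn "system clock appears to be behind the newest commit timestamp\n";
}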
regards, tom lane
On 05/20/2014 07:09 AM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Well, the original code was put in for a reason, presumably that we were
getting some stale data and wanted to exclude it. So I'm unwilling to throw
it out altogether. If someone can propose a reasonable sanity check then I'm
prepared to implement it.
While I generally agree that long-established code shouldn't be
changed for light or transient causes, I have to admit I'm pretty
skeptical about this particular instance. I can't think of any
particularly compelling reason why it's BAD for an old result to show
up. We now show the commit ID on the main page, so if you see 512abc4
in the middle of a bunch of ef9ab5f's, you'll notice. And if you
don't notice, so what?
Robert's got a point here. In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's"). That is an issue of incorrect system clock, not of how long
it takes to do the run. I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere? Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
Showing the commit id is a relatively recent phenomenon, dating back to
July 2013. I agree with Robert that it might obsolete this check, so I
have disabled it for now. I have also disabled the other timestamp
check, on the time the client actually took the snapshot (as opposed to
the time of the last commit in the snapshot) for the
CLOBBER_CACHE_RECURSIVELY case.
Regarding clock skew, I think we can do better than what you suggest.
The web transaction code in the client adds its own timestamp just
before running the web transaction. It would be quite reasonable to
reject reports from machines with skewed clocks based on this value. I'm
not sure what a reasonable skew might be. Somewhere in the range of 5 to
15 minutes seems reasonable.
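(In other words, roughly this on the server side; the names and the 10 minute
threshold are just placeholders, not the real web transaction code.)

use strict;
use warnings;

# Illustrative skew check: compare the timestamp the client stamped on
# the web transaction against the server's clock at receipt time.
my $max_skew = 10 * 60;    # placeholder: 10 minutes

sub clock_skew_ok
{
    my ($client_web_epoch) = @_;              # timestamp sent by the client
    return abs(time() - $client_web_epoch) <= $max_skew;
}

print "rejecting report: client clock skewed\n"
    unless clock_skew_ok(time() - 3600);      # e.g. a client running an hour slow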
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 05/20/2014 07:09 AM, Tom Lane wrote:
Robert's got a point here. In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's"). That is an issue of incorrect system clock, not of how long
it takes to do the run. I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere? Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
Regarding clock skew, I think we can do better than what you suggest.
The web transaction code in the client adds its own timestamp just
before running the web transaction. It would be quite reasonable to
reject reports from machines with skewed clocks based on this value. I'm
not sure what a reasonable skew might be. Somewhere in the range of 5 to
15 minutes seems reasonable.
Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
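(That is, something like the following instead of a hard rejection; again
just a sketch with invented names.)

use strict;
use warnings;

# Sketch: shift the claimed snapshot time by the observed skew between
# the client's web-transaction timestamp and the server clock.
sub adjust_for_skew
{
    my ($snapshot_epoch, $client_web_epoch, $server_epoch) = @_;
    my $skew = $server_epoch - $client_web_epoch;    # positive if client clock is slow
    return $snapshot_epoch + $skew;
}

# A client reporting an hour slow has its snapshot time moved forward an hour.
print adjust_for_skew(1_400_451_244, time() - 3600, time()), "\n";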
regards, tom lane
On 21/05/14 01:42, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 05/20/2014 07:09 AM, Tom Lane wrote:
Robert's got a point here. In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's"). That is an issue of incorrect system clock, not of how long
it takes to do the run. I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere? Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
Regarding clock skew, I think we can do better than what you suggest.
The web transaction code in the client adds its own timestamp just
before running the web transaction. It would be quite reasonable to
reject reports from machines with skewed clocks based on this value. I'm
not sure what a reasonable skew might be. Somewhere in the range of 5 to
15 minutes seems reasonable.
Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
regards, tom lane
I think that, if possible, any such adjustment should be noted along
with the original time, so that:
1. the timing issue can be remedied
2. it is possible to link the output to any messages in the machine's
log etc.
Cheers,
Gavin
On 05/20/2014 03:59 PM, Gavin Flower wrote:
On 21/05/14 01:42, Tom Lane wrote:
Andrew Dunstan<andrew@dunslane.net> writes:
On 05/20/2014 07:09 AM, Tom Lane wrote:
Robert's got a point here. In my usage, the annoying thing is not animals
that take a long time to report in; it's the ones that lie about the
snapshot time (which is how you get "512abc4 in the middle of a bunch of
ef9ab5f's"). That is an issue of incorrect system clock, not of how long
it takes to do the run. I wonder if the buildfarm script could be taught
to get the timestamp from an NTP server somewhere? Or at least
sanity-check the system clock reading by comparing it to the newest commit
timestamp in the git repo.
Regarding clock skew, I think we can do better than what you suggest.
The web transaction code in the client adds its own timestamp just
before running the web transaction. It would be quite reasonable to
reject reports from machines with skewed clocks based on this value. I'm
not sure what a reasonable skew might be. Somewhere in the range of 5 to
15 minutes seems reasonable.
Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
I think that, if possible, any such adjustment should be noted along
with the original time, so that:
1. the timing issue can be remedied
2. it is possible to link the output to any messages in the machine's
log etc.
I don't see how that's going to help anyone. Major clock skew is a sign
of client misconfiguration. And where would we note this adjustment, and
who would do anything about it?
We seem to be engaging in a sort of PoohBearism* here. More information
is not always better.
cheers
andrew
*
Rabbit: Would you like *condensed milk*, or *honey* on your bread?
Winnie the Pooh: Both. But never mind the bread, please.
On 05/20/2014 09:42 AM, Tom Lane wrote:
Regarding clock skew, I think we can do better than what you suggest.
The web transaction code in the client adds its own timestamp just
before running the web transaction. It would be quite reasonable to
reject reports from machines with skewed clocks based on this value. I'm
not sure what a reasonable skew might be. Somewhere in the range of 5 to
15 minutes seems reasonable.
Rather than reject, why not take the result and adjust the claimed start
timestamp by the difference between the web transaction timestamp and the
buildfarm server's time?
Right now I have added some logging of clock skew to see what we're
actually getting.
It won't be visible in the Web UI, but when I have some data I will post
more info.
cheers
andrew