Continuous buildfarm failures on hamster with bin-check
Hi all,
Some of you may have noticed that hamster is heavily red on the
buildfarm. I have done a bit of investigation, and I am able to
reproduce the failure manually. Looking at the logs, the error shows
up plainly:
2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
accept connections
2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
instead of current ones because stats collector is not responding
2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
terminating walsender process due to replication timeout
2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
instead of current ones because stats collector is not responding
The stale statistics mean that the system is so constrained that
things are timing out.
In order to avoid such failures with the normal regression tests, I have
set up extra_config so that stats_temp_directory points to a tmpfs, which
avoids the stale statistics; but it is not possible to set that up for the
TAP tests. I could always disable --enable-tap-tests on this machine,
but I don't think that this is the correct answer. Something that could
be done is to use an environment variable to set up a base directory
for all the nodes created in PostgresNode.pm, and use that for
temporary statistics with a custom folder name. But that's not
scalable if we want to enforce more parameters.
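For reference, this is roughly what the extra_config workaround mentioned
above looks like in the animal's build-farm.conf; the tmpfs mount point
below is only an illustration:

    extra_config => {
        DEFAULT => [
            # point the stats temp files at a tmpfs so the collector keeps up
            q(stats_temp_directory = '/mnt/ramdisk/pgstat_tmp'),
        ],
    },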
Thoughts?
--
Michael
Hi,
On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. Looking at the logs, the error shows
> up plainly:
>
> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
> instead of current ones because stats collector is not responding
>
> The stale statistics mean that the system is so constrained that
> things are timing out.
>
> In order to avoid such failures with the normal regression tests, I have
> set up extra_config so that stats_temp_directory points to a tmpfs, which
> avoids the stale statistics

How high do you need to make the hardcoded limit for this to succeed
without a tmpfs? If hamster fails this regularly I think we have to do
something about it, rather than paper over it. What's the storage
situation currently like?
Greetings,
Andres Freund
On Tue, Apr 18, 2017 at 4:15 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-04-18 16:07:38 +0900, Michael Paquier wrote:
>> Some of you may have noticed that hamster is heavily red on the
>> buildfarm. I have done a bit of investigation, and I am able to
>> reproduce the failure manually. Looking at the logs, the error shows
>> up plainly:
>>
>> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
>> accept connections
>> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
>> instead of current ones because stats collector is not responding
>> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
>> terminating walsender process due to replication timeout
>> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
>> instead of current ones because stats collector is not responding
>>
>> The stale statistics mean that the system is so constrained that
>> things are timing out.
>>
>> In order to avoid such failures with the normal regression tests, I have
>> set up extra_config so that stats_temp_directory points to a tmpfs, which
>> avoids the stale statistics
>
> How high do you need to make the hardcoded limit for this to succeed
> without a tmpfs?

Increasing wal_sender_timeout visibly reduces the failure rate: out of
10 attempts I could see at least 3 failures before the change, and none
after it.

> If hamster fails this regularly I think we have to do
> something about it, rather than paper over it. What's the storage
> situation currently like?

The SD card of this RPI is half-empty.
--
Michael
On 04/18/2017 03:07 AM, Michael Paquier wrote:
> Hi all,
>
> Some of you may have noticed that hamster is heavily red on the
> buildfarm. I have done a bit of investigation, and I am able to
> reproduce the failure manually. Looking at the logs, the error shows
> up plainly:
>
> 2017-04-16 05:07:19.650 JST [18282] LOG: database system is ready to
> accept connections
> 2017-04-16 05:08:36.725 JST [18296] LOG: using stale statistics
> instead of current ones because stats collector is not responding
> 2017-04-16 05:10:22.207 JST [18303] t/010_pg_basebackup.pl LOG:
> terminating walsender process due to replication timeout
> 2017-04-16 05:10:30.180 JST [18306] LOG: using stale statistics
> instead of current ones because stats collector is not responding
>
> The stale statistics mean that the system is so constrained that
> things are timing out.
>
> In order to avoid such failures with the normal regression tests, I have
> set up extra_config so that stats_temp_directory points to a tmpfs, which
> avoids the stale statistics,

Yeah, but the way you have done it could also lead to errors unless
you're very careful, as I found on axolotl (which died recently,
unfortunately). There I had to set stats_temp_directory to a
branch-specific name so a crash on branch A didn't leave stats picked up
by a run on branch B. I had a script that ran before every buildfarm run
that made sure the branch-specific directories existed on the tmpfs.
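For the record, such a pre-run helper can be a sketch along these lines
(the mount point and the way branch names are passed in are illustrative):

    #!/usr/bin/perl
    # Make sure a branch-specific stats temp directory exists on the
    # tmpfs before each buildfarm run.
    use strict;
    use warnings;
    use File::Path qw(make_path);

    my $tmpfs = '/mnt/ramdisk';
    foreach my $branch (@ARGV)    # e.g. HEAD REL9_6_STABLE ...
    {
        make_path("$tmpfs/pgstat_$branch");
    }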
> but it is not possible to set that up for the TAP tests. I could always
> disable --enable-tap-tests on this machine, but I don't think that this
> is the correct answer. Something that could be done is to use an
> environment variable to set up a base directory for all the nodes
> created in PostgresNode.pm, and use that for temporary statistics with
> a custom folder name. But that's not scalable if we want to enforce
> more parameters.

What more parameters do you want?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> Yeah, but the way you have done it could also lead to errors unless
> you're very careful, as I found on axolotl (which died recently,
> unfortunately). There I had to set stats_temp_directory to a
> branch-specific name so a crash on branch A didn't leave stats picked up
> by a run on branch B. I had a script that ran before every buildfarm run
> that made sure the branch-specific directories existed on the tmpfs.

I didn't consider that. Still, hamster is only set up to run master, so
that's not a problem. Running more branches would be too much for this
little dear, as its runs include the TAP tests.

>> but it is not possible to set that up for the TAP tests. I could always
>> disable --enable-tap-tests on this machine, but I don't think that this
>> is the correct answer. Something that could be done is to use an
>> environment variable to set up a base directory for all the nodes
>> created in PostgresNode.pm, and use that for temporary statistics with
>> a custom folder name. But that's not scalable if we want to enforce
>> more parameters.
>
> What more parameters do you want?

Increasing wal_sender_timeout and wal_receiver_timeout can help in
reducing the failures seen.
--
Michael
On 04/18/2017 08:23 AM, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, but the way you have done it could also lead to errors unless
>> you're very careful, as I found on axolotl (which died recently,
>> unfortunately). There I had to set stats_temp_directory to a
>> branch-specific name so a crash on branch A didn't leave stats picked up
>> by a run on branch B. I had a script that ran before every buildfarm run
>> that made sure the branch-specific directories existed on the tmpfs.
>
> I didn't consider that. Still, hamster is only set up to run master, so
> that's not a problem. Running more branches would be too much for this
> little dear, as its runs include the TAP tests.
>
>>> but it is not possible to set that up for the TAP tests. I could always
>>> disable --enable-tap-tests on this machine, but I don't think that this
>>> is the correct answer. Something that could be done is to use an
>>> environment variable to set up a base directory for all the nodes
>>> created in PostgresNode.pm, and use that for temporary statistics with
>>> a custom folder name. But that's not scalable if we want to enforce
>>> more parameters.
>>
>> What more parameters do you want?
>
> Increasing wal_sender_timeout and wal_receiver_timeout can help in
> reducing the failures seen.

OK, but you're only talking about a handful of these, right?
Let's say we have a bunch of possible environment settings with names
that all begin with "PG_TAP_". PostgresNode.pm could check for the
existence of these and take action accordingly, and you could set them
on a buildfarm animal in the config file, or for interactive use in your
.profile.
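A minimal sketch of what that check could look like in PostgresNode.pm
(the helper name and the PG_TAP_<guc> naming scheme are only illustrative,
not an existing convention):

    # Turn PG_TAP_<guc>=<value> environment variables into settings
    # appended to the node's postgresql.conf at init time.
    sub _apply_tap_env_settings
    {
        my ($self) = @_;
        foreach my $var (sort grep { /^PG_TAP_/ } keys %ENV)
        {
            # e.g. PG_TAP_wal_sender_timeout=180s
            (my $guc = $var) =~ s/^PG_TAP_//;
            $self->append_conf('postgresql.conf', "$guc = '$ENV{$var}'");
        }
    }

A buildfarm animal would then export the variables from its config file,
and a developer could do the same in .profile.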
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 04/18/2017 08:23 AM, Michael Paquier wrote:
>> On Tue, Apr 18, 2017 at 9:13 PM, Andrew Dunstan
>> <andrew.dunstan@2ndquadrant.com> wrote:
>>> Yeah, but the way you have done it could also lead to errors unless
>>> you're very careful, as I found on axolotl (which died recently,
>>> unfortunately). There I had to set stats_temp_directory to a
>>> branch-specific name so a crash on branch A didn't leave stats picked up
>>> by a run on branch B. I had a script that ran before every buildfarm run
>>> that made sure the branch-specific directories existed on the tmpfs.
>>
>> I didn't consider that. Still, hamster is only set up to run master, so
>> that's not a problem. Running more branches would be too much for this
>> little dear, as its runs include the TAP tests.
>>
>>>> but it is not possible to set that up for the TAP tests. I could always
>>>> disable --enable-tap-tests on this machine, but I don't think that this
>>>> is the correct answer. Something that could be done is to use an
>>>> environment variable to set up a base directory for all the nodes
>>>> created in PostgresNode.pm, and use that for temporary statistics with
>>>> a custom folder name. But that's not scalable if we want to enforce
>>>> more parameters.
>>>
>>> What more parameters do you want?
>>
>> Increasing wal_sender_timeout and wal_receiver_timeout can help in
>> reducing the failures seen.
>
> OK, but you're only talking about a handful of these, right?

Yup, that would be one solution but that's not attacking the problem
at its root.
> Let's say we have a bunch of possible environment settings with names
> that all begin with "PG_TAP_". PostgresNode.pm could check for the
> existence of these and take action accordingly, and you could set them
> on a buildfarm animal in the config file, or for interactive use in your
> .profile.

That's the point I am trying to make upthread: slow buildfarm animals
should have minimal impact on core code modifications. We could for
example have one environment variable that holds all the parameters to
modify in a single string, and append them at the end of
postgresql.conf. But honestly I don't think that this is necessary if
there is only one variable able to define a base directory for
temporary statistics, as the real bottleneck comes from there, at least
in the case of hamster. When initializing a node via PostgresNode.pm,
we would just check for this variable, and the init() routine would
create a temporary folder in it and set stats_temp_directory in
postgresql.conf.
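As a sketch, init() could do something like the following; the
PG_TAP_STATS_TEMP_BASE variable name is only a placeholder:

    # If a base directory is advertised through the environment, create a
    # per-node folder under it and point stats_temp_directory there.
    use File::Temp ();

    if (defined $ENV{PG_TAP_STATS_TEMP_BASE})
    {
        my $stats_dir = File::Temp::tempdir(
            "pgstat_XXXX",
            DIR     => $ENV{PG_TAP_STATS_TEMP_BASE},
            CLEANUP => 1);
        $self->append_conf('postgresql.conf',
            "stats_temp_directory = '$stats_dir'");
    }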
--
Michael
On 04/18/2017 08:59 AM, Michael Paquier wrote:
>> Let's say we have a bunch of possible environment settings with names
>> that all begin with "PG_TAP_". PostgresNode.pm could check for the
>> existence of these and take action accordingly, and you could set them
>> on a buildfarm animal in the config file, or for interactive use in your
>> .profile.
>
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that holds all the parameters to
> modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics, as the real bottleneck comes from there, at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine would
> create a temporary folder in it and set stats_temp_directory in
> postgresql.conf.

How is that going to deal with your wal_*_timeout etc settings?
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes:
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that holds all the parameters to
> modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics, as the real bottleneck comes from there, at least
> in the case of hamster.

FWIW, I'm a bit suspicious of relocating the temp stats directory as
being a reliable fix for this. It looks to me like hamster isn't that
much slower than gaur/pademelon, ie the same machine that was my primary
development machine for well over a decade, and on which I have NEVER
seen a "stats collector not responding" failure. Plus, if hamster's
main storage is SSD, that seems unlikely to be slower than the spinning
rust in gaur/pademelon. So I can't escape the suspicion that something
else is going on there. Seemingly-unexplainable stats collector problems
have been a bugaboo for a long time ...
regards, tom lane
Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> That's the point I am trying to make upthread: slow buildfarm animals
>> should have minimal impact on core code modifications. We could for
>> example have one environment variable that holds all the parameters to
>> modify in a single string, and append them at the end of
>> postgresql.conf. But honestly I don't think that this is necessary if
>> there is only one variable able to define a base directory for
>> temporary statistics, as the real bottleneck comes from there, at least
>> in the case of hamster.
>
> FWIW, I'm a bit suspicious of relocating the temp stats directory as
> being a reliable fix for this. It looks to me like hamster isn't that
> much slower than gaur/pademelon, ie the same machine that was my primary
> development machine for well over a decade, and on which I have NEVER
> seen a "stats collector not responding" failure. Plus, if hamster's
> main storage is SSD, that seems unlikely to be slower than the spinning
> rust in gaur/pademelon. So I can't escape the suspicion that something
> else is going on there. Seemingly-unexplainable stats collector problems
> have been a bugaboo for a long time ...

It's an SD card (the kind typically used in cameras and phones), not SSD.
Saying it's slow is an understatement. It's *excruciatingly* slow.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>> being a reliable fix for this.
>
> It's an SD card (the kind typically used in cameras and phones), not SSD.
> Saying it's slow is an understatement. It's *excruciatingly* slow.

Oh, I misread it ... but still, the modern definition of "excruciatingly
slow" doesn't seem all that far off what 90s-era hard drives could do.
It is clear from googling though that there's an enormous performance
range in SD cards' random write performance, eg wikipedia's entry has
a link to
http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
Seems like it's hard to judge this without knowing exactly which
SD card Michael has got in that thing.
regards, tom lane
On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>> being a reliable fix for this.
>>
>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>> Saying it's slow is an understatement. It's *excruciatingly* slow.
>
> Oh, I misread it ... but still, the modern definition of "excruciatingly
> slow" doesn't seem all that far off what 90s-era hard drives could do.
> It is clear from googling though that there's an enormous performance
> range in SD cards' random write performance, eg wikipedia's entry has
> a link to
> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
> Seems like it's hard to judge this without knowing exactly which
> SD card Michael has got in that thing.

A Professional microSDHC 8GB card, class 10, which is utterly slow in
everything I have tested up to now, and of course I cannot find its
read/write specifications. I had a SanDisk class 10 in it that died
some months ago, and that one was way faster. RIP.
--
Michael
On Wed, Apr 19, 2017 at 7:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 19, 2017 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Tom Lane wrote:
>>>> FWIW, I'm a bit suspicious of relocating the temp stats directory as
>>>> being a reliable fix for this.
>>>
>>> It's an SD card (the kind typically used in cameras and phones), not SSD.
>>> Saying it's slow is an understatement. It's *excruciatingly* slow.
>>
>> Oh, I misread it ... but still, the modern definition of "excruciatingly
>> slow" doesn't seem all that far off what 90s-era hard drives could do.
>> It is clear from googling though that there's an enormous performance
>> range in SD cards' random write performance, eg wikipedia's entry has
>> a link to
>> http://goughlui.com/2014/01/16/testing-sd-card-performance-round-up/
>> Seems like it's hard to judge this without knowing exactly which
>> SD card Michael has got in that thing.
>
> A Professional microSDHC 8GB card, class 10, which is utterly slow in
> everything I have tested up to now, and of course I cannot find its
> read/write specifications. I had a SanDisk class 10 in it that died
> some months ago, and that one was way faster. RIP.

Until something happens, I have removed --enable-tap-tests from the
configuration of this buildfarm machine. There is no point in leaving it
red without a clear resolution plan on the table first.
--
Michael
On Tue, Apr 18, 2017 at 09:59:26PM +0900, Michael Paquier wrote:
> On Tue, Apr 18, 2017 at 9:35 PM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>> On 04/18/2017 08:23 AM, Michael Paquier wrote:
>>> Increasing wal_sender_timeout and wal_receiver_timeout can help in
>>> reducing the failures seen.
>>
>> OK, but you're only talking about a handful of these, right?
>
> Yup, that would be one solution but that's not attacking the problem
> at its root.
>
>> Let's say we have a bunch of possible environment settings with names
>> that all begin with "PG_TAP_". PostgresNode.pm could check for the
>> existence of these and take action accordingly, and you could set them
>> on a buildfarm animal in the config file, or for interactive use in your
>> .profile.
>
> That's the point I am trying to make upthread: slow buildfarm animals
> should have minimal impact on core code modifications. We could for
> example have one environment variable that holds all the parameters to
> modify in a single string, and append them at the end of
> postgresql.conf. But honestly I don't think that this is necessary if
> there is only one variable able to define a base directory for
> temporary statistics, as the real bottleneck comes from there, at least
> in the case of hamster. When initializing a node via PostgresNode.pm,
> we would just check for this variable, and the init() routine would
> create a temporary folder in it and set stats_temp_directory in
> postgresql.conf.

Each of the above approaches has fairly low impact on the code, so we should
use other criteria to choose. I'd welcome a feature for augmenting every
postgresql.conf of every test suite (a generalization of "pg_regress
--temp-config", which has proven its value). I can envision using it with
force_parallel_mode, default_transaction_isolation, log_*, wal_*_timeout,
autovacuum_naptime, and others.
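For instance, on a slow animal such a facility (or today's --temp-config)
could be pointed at a file along these lines; the particular values and
path are only illustrative:

    # extra.conf, passed as pg_regress --temp-config=/path/to/extra.conf;
    # these settings get appended to the test cluster's postgresql.conf.
    wal_sender_timeout = 180s
    wal_receiver_timeout = 180s
    stats_temp_directory = '/mnt/ramdisk/pgstat_tmp'
    autovacuum_naptime = 10min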
Even for hamster, I'm skeptical that changing stats_temp_directory would
suffice. Every hamster BinInstallCheck failure since 2017-02-13 had a "LOG:
terminating walsender process due to replication timeout". Most, but not all,
of those replication timeouts followed a "LOG: using stale statistics instead
of current ones because stats collector is not responding". For the remaining
minority, I expect we will eventually need a higher wal_sender_timeout. Example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2017-02-24%2016%3A00%3A06