Recent eelpout failures on 9.x branches
For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE
The failures all look pretty alike:
# Running: pg_basebackup -D /home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs -X stream
pg_basebackup: could not send copy-end packet: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs
What shows up in the postmaster log is:
2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG: connection received: host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG: replication connection authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG: using stale statistics instead of current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG: terminating walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG: disconnection: session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG: disconnection: session time: 0:01:40.933 user=tmunro database= host=[local]
The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.
Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout. But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.
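(If we do raise it, here's a minimal sketch of one way the TAP test itself could do that, assuming the 9.6-style PostgresNode API; the 600s value and the exact placement are only illustrative, not code from any branch:)
# in src/bin/pg_basebackup/t/010_pg_basebackup.pl, after the node is set up
# (hypothetical snippet; the value is chosen arbitrarily to outlast a slow machine)
$node->append_conf('postgresql.conf', "wal_sender_timeout = '600s'");
$node->restart;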
Any thoughts?
regards, tom lane
On Wed, Dec 2, 2020 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout. But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.
Unfortunately, eelpout got kicked off the nice shiny ARM server it was
running on, so last week I moved it to a rack-mounted Raspberry Pi. It
seems to be totally I/O-starved, causing some timeouts to be reached,
and I'm looking into fixing that by adding fast storage. This may
take a few days. Sorry for the noise.
Thomas Munro <thomas.munro@gmail.com> writes:
Unfortunately, eelpout got kicked off the nice shiny ARM server it was
running on, so last week I moved it to a rack-mounted Raspberry Pi. It
seems to be totally I/O-starved, causing some timeouts to be reached,
and I'm looking into fixing that by adding fast storage. This may
take a few days. Sorry for the noise.
Ah-hah. Now that I look, eelpout is very clearly slower overall
than it was a couple weeks ago, so all is explained.
It might still be reasonable to raise wal_sender_timeout in the
buildfarm environment, though. We usually try to make sure that
buildfarm timeouts border on ridiculous, not just because of
underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
animals.
I'm also wondering a bit why the issue isn't affecting the newer
branches. It's certainly not because we made the test shorter ...
regards, tom lane
On Tue, Dec 01, 2020 at 06:07:17PM -0500, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Unfortunately, eelpout got kicked off the nice shiny ARM server it was
running on, so last week I moved it to a rack-mounted Raspberry Pi. It
seems to be totally I/O-starved, causing some timeouts to be reached,
and I'm looking into fixing that by adding fast storage. This may
take a few days. Sorry for the noise.
Ah-hah. Now that I look, eelpout is very clearly slower overall
than it was a couple weeks ago, so all is explained.
It might still be reasonable to raise wal_sender_timeout in the
buildfarm environment, though. We usually try to make sure that
buildfarm timeouts border on ridiculous, not just because of
underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
animals.
My buildfarm animals override these:
extra_config => {
    DEFAULT => [
        "authentication_timeout = '600s'",
        "wal_receiver_timeout = '18000s'",
        "wal_sender_timeout = '18000s'",
    ],
},
build_env => {
    PGCTLTIMEOUT => 18000,
},
Each of those timeouts caused failures before I changed it. For animals fast
enough to make them irrelevant, I've not yet encountered a disadvantage.
I'm also wondering a bit why the issue isn't affecting the newer
branches. It's certainly not because we made the test shorter ...
That is peculiar.
On Wed, Dec 2, 2020 at 12:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm also wondering a bit why the issue isn't affecting the newer
branches. It's certainly not because we made the test shorter ...
I looked at htop while it was building the 9.x branches and saw
pg_basebackup sitting in D state waiting for glacial I/O. Perhaps
this was improved by the --no-sync option that got sprinkled around
the place in later releases?
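(For comparison, a rough sketch of the kind of invocation I mean, written against the PostgresNode-style command_ok helper; $tempdir and the test name here are placeholders rather than the exact code in any branch:)
# later branches pass --no-sync, so pg_basebackup skips fsyncing the backup files
$node->command_ok(
    [ 'pg_basebackup', '-D', "$tempdir/backupxs", '-X', 'stream', '--no-sync' ],
    'pg_basebackup -X stream runs');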