BUG #14321: pg_basebackup --xlog-method=stream fails

Started by Jürgen Strobel, over 9 years ago, 2 messages, bugs
#1 Jürgen Strobel
juergen+postgresql@strobel.info

The following bug has been logged on the website:

Bug reference: 14321
Logged by: Jürgen Strobel
Email address: juergen+postgresql@strobel.info
PostgreSQL version: 9.5.4
Operating system: CentOS7
Description:

Hello everyone,

Quite often while running pg_basebackup --xlog-method=stream I get the
following warning:

pg_basebackup: could not receive data from WAL stream: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The filesystem backup continues successfully to its end, but it concludes
without the necessary WAL files. I verified in pg_stat_replication that
pg_basebackup is not trying to reconnect to the master.

I am running this in a VM, taking backups of live ~300-900GB DBs.
Sometimes IO spikes seem to cause hangs longer than the server's
wal_sender_timeout, which is the default 60s. The VM has far fewer resources
than the upstream DB. I don't really want to increase wal_sender_timeout
because there are other (non-backup) HA standbys too, and I wouldn't know by
how much.

I understand how to repair this manually and it's not an end-of-the-world
bug, but it would be nice if pg_basebackup would just reconnect the
streaming WAL connection in the same way as pg_receivexlog does. Especially
as that error happens in a long script run by cron and/or other people who
do not have this insight.
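
For the cron case described above, a wrapper can at least refuse to treat
such a run as a good backup. The following is a minimal sketch, not part of
pg_basebackup: the helper names backup_is_usable and run_backup are
hypothetical, and it assumes the message quoted earlier in this report
appears on stderr. It checks both the exit status and the stderr text,
since the report does not say which exit status pg_basebackup returns here.

```python
import subprocess
import sys

# Substring of the message quoted in the report above.
WAL_STREAM_ERROR = "could not receive data from WAL stream"


def backup_is_usable(returncode, stderr_text):
    """Return True only if the command exited with status 0 and its
    stderr does not contain the WAL stream error message."""
    return returncode == 0 and WAL_STREAM_ERROR not in stderr_text


def run_backup(cmd):
    """Run a backup command (hypothetical helper), echo its stderr,
    and report whether the result looks usable."""
    proc = subprocess.run(cmd, stderr=subprocess.PIPE,
                          universal_newlines=True)
    sys.stderr.write(proc.stderr)
    return backup_is_usable(proc.returncode, proc.stderr)
```

A cron script could call run_backup(["pg_basebackup", "-D",
"/path/to/backup", "--xlog-method=stream"]) (the target directory is a
placeholder) and discard or flag the backup when it returns False. This only
detects the failure; it does not make pg_basebackup reconnect.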

I haven't had time to try 9.6's --slot option yet, but I suspect this won't
be a full cure either unless it also changes the re-connect behavior.

Best regards,
Jürgen Strobel

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Michael Paquier
michael@paquier.xyz
In reply to: Jürgen Strobel (#1)
Re: BUG #14321: pg_basebackup --xlog-method=stream fails

On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql@strobel.info> wrote:

> The filesystem backup continues successfully to its end, but it concludes
> without the necessary WAL files. I verified in pg_stat_replication that
> pg_basebackup is not trying to reconnect to the master.

> I understand how to repair this manually and it's not an end-of-the-world
> bug, but it would be nice if pg_basebackup would just reconnect the
> streaming WAL connection in the same way as pg_receivexlog does. Especially
> as that error happens in a long script run by cron and/or other people who
> do not have this insight.

Perhaps. The source server logs do show that pg_basebackup is requesting
the missing WAL segments, right?

> I haven't had time to try 9.6's --slot option yet, but I suspect this won't
> be a full cure either unless it also changes the re-connect behavior.

If what you are missing are the first WAL segments that your backup
needs, then the backup you took will be useless unless you have a WAL
archive from which recovery can fetch those segments. In that case
--slot will definitely help, but make sure that it does not bloat your
pg_xlog partition if disk space is a concern there.
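
To keep an eye on that bloat, the slot's restart_lsn from
pg_replication_slots can be compared with the server's current WAL
position (pg_current_xlog_location() on these versions). A minimal sketch
of the arithmetic follows; the function names are hypothetical, and the
result is only approximate because WAL is kept on disk in whole segment
files.

```python
def lsn_to_int(lsn):
    """Convert an LSN in 'X/Y' hexadecimal text form to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Approximate number of WAL bytes held back for a slot: the distance
    from the slot's restart_lsn to the current WAL position."""
    return max(0, lsn_to_int(current_lsn) - lsn_to_int(restart_lsn))
```

For example, retained_wal_bytes("0/3000000", "0/1000000") gives 33554432,
that is 32 MB of WAL that cannot be recycled yet.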
--
Michael
