BUG #14321: pg_basebackup --xlog-method=stream fails

Started by Jürgen Strobel over 9 years ago (3 messages, bugs)
#1 Jürgen Strobel
juergen+postgresql@strobel.info

On 10 September 2016 at 00:09, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql@strobel.info> wrote:

The filesystem backup continues successfully to its end, but it concludes
without the necessary WAL files. I verified in pg_stat_replication that
pg_basebackup is not trying to reconnect to the master.

I understand how to repair this manually and it's not an end-of-the-world
bug, but it would be nice if pg_basebackup would just reconnect the
streaming WAL connection in the same way as pg_receivexlog does.

Especially as that error happens in a long script run by cron and/or other
people who do not have this insight.

Perhaps. The source server logs do prove the fact that pg_basebackup
is requesting missing WAL segments, right?

I haven't had time to try 9.6's --slot option yet, but I suspect this won't
be a full cure either unless it also changes the re-connect behavior.

If what you are seeing missing are the first WAL segments that your
backup needs, first the backup you took will be useless if you don't
have a WAL archive from where recovery could fetch those missing
segments. And in this case --slot will definitely help, but just be
sure that this does not bloat your pg_xlog partition if disk space is
a concern there.
--
Michael

First, I do have another WAL archive (usually).

But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.
3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy,
*but makes no effort to re-open the WAL streaming connection*. With ps I
see a zombie child of the pg_basebackup process; I assume that's the one
doing the WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it
waits for 5 seconds then reopens the connection. I think that pg_basebackup
should show the same resilience.
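
The reconnect behavior described above (wait a fixed delay, then reopen the connection) can be sketched roughly as follows. This is only an illustration of the pattern, not code from pg_receivexlog or pg_basebackup: `stream_with_retry`, `stream_once`, `retry_delay` and `max_attempts` are hypothetical names, and the 5-second default simply mirrors the delay mentioned in the message.

```python
import time
from typing import Callable


def stream_with_retry(
    stream_once: Callable[[], None],
    retry_delay: float = 5.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call stream_once until it completes; return the number of attempts.

    stream_once is a hypothetical callable that opens a connection and
    streams until done. It is assumed to raise ConnectionError when the
    connection is lost (for example after a server-side timeout).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            stream_once()
            return attempt
        except ConnectionError:
            # Connection dropped: wait, then reopen on the next iteration.
            if attempt < max_attempts:
                sleep(retry_delay)
    # Give up loudly instead of finishing with an incomplete result.
    raise RuntimeError(
        f"streaming did not complete after {max_attempts} attempts"
    )
```

The `sleep` parameter exists only so the delay can be replaced in tests; a bounded number of attempts keeps the loop from hiding a permanent failure.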

-Jürgen

#2 Michael Paquier
michael@paquier.xyz
In reply to: Jürgen Strobel (#1)
Re: BUG #14321: pg_basebackup --xlog-method=stream fails

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:

First, I do have another WAL archive (usually).
But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.

I know that people can do fancy things here, believe me.

3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy, *but
makes no effort to re-open the WAL streaming connection*. With ps I see a
zombie child of the pg_basebackup process; I assume that's the one doing the
WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it waits
for 5 seconds then reopens the connection. I think that pg_basebackup should
show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But it's hard to think of improvements in this area
as anything other than a new feature, rather than a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael


#3 Jürgen Strobel
juergen+postgresql@strobel.info
In reply to: Michael Paquier (#2)
Re: BUG #14321: pg_basebackup --xlog-method=stream fails

On 10 September 2016 at 07:30, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:

First, I do have another WAL archive (usually).
But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.

I know that people can do fancy things here, believe me.

3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy, *but
makes no effort to re-open the WAL streaming connection*. With ps I see a
zombie child of the pg_basebackup process; I assume that's the one doing the
WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it waits
for 5 seconds then reopens the connection. I think that pg_basebackup should
show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But it's hard to think of improvements in this area
as anything other than a new feature, rather than a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael

I do agree the VM is bad, but I have to work with what I got now.

I do not agree it's a pure feature request though. When this problem
happens pg_basebackup should either abort fully with a suitable error, or
retry streaming WAL until it has everything it needs for a functional
backup (or streaming fails due to WAL cleanup on the server). The current
behavior of finishing the filesystem backup with a mere warning is
inconsistent and not user friendly. If I use --xlog-method=stream I expect
to end up with all WAL in the end or to get a clear error. It took me quite
some time to figure out what's happening. And of course this never happened
in QA/staging systems, only in production.
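
Until the tool itself behaves that way, a cron script could enforce the "clear error" half of this from the outside. The following is a minimal sketch under stated assumptions, not a recommended or official procedure: `run_strict` is a hypothetical helper, and it assumes the warning is written to stderr. Since the exact warning text is not quoted in this thread, the sketch does not match on any message and instead treats any stderr output as a failure.

```python
import subprocess
from typing import Sequence


def run_strict(cmd: Sequence[str]) -> None:
    """Run cmd and raise unless it exits 0 AND writes nothing to stderr.

    Assumption: a warning printed by the wrapped tool goes to stderr.
    No specific message text is matched, because none is known here.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    stderr = result.stderr.strip()
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}: {stderr}"
        )
    if stderr:
        # Exit status was 0, but something was reported; do not trust
        # the result without a human looking at it.
        raise RuntimeError(f"{cmd[0]} succeeded but reported: {stderr}")
```

A call might look like `run_strict(["pg_basebackup", "-D", "/path/to/backup", "--xlog-method=stream"])`, where the target directory is a placeholder. The check is deliberately strict: it would also trip on harmless informational output written to stderr, so it trades false alarms for never silently accepting a backup that came with a warning.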

I understand that this may not affect many people, and that it's not going
to get immediate attention; classify it as you wish.

The replication slot feature might make it easier for me to recover from
the problem using pg_receivexlog afterwards.

-Jürgen