BUG #14321: pg_basebackup --xlog-method=stream fails

Started by Jürgen Strobel, over 9 years ago, 3 messages, bugs

juergen+postgresql@strobel.info

over 9 years ago

On 10 September 2016 at 00:09, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql@strobel.info> wrote:

The filesystem backup continues successfully to its end, but it concludes
without the necessary WAL files. I verified in pg_stat_replication that
pg_basebackup is not trying to reconnect to the master.

I understand how to repair this manually and it's not an end-of-the-world
bug, but it would be nice if pg_basebackup would just reconnect the
streaming WAL connection in the same way as pg_receivexlog does.

Especially as that error happens in a long script run by cron and/or other
people who do not have this insight.

Perhaps. The source server logs do show that pg_basebackup
is requesting missing WAL segments, right?

I haven't had time to try 9.6's --slot option yet, but I suspect this won't
be a full cure either unless it also changes the re-connect behavior.

If what you are seeing missing are the first WAL segments that your
backup needs, then the backup you took will be useless unless you
have a WAL archive from which recovery can fetch those missing
segments. In that case --slot will definitely help, but just be
sure that this does not bloat your pg_xlog partition if disk space is
a concern there.
--
Michael

First, I do have another WAL archive (usually).

But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.
3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy,
*but makes no effort to re-open the WAL streaming connection*. With ps I
see a zombie child of the pg_basebackup process, I assume that's the one doing
the WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it
waits for 5 seconds and then reopens the connection. I think that pg_basebackup
should show the same resilience.
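
To illustrate the behavior I am asking for, here is a minimal sketch of a
fixed-delay reconnect loop. It is not the actual pg_receivexlog or
pg_basebackup code. The callable stream_once, the max_attempts limit and the
injectable sleep are hypothetical names of mine; only the 5 second delay comes
from what I observed.

```python
import time
from typing import Callable


def stream_with_reconnect(stream_once: Callable[[], bool],
                          retry_delay: float = 5.0,
                          max_attempts: int = 10,
                          sleep: Callable[[float], None] = time.sleep) -> int:
    """Call stream_once until it reports that streaming completed.

    stream_once is a hypothetical stand-in for "open the WAL streaming
    connection and receive data". It returns True once everything needed
    has been received and raises ConnectionError if the connection drops.

    Returns the number of attempts used, or raises RuntimeError when
    max_attempts is exhausted, so the caller gets a clear error instead
    of an incomplete backup.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if stream_once():
                return attempt
        except ConnectionError:
            # Connection lost (for example after a server-side timeout).
            pass
        if attempt < max_attempts:
            # Wait a fixed delay before reconnecting.
            sleep(retry_delay)
    raise RuntimeError("WAL streaming failed after %d attempts" % max_attempts)
```

The point of the sketch is that both outcomes are explicit: either streaming
eventually completes, or the caller gets an exception it cannot overlook.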

-Jürgen

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Jürgen Strobel (#1)

Re: BUG #14321: pg_basebackup --xlog-method=stream fails

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:

First, I do have another WAL archive (usually).
But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.

I know that people can do fancy things here, believe me.

3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy, *but
makes no effort to re-open the WAL streaming connection*. With ps I see a
zombie child of the pg_basebackup process, I assume that's the one doing the
WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it waits
for 5 seconds and then reopens the connection. I think that pg_basebackup should
show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But improvements in this area are hard to see as
anything other than a new feature, not a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael


Jürgen Strobel

juergen+postgresql@strobel.info

over 9 years ago

In reply to: Michael Paquier (#2)

Re: BUG #14321: pg_basebackup --xlog-method=stream fails

On 10 September 2016 at 07:30, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
<juergen+postgresql@strobel.info> wrote:

First, I do have another WAL archive (usually).
But no, I only see the first WAL segments up to the point when the problem
occurs, then nothing more.

The timeline as far as I can tell is:

1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy IO system hiccups and stalls the whole VM for a
surprisingly long time.

I know that people can do fancy things here, believe me.

3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning, and continues the filesystem copy, *but
makes no effort to re-open the WAL streaming connection*. With ps I see a
zombie child of the pg_basebackup process, I assume that's the one doing the
WAL streaming.
5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
DB fails to start.

In contrast, if the same problem occurs while running pg_receivexlog, it waits
for 5 seconds and then reopens the connection. I think that pg_basebackup should
show the same resilience.

You can blame your VM here to begin with :(
Even with the default values of pg_basebackup --status-interval and
wal_sender_timeout on the server there is enough margin to prevent
things from getting killed, but if things get heavily constrained on I/O...
Well, there is not much that any software could do... Now I agree that
there would be room for improvement to make pg_basebackup retry a
stream instead of failing, and that may be something that people would
be willing to have. But improvements in this area are hard to see as
anything other than a new feature, not a bug fix.

Anyway, replication slots would not help here if you just rely on
pg_basebackup to finish the job.
--
Michael

I do agree the VM is bad, but I have to work with what I got now.

I do not agree it's a pure feature request though. When this problem
happens, pg_basebackup should either abort fully with a suitable error, or
retry streaming WAL until it has everything it needs for a functional
backup (or streaming fails due to WAL cleanup on the server). The current
behavior of finishing the filesystem backup with a mere warning is
inconsistent and not user friendly. If I use --xlog-method=stream I expect
to end up with all WAL in the end or to get a clear error. It took me quite
some time to figure out what's happening. And of course this never happened
in QA/staging systems, only in production.
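
Until then, a cron script can enforce the "all WAL or a clear error" rule
itself. Below is a minimal sketch of such a wrapper. It assumes, without
having checked the pg_basebackup source, that the warning is written to stderr
and that the exit status alone may not reveal the problem. The function name
run_strict is my own.

```python
import subprocess
from typing import List, Tuple


def run_strict(cmd: List[str]) -> Tuple[bool, str]:
    """Run cmd and treat any stderr output as a failure.

    Returns (ok, stderr_text). ok is True only if the exit status is 0
    AND nothing was written to stderr. Assumption: the command is run
    without options that print progress or other benign messages to
    stderr, otherwise a filter on the stderr text would be needed.
    """
    proc = subprocess.run(cmd,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE,
                          universal_newlines=True)
    stderr_text = proc.stderr.strip()
    ok = proc.returncode == 0 and stderr_text == ""
    return ok, stderr_text


# Hypothetical usage in a backup script (the target path is a placeholder):
#   ok, err = run_strict(["pg_basebackup", "-D", "/path/to/backup",
#                         "--xlog-method=stream"])
#   if not ok:
#       raise SystemExit("backup unusable: " + err)
```

This does not make the backup complete, it only ensures that a backup taken
while the WAL stream was lost is discarded rather than trusted.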

I understand that this may not affect many people, and that it's not going
to get immediate attention. Classify it as you wish.

The replication slot feature might make it easier for me to recover from
the problem using pg_receivexlog afterwards.

-Jürgen