pg_basebackup, walreceiver and wal_sender_timeout

Started by Nick Balmost 7 years ago13 messages
#1Nick B
nbedxp@gmail.com

Hi, hackers.

When running pg_basebackup with -X s with network file system as target we
would consistently get "could not receive data from WAL stream: server
closed the connection unexpectedly".

On server we see this error firing: "terminating walsender process due to
replication timeout"
The problem occurs during a network or file system acting very slow. One
example of such case looks like this (strace output for fsync calls):

0.033383 fsync(8) = 0 <20.265275>
20.265399 fsync(8) = 0 <0.000011>
0.022892 fsync(7) = 0 <48.350566>
48.350654 fsync(7) = 0 <0.000005>
0.000674 fsync(8) = 0 <0.851536>
0.851619 fsync(8) = 0 <0.000007>
0.000067 fsync(7) = 0 <0.000006>
0.000045 fsync(7) = 0 <0.000005>
0.031733 fsync(8) = 0 <0.826957>
0.827869 fsync(8) = 0 <0.000016>
0.005344 fsync(7) = 0 <1.437103>
1.446450 fsync(6) = 0 <0.063148>
0.063246 fsync(6) = 0 <0.000006>
0.000381 +++ exited with 1 +++

So the longest fsync call took 48 seconds, but how would that result in a
termination if wal_sender_timeout is (default) 60 seconds?

The problem is in the way wal_sender handles this timeout:

/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
*/

Obviously the receiver cannot respond immediately while in a syscall.

This begs a question, why is the GUC handled the way it is? What would be
the correct way to solve this? Shall we change the behaviour or do a better
job explaining what are implications of wal_sender_timeout in the docs?

Regards,
Nick.

#2Michael Paquier
michael@paquier.xyz
In reply to: Nick B (#1)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Fri, Jan 25, 2019 at 03:26:38PM +0100, Nick B wrote:

On server we see this error firing: "terminating walsender process due to
replication timeout"
The problem occurs during a network or file system acting very slow. One
example of such case looks like this (strace output for fsync calls):

0.033383 fsync(8) = 0 <20.265275>
20.265399 fsync(8) = 0 <0.000011>
0.022892 fsync(7) = 0 <48.350566>
48.350654 fsync(7) = 0 <0.000005>
0.000674 fsync(8) = 0 <0.851536>
0.851619 fsync(8) = 0 <0.000007>
0.000067 fsync(7) = 0 <0.000006>
0.000045 fsync(7) = 0 <0.000005>
0.031733 fsync(8) = 0 <0.826957>
0.827869 fsync(8) = 0 <0.000016>
0.005344 fsync(7) = 0 <1.437103>
1.446450 fsync(6) = 0 <0.063148>
0.063246 fsync(6) = 0 <0.000006>
0.000381 +++ exited with 1 +++

These are a bit unregular. Which files are taking that long to
complete while others are way faster? It may be something that we
could improve on the base backup side as there is no actual point in
syncing segments while the backup is running and we could delay that
at the end of the backup (if I recall that stuff correctly).

This begs a question, why is the GUC handled the way it is? What would be
the correct way to solve this? Shall we change the behaviour or do a better
job explaining what are implications of wal_sender_timeout in the
docs?

The following commit and thread are the ones you look for here:
/messages/by-id/506972B9.6060104@vmware.com

commit: 6f60fdd7015b032bf49273c99f80913d57eac284
committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Thu, 11 Oct 2012 17:48:08 +0300
Improve replication connection timeouts.

Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval
to a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the
other end.

The docs could be improved to describe that better..
--
Michael

#3Nick B
nbedxp@gmail.com
In reply to: Michael Paquier (#2)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael@paquier.xyz> wrote:

These are a bit unregular. Which files are taking that long to
complete while others are way faster? It may be something that we
could improve on the base backup side as there is no actual point in
syncing segments while the backup is running and we could delay that
at the end of the backup (if I recall that stuff correctly).

I don't have a good sample for these. One instance of this happening is below:
....
0.000125 fsync(7) = 0 <0.016677>
0.000039 fsync(7) = 0 <0.000005>
# 2048 writes for total of 16777216 bytes (16MB)
0.000618 write(7,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192 <0.000021>
0.000078 fsync(8) = 0 <57.609720>
57.609830 fsync(8) = 0 <0.000007>

Again, it is a problem with our network file system that we are still
investigating. I'm not sure this can be improved easily, since
pg_basebackup shares this code with walreceiver.

The docs could be improved to describe that better..

I will look into that.

Regards,
Nick.

#4Magnus Hagander
magnus@hagander.net
In reply to: Nick B (#3)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Sat, Jan 26, 2019 at 1:35 PM Nick B <nbedxp@gmail.com> wrote:

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier <michael@paquier.xyz>
wrote:

These are a bit unregular. Which files are taking that long to
complete while others are way faster? It may be something that we
could improve on the base backup side as there is no actual point in
syncing segments while the backup is running and we could delay that
at the end of the backup (if I recall that stuff correctly).

I don't have a good sample for these. One instance of this happening is
below:
....
0.000125 fsync(7) = 0 <0.016677>
0.000039 fsync(7) = 0 <0.000005>
# 2048 writes for total of 16777216 bytes (16MB)
0.000618 write(7,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192 <0.000021>
0.000078 fsync(8) = 0 <57.609720>
57.609830 fsync(8) = 0 <0.000007>

Again, it is a problem with our network file system that we are still
investigating. I'm not sure this can be improved easily, since
pg_basebackup shares this code with walreceiver.

One workaround you could perhaps look at here is to run pg_basebackup
with --no-sync. That way there will be no fsyncs issued while running. You
will then of course have to take care of syncing all the files to disk
after it's done, but a network filesystem might be happier in dealing with
a large "batch-sync" like that rather than piece-by-piece sync.

(yes, I realize that wasn't your original question, just wanted to make
sure it was a workaround you had considered)

//Magnus

#5Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#4)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:

One workaround you could perhaps look at here is to run pg_basebackup
with --no-sync. That way there will be no fsyncs issued while running. You
will then of course have to take care of syncing all the files to disk
after it's done, but a network filesystem might be happier in dealing with
a large "batch-sync" like that rather than piece-by-piece sync.

Hm. Aren't we actually wrong in letting the WAL receive method use
the value of do_sync depending on the command line arguments, with
true being the default for pg_basebackup? In plain format, we flush
the full data directory anyway when the backup ends. In tar format,
each individual tar file is flushed one-by-one after being received,
and we issue a final sync on the parent directory at the end. So
what's missing is just to make sure that the fully generated
pg_wal.tar is synced once completed. This would be way cheaper than
letting the stream process issue syncs for each segments, which does
not matter much in the event of a host crash because the base backup
may finish in an inconsistent state, and one should not use it.
--
Michael

#6Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#5)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Sun, Jan 27, 2019 at 1:59 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:

One workaround you could perhaps look at here is to run pg_basebackup
with --no-sync. That way there will be no fsyncs issued while running.

You

will then of course have to take care of syncing all the files to disk
after it's done, but a network filesystem might be happier in dealing

with

a large "batch-sync" like that rather than piece-by-piece sync.

Hm. Aren't we actually wrong in letting the WAL receive method use
the value of do_sync depending on the command line arguments, with
true being the default for pg_basebackup? In plain format, we flush
the full data directory anyway when the backup ends. In tar format,
each individual tar file is flushed one-by-one after being received,
and we issue a final sync on the parent directory at the end. So
what's missing is just to make sure that the fully generated
pg_wal.tar is synced once completed. This would be way cheaper than
letting the stream process issue syncs for each segments, which does
not matter much in the event of a host crash because the base backup
may finish in an inconsistent state, and one should not use it.

Yeah, that could be done without giving up any of the guarantees -- we only
give the guarantee at the end of the completed backup. I wouldn't
necessarily say we're wrong now, but it could definitely be a nice
performance improvement.

And for plain format, we'd do the same -- sync after each file segment, and
then a final one of the directory when done, right?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#7Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#6)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote:

Yeah, that could be done without giving up any of the guarantees -- we only
give the guarantee at the end of the completed backup. I wouldn't
necessarily say we're wrong now, but it could definitely be a nice
performance improvement.

The code ensures durability in its current shape, and does more than
it actually needs to.

And for plain format, we'd do the same -- sync after each file segment, and
then a final one of the directory when done, right?

Well, the code is doing a double amount of work in its current shape
as we call fsync_pgdata() for the plain format, which cascades to
pg_wal and all its files, so it seems to me that there is little point
in issuing a sync when each segment is finished streaming if that's
what you mean.
--
Michael

#8Alex Kliukin
alexk@hintbits.com
In reply to: Michael Paquier (#7)
1 attachment(s)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Mon, Jan 28, 2019, at 10:25, Michael Paquier wrote:

On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote:

And for plain format, we'd do the same -- sync after each file segment, and
then a final one of the directory when done, right?

Well, the code is doing a double amount of work in its current shape
as we call fsync_pgdata() for the plain format, which cascades to
pg_wal and all its files, so it seems to me that there is little point
in issuing a sync when each segment is finished streaming if that's
what you mean.

Agreed.

While reading the doc page for the pg_basebackup, I've been confused by the fact that it says WAL files will be written to .tarballs (either base.tar or pg_wal.tar) when pg_basebackup is instructed to stream WALs alongside the backup itself. I think it makes sense to elaborate that it only happens when tar format is specified (doc patch is attached).

Cheers,
Oleksii

Attachments:

pg_basebackup_wal_method_clarification_v1.patchtext/plain; name=pg_basebackup_wal_method_clarification_v1.patchDownload
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 57dc83b620..c4f3950e5b 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -306,8 +306,8 @@ PostgreSQL documentation
              backup will fail and be unusable.
            </para>
            <para>
-            The write-ahead log files will be written to
-            the <filename>base.tar</filename> file.
+            When tar format mode is used, the write-ahead log files will be
+            written to the <filename>base.tar</filename> file.
            </para>
           </listitem>
          </varlistentry>
@@ -326,9 +326,10 @@ PostgreSQL documentation
              requires no extra write-ahead logs to be saved on the master.
            </para>
            <para>
-            The write-ahead log files are written to a separate file
-            named <filename>pg_wal.tar</filename> (if the server is a version
-            earlier than 10, the file will be named <filename>pg_xlog.tar</filename>).
+            When tar format mode is used, the write-ahead log files will be
+            written to a separate file named <filename>pg_wal.tar</filename> 
+            (if the server is a version earlier than 10, the file will be named
+            <filename>pg_xlog.tar</filename>).
            </para>
            <para>
             This value is the default.
#9Michael Paquier
michael@paquier.xyz
In reply to: Alex Kliukin (#8)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:

While reading the doc page for the pg_basebackup, I've been confused
by the fact that it says WAL files will be written to .tarballs
(either base.tar or pg_wal.tar) when pg_basebackup is instructed to
stream WALs alongside the backup itself. I think it makes sense to
elaborate that it only happens when tar format is specified (doc
patch is attached).

Agreed. The current wording can be confusing depending on the format,
and your suggestion looks right. Any opinions from others?
--
Michael

#10Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#9)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Tue, Jan 29, 2019 at 6:19 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:

While reading the doc page for the pg_basebackup, I've been confused
by the fact that it says WAL files will be written to .tarballs
(either base.tar or pg_wal.tar) when pg_basebackup is instructed to
stream WALs alongside the backup itself. I think it makes sense to
elaborate that it only happens when tar format is specified (doc
patch is attached).

Agreed. The current wording can be confusing depending on the format,
and your suggestion looks right. Any opinions from others?

Agreed, definitely confusing.

Since you also agreed on it, I went ahead and pushed (with backpatch).

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#11Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#10)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On Tue, Jan 29, 2019 at 10:45:34AM +0100, Magnus Hagander wrote:

Since you also agreed on it, I went ahead and pushed (with backpatch).

Thanks for taking care of it, Magnus.
--
Michael

#12Oleksii Kliukin
alexk@hintbits.com
In reply to: Magnus Hagander (#10)
Re: pg_basebackup, walreceiver and wal_sender_timeout

On 29. Jan 2019, at 10:45, Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Jan 29, 2019 at 6:19 AM Michael Paquier <michael@paquier.xyz <mailto:michael@paquier.xyz>> wrote:
On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:

While reading the doc page for the pg_basebackup, I've been confused
by the fact that it says WAL files will be written to .tarballs
(either base.tar or pg_wal.tar) when pg_basebackup is instructed to
stream WALs alongside the backup itself. I think it makes sense to
elaborate that it only happens when tar format is specified (doc
patch is attached).

Agreed. The current wording can be confusing depending on the format,
and your suggestion looks right. Any opinions from others?

Agreed, definitely confusing.

Since you also agreed on it, I went ahead and pushed (with backpatch).

Thanks a lot (and to Michael as well for looking into it)!

Cheers,
Oleksii

#13Nick B
nbedxp@gmail.com
In reply to: Oleksii Kliukin (#12)
Re: pg_basebackup, walreceiver and wal_sender_timeout

Greetings,
I also would like to thank everyone for looking into this.

On Sat, Jan 26, 2019 at 01:45:46PM +0100, Magnus Hagander wrote:

One workaround you could perhaps look at here is to run pg_basebackup
with --no-sync. That way there will be no fsyncs issued while running. You
will then of course have to take care of syncing all the files to disk
after it's done, but a network filesystem might be happier in dealing with
a large "batch-sync" like that rather than piece-by-piece sync.

Thanks for the pointer. I actually was not aware of the existence of this
flag. I've ran two rounds of tests with --no-sync and backup failed at a
much later point in time, which suggests that the bottleneck is in fact the
metadata server of ceph. We're now looking into ways of improving this.
(This is a 15TB cluster with a few hundred thousands tables which on
average generates 4 WAL segments per second, so throttling transfer rate is
not a good option either).

On Sat, Jan 26, 2019 at 4:23 AM Michael Paquier
<michael(at)paquier(dot)xyz> wrote:

The docs could be improved to describe that better..

I had an off-list discussion of a possible documentation update with
Stephen Frost and he voiced an opinion that the behaviour I was trying to
describe sounds a lot like a bug and documenting that is not a good
practice.

Upon further examination of WalSndKeepaliveIfNecessary I found out that the
implementation of "requesting an immediate reply" is done by setting the
socket into non-blocking mode and issuing a flush. I find it hard to
believe there is a scenario where client can react to that keep-alive on
time (unless of course I misunderstood something). So the question is, will
we ever wait the actual wal_sender_timeout before terminating the
connection?

Regards,
Nick.