pg_rewind docs correction

Started by James Colemanover 6 years ago17 messages
#1James Coleman
jtc331@gmail.com
1 attachment(s)

The pg_rewind docs assert that the state of the target's data directory
after rewind is equivalent to the source's data directory. But that
isn't true both because the base state is further back in time and
because the target's data directory will include the current state on
the source of any copied blocks.

So I've attached a patch to summarize more correctly as well as
document clearly the state of the cluster after the operation and also
the operation sequencing dangers caused by copying configuration
files from the source.

James Coleman

Attachments:

001_pg_rewind_explanation_doc_fixes.patchapplication/octet-stream; name=001_pg_rewind_explanation_doc_fixes.patchDownload
commit 5ccb2ae150b68ab2231b94474b68d16ecbaf90a8
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Fri Sep 13 17:28:53 2019 +0000

    Correct pg_rewind explanation
    
    The pg_rewind docs assert that the state of the target's data directory
    after rewind is equivalent to the source's data directory. But that
    isn't true both because the base state is further back in time and
    because the target's data directory will include the current state on
    the source of any copied blocks.
    
    In additional document clearly the state of the cluster after the
    operation and also the operation sequencing dangers caused by copying
    configuration files from the source.

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index ac142d22fc..e3aeb529de 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,17 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind, the target data directory is equivalent to the
+   to the state of the data directory at the point at which the source and
+   target diverged plus the current state on the source of any blocks changed
+   on the target after that divergence. While only changed blocks from relation
+   files are copied; all other files are copied in full, including configuration
+   files and WAL segments. The advantage of <application>pg_rewind</application>
+   over taking a new base backup, or tools like <application>rsync</application>,
+   is that <application>pg_rewind</application> does not require comparing or 
+   copying unchanged relation blocks in the cluster. As such the rewind operation
+   is significantly faster than other approaches when the database is large and
+   only a small fraction of blocks differ between the clusters.
   </para>
 
   <para>
@@ -116,6 +119,23 @@ PostgreSQL documentation
     be necessary to remove the data copied and restore back the set of links
     used before the rewind.
    </para>
+
+   <para>
+    The result of the rewind operation is not expected to be a consistent
+    data directory state either internally to the node or with respect to
+    the rest of the cluster. Instead the resulting data directory will only
+    be consistent after WAL replay has completed to at least the LSN at which
+    changed blocks copied from the source were originally written on the source.
+   </para>
+   
+  <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>
   </warning>
  </refsect1>
 
#2Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#1)
Re: pg_rewind docs correction

On Fri, Sep 13, 2019 at 01:47:03PM -0400, James Coleman wrote:

So I've attached a patch to summarize more correctly as well as
document clearly the state of the cluster after the operation and also
the operation sequencing dangers caused by copying configuration
files from the source.

+   After a successful rewind, the target data directory is equivalent
to the
+   to the state of the data directory at the point at which the
source and
+   target diverged plus the current state on the source of any blocks
changed
+   on the target after that divergence. While only changed blocks
from relation
+   files are copied; all other files are copied in full, including
configuration
+   files and WAL segments. The advantage of
<application>pg_rewind</application>
+   over taking a new base backup, or tools like
<application>rsync</application>,
+   is that <application>pg_rewind</application> does not require
comparing or
+   copying unchanged relation blocks in the cluster. As such the
rewind operation
+   is significantly faster than other approaches when the database is
large and
+   only a small fraction of blocks differ between the clusters.

The point of divergence could be defined as the LSN position where WAL
has forked on the new timeline, but the block diffs are copied from
actually the last checkpoint just before WAL has forked. So this new
paragraph brings confusion about the actual divergence point.

Regarding the relation files, if the file does not exist on the target
but does exist on the source, it is also copied fully, so the second
sentence is wrong here to mention as relation files could also be
copied fully.
--
Michael

#3James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: pg_rewind docs correction

On Sat, Sep 14, 2019 at 12:20 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 13, 2019 at 01:47:03PM -0400, James Coleman wrote:

So I've attached a patch to summarize more correctly as well as
document clearly the state of the cluster after the operation and also
the operation sequencing dangers caused by copying configuration
files from the source.

+   After a successful rewind, the target data directory is equivalent
to the
+   to the state of the data directory at the point at which the
source and
+   target diverged plus the current state on the source of any blocks
changed
+   on the target after that divergence. While only changed blocks
from relation
+   files are copied; all other files are copied in full, including
configuration
+   files and WAL segments. The advantage of
<application>pg_rewind</application>
+   over taking a new base backup, or tools like
<application>rsync</application>,
+   is that <application>pg_rewind</application> does not require
comparing or
+   copying unchanged relation blocks in the cluster. As such the
rewind operation
+   is significantly faster than other approaches when the database is
large and
+   only a small fraction of blocks differ between the clusters.

The point of divergence could be defined as the LSN position where WAL
has forked on the new timeline, but the block diffs are copied from
actually the last checkpoint just before WAL has forked. So this new
paragraph brings confusion about the actual divergence point.

Regarding the relation files, if the file does not exist on the target
but does exist on the source, it is also copied fully, so the second
sentence is wrong here to mention as relation files could also be
copied fully.

Updated (plus some additional wordsmithing).

James Coleman

Attachments:

001_pg_rewind_explanation_doc_fixes_v2.patchapplication/octet-stream; name=001_pg_rewind_explanation_doc_fixes_v2.patchDownload
commit a56be36d9342b3913d225af6e10e5fbc86662789
Author: jcoleman <james.coleman@getbraintree.com>
Date:   Fri Sep 13 17:28:53 2019 +0000

    Correct pg_rewind explanation
    
    The pg_rewind docs assert that the state of the target's data directory
    after rewind is equivalent to the source's data directory. But that
    isn't true both because the base state is further back in time and
    because the target's data directory will include the current state on
    the source of any copied blocks.
    
    In additional document clearly the state of the cluster after the
    operation and also the operation sequencing dangers caused by copying
    configuration files from the source.

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index ac142d22fc..05c23ae720 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,19 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind, the target data directory is equivalent to the
+   to the state of the data directory at the last checkpoint completed prior to
+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are
+   copied; all other files are copied in full, including new relation files,
+   configuration files and WAL segments. The advantage of
+   <application>pg_rewind</application> over taking a new base backup, or tools
+   like <application>rsync</application>, is that
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. As such the rewind operation is
+   significantly faster than other approaches when the database is large and
+   only a small fraction of blocks differ between the clusters.
   </para>
 
   <para>
@@ -116,6 +121,23 @@ PostgreSQL documentation
     be necessary to remove the data copied and restore back the set of links
     used before the rewind.
    </para>
+
+   <para>
+    The rewind operation is not expected to result in a consistent data
+    directory state either internally to the node or with respect to the rest
+    of the cluster. Instead the resulting data directory will only be consistent
+    after WAL replay has completed to at least the LSN at which changed blocks
+    copied from the source were originally written on the source.
+   </para>
+   
+  <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>
   </warning>
  </refsect1>
 
#4Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#3)
Re: pg_rewind docs correction

On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote:

Updated (plus some additional wordsmithing).

+    The rewind operation is not expected to result in a consistent data
+    directory state either internally to the node or with respect to the rest
+    of the cluster. Instead the resulting data directory will only be consistent
+    after WAL replay has completed to at least the LSN at which changed blocks
+    copied from the source were originally written on the source.

That's not necessarily true. pg_rewind enforces in the control file
of the target the minimum consistency LSN to be
pg_current_wal_insert_lsn() when using a live source or the last
checkpoint LSN for a stopped source, so while that sounds true from
the point of view of all the blocks copied, the control file may still
cause a complain that the target recovering has not reached its
consistent point even if all the blocks are already at a position
not-so-far from what has been registered in the control file.

+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are

And here we could mention that all the blocks copied from the source
are the ones which are found in the WAL records of the target until
the end of WAL of its timeline. Still, that's basically what is
mentioned in the first part of "How It Works", which explains things
better. I honestly don't really see that all this paragraph is an
improvement over the simplicity of the original when it comes to
understand the global idea of what pg_rewind does.

+  <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>

No objections regarding that part. Now it seems to me that we had
better apply that to the last part of "How it works" instead? I kind
of agree that the last paragraph could provide more details regarding
the risks of overwriting the wanted configuration. The existing docs
also mention that pg_rewind only creates a backup_label file to start
recovery, perhaps we could mention up to which point recovery happens
in this section? There is a bit more here than just "apply the WAL".
--
Michael

#5James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_rewind docs correction

On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote:

Updated (plus some additional wordsmithing).

+    The rewind operation is not expected to result in a consistent data
+    directory state either internally to the node or with respect to the rest
+    of the cluster. Instead the resulting data directory will only be consistent
+    after WAL replay has completed to at least the LSN at which changed blocks
+    copied from the source were originally written on the source.

That's not necessarily true. pg_rewind enforces in the control file
of the target the minimum consistency LSN to be
pg_current_wal_insert_lsn() when using a live source or the last
checkpoint LSN for a stopped source, so while that sounds true from
the point of view of all the blocks copied, the control file may still
cause a complain that the target recovering has not reached its
consistent point even if all the blocks are already at a position
not-so-far from what has been registered in the control file.

I could just say "after WAL replay has completed to a consistent state"?

+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are

And here we could mention that all the blocks copied from the source
are the ones which are found in the WAL records of the target until
the end of WAL of its timeline. Still, that's basically what is
mentioned in the first part of "How It Works", which explains things
better. I honestly don't really see that all this paragraph is an
improvement over the simplicity of the original when it comes to
understand the global idea of what pg_rewind does.

The problem with the original is that while simple, it's actually
incorrect in that simplicity. Pg_rewind does *not* result in the data
directory on the target matching the data directory on the source.

+  <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>

No objections regarding that part. Now it seems to me that we had
better apply that to the last part of "How it works" instead? I kind
of agree that the last paragraph could provide more details regarding
the risks of overwriting the wanted configuration. The existing docs
also mention that pg_rewind only creates a backup_label file to start
recovery, perhaps we could mention up to which point recovery happens
in this section? There is a bit more here than just "apply the WAL".

I'll look to see if there's a better place to put this.

James Coleman

#6Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#5)
Re: pg_rewind docs correction

On Sun, Sep 15, 2019 at 10:36:04AM -0400, James Coleman wrote:

On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier <michael@paquier.xyz> wrote:

+    The rewind operation is not expected to result in a consistent data
+    directory state either internally to the node or with respect to the rest
+    of the cluster. Instead the resulting data directory will only be consistent
+    after WAL replay has completed to at least the LSN at which changed blocks
+    copied from the source were originally written on the source.

That's not necessarily true. pg_rewind enforces in the control file
of the target the minimum consistency LSN to be
pg_current_wal_insert_lsn() when using a live source or the last
checkpoint LSN for a stopped source, so while that sounds true from
the point of view of all the blocks copied, the control file may still
cause a complain that the target recovering has not reached its
consistent point even if all the blocks are already at a position
not-so-far from what has been registered in the control file.

I could just say "after WAL replay has completed to a consistent state"?

I still would not change this paragraph. The first sentence means
that we have an equivalency, because that's the case if you think
about it as we make sure that the target is able to sync with the
source, and the target gets into a state where it as an on-disk state
equivalent to the target up to the minimum consistency point defined
in the control file once the tool has done its work (this last point
is too precise to be included in a global description to be honest).
And the second sentence makes clear what are the actual diffs are.

+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are

And here we could mention that all the blocks copied from the source
are the ones which are found in the WAL records of the target until
the end of WAL of its timeline. Still, that's basically what is
mentioned in the first part of "How It Works", which explains things
better. I honestly don't really see that all this paragraph is an
improvement over the simplicity of the original when it comes to
understand the global idea of what pg_rewind does.

The problem with the original is that while simple, it's actually
incorrect in that simplicity. Pg_rewind does *not* result in the data
directory on the target matching the data directory on the source.

That's not what I get from the original docs, but I may be too much
used to it.

+  <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>

No objections regarding that part. Now it seems to me that we had
better apply that to the last part of "How it works" instead? I kind
of agree that the last paragraph could provide more details regarding
the risks of overwriting the wanted configuration. The existing docs
also mention that pg_rewind only creates a backup_label file to start
recovery, perhaps we could mention up to which point recovery happens
in this section? There is a bit more here than just "apply the WAL".

I'll look to see if there's a better place to put this.

Thanks. From what I can see, we could improve further the doc part
about how the tool works in details, especially regarding the
configuration files which may get overwritten and be more precise
about that.
--
Michael

#7James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#6)
Re: pg_rewind docs correction

On Tue, Sep 17, 2019 at 3:51 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Sep 15, 2019 at 10:36:04AM -0400, James Coleman wrote:

On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier <michael@paquier.xyz> wrote:

+    The rewind operation is not expected to result in a consistent data
+    directory state either internally to the node or with respect to the rest
+    of the cluster. Instead the resulting data directory will only be consistent
+    after WAL replay has completed to at least the LSN at which changed blocks
+    copied from the source were originally written on the source.

That's not necessarily true. pg_rewind enforces in the control file
of the target the minimum consistency LSN to be
pg_current_wal_insert_lsn() when using a live source or the last
checkpoint LSN for a stopped source, so while that sounds true from
the point of view of all the blocks copied, the control file may still
cause a complain that the target recovering has not reached its
consistent point even if all the blocks are already at a position
not-so-far from what has been registered in the control file.

I could just say "after WAL replay has completed to a consistent state"?

I still would not change this paragraph. The first sentence means
that we have an equivalency, because that's the case if you think
about it as we make sure that the target is able to sync with the
source, and the target gets into a state where it as an on-disk state
equivalent to the target up to the minimum consistency point defined
in the control file once the tool has done its work (this last point
is too precise to be included in a global description to be honest).
And the second sentence makes clear what are the actual diffs are.

+   the point at which the WAL timelines of the source and target diverged plus
+   the current state on the source of any blocks changed on the target after
+   that divergence. While only changed blocks from existing relation files are

And here we could mention that all the blocks copied from the source
are the ones which are found in the WAL records of the target until
the end of WAL of its timeline. Still, that's basically what is
mentioned in the first part of "How It Works", which explains things
better. I honestly don't really see that all this paragraph is an
improvement over the simplicity of the original when it comes to
understand the global idea of what pg_rewind does.

The problem with the original is that while simple, it's actually
incorrect in that simplicity. Pg_rewind does *not* result in the data
directory on the target matching the data directory on the source.

That's not what I get from the original docs, but I may be too much
used to it.

I don't agree that that's a valid equivalency. I myself spent a lot of
time trying to understand how this could possibly be true a while
back, and even looked at source code to be certain. I've asked other
people and found the same confusion.

As I read it the 2nd second sentence doesn't actually tell you the
differences; it makes a quick attempt at summarizing *how* the first
sentence is true, but if the first sentence isn't accurate, then it's
hard to read the 2nd one as helping.

If you'd prefer something less detailed at this point at that point in
the docs, then something along the lines of "results in a data
directory state which can then be safely replayed from the source" or
some such.

The docs shouldn't be correct just for someone how already understands
the intricacies. And the end user shouldn't have to read the "how it
works" (which incidentally is kinda hidden at the bottom underneath
the CLI args -- perhaps we could move that?) to extrapolate things in
the primary documentation.

James Coleman

#8Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#7)
Re: pg_rewind docs correction

On Tue, Sep 17, 2019 at 08:38:18AM -0400, James Coleman wrote:

I don't agree that that's a valid equivalency. I myself spent a lot of
time trying to understand how this could possibly be true a while
back, and even looked at source code to be certain. I've asked other
people and found the same confusion.

As I read it the 2nd second sentence doesn't actually tell you the
differences; it makes a quick attempt at summarizing *how* the first
sentence is true, but if the first sentence isn't accurate, then it's
hard to read the 2nd one as helping.

Well, then it comes back to the part where I am used to the existing
docs :)

If you'd prefer something less detailed at this point at that point in
the docs, then something along the lines of "results in a data
directory state which can then be safely replayed from the source" or
some such.

Actually this is a good suggestion, and could replace the first
sentence of this paragraph.

The docs shouldn't be correct just for someone how already understands
the intricacies. And the end user shouldn't have to read the "how it
works" (which incidentally is kinda hidden at the bottom underneath
the CLI args -- perhaps we could move that?) to extrapolate things in
the primary documentation.

Perhaps. This doc page is not that long either.
--
Michael

#9James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#8)
1 attachment(s)
Re: pg_rewind docs correction

On Tue, Sep 17, 2019 at 9:41 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Sep 17, 2019 at 08:38:18AM -0400, James Coleman wrote:

I don't agree that that's a valid equivalency. I myself spent a lot of
time trying to understand how this could possibly be true a while
back, and even looked at source code to be certain. I've asked other
people and found the same confusion.

As I read it the 2nd second sentence doesn't actually tell you the
differences; it makes a quick attempt at summarizing *how* the first
sentence is true, but if the first sentence isn't accurate, then it's
hard to read the 2nd one as helping.

Well, then it comes back to the part where I am used to the existing
docs :)

If you'd prefer something less detailed at this point at that point in
the docs, then something along the lines of "results in a data
directory state which can then be safely replayed from the source" or
some such.

Actually this is a good suggestion, and could replace the first
sentence of this paragraph.

The docs shouldn't be correct just for someone how already understands
the intricacies. And the end user shouldn't have to read the "how it
works" (which incidentally is kinda hidden at the bottom underneath
the CLI args -- perhaps we could move that?) to extrapolate things in
the primary documentation.

Perhaps. This doc page is not that long either.

I'd set this aside for quite a while, but I was looking at it again this
afternoon, and I've come to see your concern about the opening paragraphs
remaining relatively simple. To that end I believe I've come up with a
patch that's a good compromise: retaining that simplicity and being more
clear and accurate at the same time.

In the first paragraph I've updated it to refer to both "successful rewind
and subsequent WAL replay" and the result I describe as being equivalent to
the result of a base backup, since that's more technically correct anyway
(the current text could be read as implying a full out copy of the data
directory, but that's not really true just as it isn't with pg_basebackup).

I've added the information about how the backup label control file is
written, and updated the How It Works steps to refer to that separately
from restart.

Additionally the How It Works is updated to include WAL segments and new
relation files in the list of files copied wholesale, since that was
previously stated but somewhat contradicted there.

I realized I didn't previously add this to the CF; since it's not a new
patch I've added it to the current CF, but if this is incorrect please let
me know.

Thanks,
James

Attachments:

v3-0001-Improve-pg_rewind-explanation-and-warnings.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Improve-pg_rewind-explanation-and-warnings.patchDownload
From 592ba15c35bb16e55b0bb0a7e7bdbb6dd4e08a0b Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sun, 8 Mar 2020 16:39:45 -0400
Subject: [PATCH v3] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Additionally the state isn't equal to a copy of the source data
directory; it's equivalent to a base backup of the source.

The How It Works section now:
- Includes details about how the backup label file is created.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 87 ++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..bc6f0009cc 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind and subsequent WAL replay, the target data
+   directory is equivalent to a base backup of the source data directory. While
+   only changed blocks from existing relation files are copied; all other files
+   are copied in full, including new relation files, configuration files, and WAL
+   segments. The advantage of <application>pg_rewind</application> over taking a
+   new base backup, or tools like <application>rsync</application>, is that
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. As such the rewind operation is
+   significantly faster than other approaches when the database is large and
+   only a small fraction of blocks differ between the clusters.
   </para>
 
   <para>
@@ -77,16 +79,18 @@ PostgreSQL documentation
   </para>
 
   <para>
-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</application> was run, and therefore could not be copied by the
-   <application>pg_rewind</application> session, it must be made available when the
-   target server is started. This can be done by creating a
-   <filename>recovery.signal</filename> file in the target data directory
-   and configuring suitable <xref linkend="guc-restore-command"/>
-   in <filename>postgresql.conf</filename>.
+   After running <application>pg_rewind</application> the data directory is
+   not immediately in a consistent state. However
+   <application>pg_rewind</application> configures the control file so that when
+   the target server is started again it will enter recovery mode and replay all
+   WAL generated in the source server after the point of divergence. If some of
+   the WAL was no longer available in the source server when
+   <application>pg_rewind</application> was run, and therefore could not be
+   copied by the <application>pg_rewind</application> session, it must be made
+   available when the target server is started. This can be done by creating a
+   <filename>recovery.signal</filename> file in the target data directory and
+   configuring suitable <xref linkend="guc-restore-command"/> in
+   <filename>postgresql.conf</filename>.
   </para>
 
   <para>
@@ -105,6 +109,15 @@ PostgreSQL documentation
     recovered.  In such a case, taking a new fresh backup is recommended.
    </para>
 
+   <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>
+
    <para>
     <application>pg_rewind</application> will fail immediately if it finds
     files it cannot write directly to.  This can happen for example when
@@ -326,34 +339,42 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       Copy all those changed blocks from the source cluster to
       the target cluster, either using direct file system access
       (<option>--source-pgdata</option>) or SQL (<option>--source-server</option>).
+      The relation files are now to their state at the last checkpoint completed
+      prior to the point at which the WAL timelines of the source and target
+      diverged plus the current state on the source of any blocks changed on the
+      target after that divergence.
      </para>
     </step>
     <step>
      <para>
-      Copy all other files such as <filename>pg_xact</filename> and
-      configuration files from the source cluster to the target cluster
-      (everything except the relation files). Similarly to base backups,
-      the contents of the directories <filename>pg_dynshmem/</filename>,
+      Copy all other files, including new relation files, WAL segments,
+      <filename>pg_xact</filename>, and configuration files from the source
+      cluster to the target cluster. Similarly to base backups, the contents
+      of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files
       <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
-      <filename>postmaster.opts</filename> and
-      <filename>postmaster.pid</filename>.
+      <filename>postmaster.opts</filename>, and
+      <filename>postmaster.pid</filename>, as well as any file or directory
+      beginning with <filename>pgsql_tmp</filename>, are omitted.
+     </para>
+    </step>
+    <step>
+     <para>
+      Create a backup label file to begin WAL replay at the checkpoint created
+      at failover and  a minimum consistency LSN using
+      <literal>pg_current_wal_insert_lsn()</literal>, when using a live source
+      and the last checkpoint LSN, when using a stopped source.
      </para>
     </step>
     <step>
      <para>
-      Apply the WAL from the source cluster, starting from the checkpoint
-      created at failover. (Strictly speaking, <application>pg_rewind</application>
-      doesn't apply the WAL, it just creates a backup label file that
-      makes <productname>PostgreSQL</productname> start by replaying all WAL from
-      that checkpoint forward.)
+      On restart, <productname>PostgreSQL</productname> replays the required WAL
+      resulting in a consistent data directory state.
      </para>
     </step>
    </procedure>

base-commit: 691e8b2e1889d61df47ae76601fa9db6cbac6f1c
-- 
2.17.1

#10James Coleman
jtc331@gmail.com
In reply to: James Coleman (#9)
Re: pg_rewind docs correction

On Sun, Mar 8, 2020 at 5:13 PM James Coleman <jtc331@gmail.com> wrote:

I realized I didn't previously add this to the CF; since it's not a new
patch I've added it to the current CF, but if this is incorrect please let
me know.

Hmm, looks like I can't add it to the current one. I added it to the next
one. I think it could probably go now, since the patch is really 6 months
old, but either way is fine -- it's just a docs patch.

James

#11Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#9)
Re: pg_rewind docs correction

On Sun, Mar 08, 2020 at 05:13:21PM -0400, James Coleman wrote:

I've added the information about how the backup label control file is
written, and updated the How It Works steps to refer to that separately
from restart.

Additionally the How It Works is updated to include WAL segments and new
relation files in the list of files copied wholesale, since that was
previously stated but somewhat contradicted there.

-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind and subsequent WAL replay, the target data
+   directory is equivalent to a base backup of the source data directory. While
+   only changed blocks from existing relation files are copied; all other files
+   are copied in full, including new relation files, configuration files, and WAL
+   segments. The advantage of <application>pg_rewind</application> over taking a

The first sentence you are adding refers to "subsequent WAL replay".
However, this paragraph emphasizes with the state of the target
cluster after running pg_rewind but *before* make the target cluster
start recovery. So shouldn't you just remove the part "and subsequent
WAL replay" from your first new sentence?

In the same paragraph, I think that you should remove the "While" from
"While only changed blocks", as the second part of the sentence refers
to the other files, WAl segments, etc.

The second paragraph of the docs regarding timeline lookup is
unchanged, which is fine.

-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
+   After running <application>pg_rewind</application> the data directory is
+   not immediately in a consistent state. However
+   <application>pg_rewind</application> configures the control file so that when
+   the target server is started again it will enter recovery mode and replay all
+   WAL generated in the source server after the point of divergence.

The second part of the third paragraph is not changed, and the
modification you are doing here is about the control file. I am
still unconvinced that this is a good change, because mentioning the
control file would be actually more adapted to the part "How it
works", where you are adding details about the backup_label file, and
already include details about the minimum consistency LSN itself
stored in the control file.

+   <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>

True that this is not outlined enough.

+      The relation files are now to their state at the last checkpoint completed
+      prior to the point at which the WAL timelines of the source and target
+      diverged plus the current state on the source of any blocks changed on the
+      target after that divergence.

"Relation files are now in a state equivalent to the moment of the
last completed checkpoint prior to the point.."?

-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files

This is just reorganizing an existing list, why?

+      Create a backup label file to begin WAL replay at the checkpoint created
+      at failover and  a minimum consistency LSN using
+      <literal>pg_current_wal_insert_lsn()</literal>, when using a live source
+      and the last checkpoint LSN, when using a stopped source.

Now would be the moment to mention the control file.

I realized I didn't previously add this to the CF; since it's not a new
patch I've added it to the current CF, but if this is incorrect please let
me know.

The last CF of Postgres 13 began at the beginning of February :(
--
Michael

#12James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#11)
1 attachment(s)
Re: pg_rewind docs correction

On Mon, Mar 9, 2020 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Mar 08, 2020 at 05:13:21PM -0400, James Coleman wrote:

I've added the information about how the backup label control file is
written, and updated the How It Works steps to refer to that separately
from restart.

Additionally the How It Works is updated to include WAL segments and new
relation files in the list of files copied wholesale, since that was
previously stated but somewhat contradicted there.

-   The result is equivalent to replacing the target data directory with
the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new
base backup, or
-   tools like <application>rsync</application>, is that
<application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind and subsequent WAL replay, the target data
+   directory is equivalent to a base backup of the source data directory.
While
+   only changed blocks from existing relation files are copied; all other
files
+   are copied in full, including new relation files, configuration files,
and WAL
+   segments. The advantage of <application>pg_rewind</application> over
taking a

The first sentence you are adding refers to "subsequent WAL replay".
However, this paragraph emphasizes with the state of the target
cluster after running pg_rewind but *before* make the target cluster
start recovery. So shouldn't you just remove the part "and subsequent
WAL replay" from your first new sentence?

I'd originally typed this:
I'm not sure I follow. After pg_rewind but before replay the directory is
*not* equivalent to a base backup. I don't see how paragraph is clearly
limited to describing what pg_rewind does. While the 2nd sentence is about
pg_rewind steps specifically, the paragraph (even in the original) goes on
to compare it to a base backup so we're talking about the operation in
totality not just the one tool.

But I realized while typing it that I was probably missing something of
what you were getting at: is the hangup on calling out the WAL replay that
a base backup (or rsync even) *also* requires WAL reply to reach a
consistent state? I hadn't thought of that while writing this initially, so
I've updated the patch to eliminate that part but also to make the analogy
to base backups more direct, since it's helpful in understanding what
result the tool is trying to accomplish and how it differs.

In the same paragraph, I think that you should remove the "While" from

"While only changed blocks", as the second part of the sentence refers
to the other files, WAl segments, etc.

Fixed as part of the above.

The second paragraph of the docs regarding timeline lookup is
unchanged, which is fine.

-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode
and replay all
-   WAL generated in the source server after the point of divergence.
+   After running <application>pg_rewind</application> the data directory
is
+   not immediately in a consistent state. However
+   <application>pg_rewind</application> configures the control file so
that when
+   the target server is started again it will enter recovery mode and
replay all
+   WAL generated in the source server after the point of divergence.

The second part of the third paragraph is not changed, and the
modification you are doing here is about the control file. I am
still unconvinced that this is a good change, because mentioning the
control file would be actually more adapted to the part "How it
works", where you are adding details about the backup_label file, and
already include details about the minimum consistency LSN itself
stored in the control file.

I've removed the control file reference and instead continued the analogy
to base backups.

+   <para>
+    Because <application>pg_rewind</application> copies configuration
files
+    entirely from the source, correcting recovery configuration options
before
+    restarting the server is necessary if you intend to re-introduce the
target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target
will
+    again diverge from the primary.
+   </para>

True that this is not outlined enough.

Thanks.

+      The relation files are now to their state at the last checkpoint
completed
+      prior to the point at which the WAL timelines of the source and
target
+      diverged plus the current state on the source of any blocks changed
on the
+      target after that divergence.

"Relation files are now in a state equivalent to the moment of the
last completed checkpoint prior to the point.."?

Updated.

-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and
<filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files

This is just reorganizing an existing list, why?

The grammar seemed a bit awkward to me, so while I was already reworking
this paragraph I tried to clean that up a bit.

+      Create a backup label file to begin WAL replay at the checkpoint
created
+      at failover and  a minimum consistency LSN using
+      <literal>pg_current_wal_insert_lsn()</literal>, when using a live
source
+      and the last checkpoint LSN, when using a stopped source.

Now would be the moment to mention the control file.

I made that more explicit here, and also referenced the filenames directly
(and with tags).

I realized I didn't previously add this to the CF; since it's not a new
patch I've added it to the current CF, but if this is incorrect please

let

me know.

The last CF of Postgres 13 began at the beginning of February :(

Still ongoing, correct? I guess I mentally think of them as being only one
month, but I guess that's not actually true. Regardless I'm not sure what
policy is for patches that have been in flight in hackers for a while but
just missed being added to the CF app.

Thanks,
James

Attachments:

v4-0001-Improve-pg_rewind-explanation-and-warnings.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Improve-pg_rewind-explanation-and-warnings.patchDownload
From e2d10defaaefae116e21eebdf2302b1785d43e50 Mon Sep 17 00:00:00 2001
From: James Coleman <jtc331@gmail.com>
Date: Sun, 8 Mar 2020 16:39:45 -0400
Subject: [PATCH v4] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Instead using the analogy to back backups helps explain the state,
as well as the pros and cons of using the utility.

The How It Works section now:
- Includes details about how the backup_label file is created.
- Includes details about how the pg_control file is updated.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 87 ++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..c3a2775abc 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind the state of the target data directory is analogous
+   to a base backup of the source data directory. Unlike taking a new base
+   backup or using a tool like <application>rsync</application>,
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. Only changed blocks from existing
+   relation files are copied; all other files, including new relation files,
+   configuration files, and WAL segments, are copied in full. As such the
+   rewind operation is significantly faster than other approaches when the
+   database is large and only a small fraction of blocks differ between the
+   clusters.
   </para>
 
   <para>
@@ -77,16 +79,17 @@ PostgreSQL documentation
   </para>
 
   <para>
-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</application> was run, and therefore could not be copied by the
-   <application>pg_rewind</application> session, it must be made available when the
-   target server is started. This can be done by creating a
-   <filename>recovery.signal</filename> file in the target data directory
-   and configuring suitable <xref linkend="guc-restore-command"/>
-   in <filename>postgresql.conf</filename>.
+   As with a base backup, after running <application>pg_rewind</application>
+   WAL replay needs to complete for the data directory to be in a consistent
+   state. When the target server is started again it will enter recovery mode
+   and replay all WAL generated in the source server after the point of
+   divergence. If some of the WAL was no longer available in the source server
+   when <application>pg_rewind</application> was run, and therefore could not be
+   copied by the <application>pg_rewind</application> session, it must be made
+   available when the target server is started. This can be done by creating a
+   <filename>recovery.signal</filename> file in the target data directory and
+   configuring suitable <xref linkend="guc-restore-command"/> in
+   <filename>postgresql.conf</filename>.
   </para>
 
   <para>
@@ -105,6 +108,15 @@ PostgreSQL documentation
     recovered.  In such a case, taking a new fresh backup is recommended.
    </para>
 
+   <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>
+
    <para>
     <application>pg_rewind</application> will fail immediately if it finds
     files it cannot write directly to.  This can happen for example when
@@ -326,34 +338,43 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       Copy all those changed blocks from the source cluster to
       the target cluster, either using direct file system access
       (<option>--source-pgdata</option>) or SQL (<option>--source-server</option>).
+      Relation files are now in a state equivalent to the moment of the last
+      completed checkpoint prior to the point at which the WAL timelines of the
+      source and target diverged plus the current state on the source of any
+      blocks changed on the target after that divergence.
      </para>
     </step>
     <step>
      <para>
-      Copy all other files such as <filename>pg_xact</filename> and
-      configuration files from the source cluster to the target cluster
-      (everything except the relation files). Similarly to base backups,
-      the contents of the directories <filename>pg_dynshmem/</filename>,
+      Copy all other files, including new relation files, WAL segments,
+      <filename>pg_xact</filename>, and configuration files from the source
+      cluster to the target cluster. Similarly to base backups, the contents
+      of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files
       <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
-      <filename>postmaster.opts</filename> and
-      <filename>postmaster.pid</filename>.
+      <filename>postmaster.opts</filename>, and
+      <filename>postmaster.pid</filename>, as well as any file or directory
+      beginning with <filename>pgsql_tmp</filename>, are omitted.
+     </para>
+    </step>
+    <step>
+     <para>
+      Create a <filename>backup_label</filename> file to begin WAL replay at
+      the checkpoint created at failover and configure the
+      <filename>pg_control</filename> file with a minimum consistency LSN using
+      <literal>pg_current_wal_insert_lsn()</literal> from a live source and the
+      last checkpoint LSN from a stopped source.
      </para>
     </step>
     <step>
      <para>
-      Apply the WAL from the source cluster, starting from the checkpoint
-      created at failover. (Strictly speaking, <application>pg_rewind</application>
-      doesn't apply the WAL, it just creates a backup label file that
-      makes <productname>PostgreSQL</productname> start by replaying all WAL from
-      that checkpoint forward.)
+      On restart, <productname>PostgreSQL</productname> replays the required WAL
+      resulting in a consistent data directory state.
      </para>
     </step>
    </procedure>

base-commit: 691e8b2e1889d61df47ae76601fa9db6cbac6f1c
-- 
2.17.1

#13Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#12)
1 attachment(s)
Re: pg_rewind docs correction

On Mon, Mar 09, 2020 at 09:26:17AM -0400, James Coleman wrote:

-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and
<filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files

This is just reorganizing an existing list, why?

The grammar seemed a bit awkward to me, so while I was already reworking
this paragraph I tried to clean that up a bit.

Thanks for the new patch, and sorry for the delay.

Okay, I saw what you were coming at here, with one sentence for
directories, and one for files.

Still ongoing, correct? I guess I mentally think of them as being only one
month, but I guess that's not actually true. Regardless I'm not sure what
policy is for patches that have been in flight in hackers for a while but
just missed being added to the CF app.

This is a documentation patch, so improving this part of the docs now
is fine by me, particularly as this is an improvement. Here are more
notes from me:
- I have removed the "As with a base backup" at the beginning of the
second paragraph you modified. The first paragraph modified already
references a base backup, so one reference is enough IMO.
- WAL replay does not happen from the WAL position where WAL diverged,
but from the last checkpoint before WAL diverged.
- Did some tweaks about the new part for configuration files, as it
may actually not be necessary to update the configuration for recovery
to complete (depending on the settings of the source, the target may
just require the creation of a standby.signal file in its data
directory particularly with a common archive location for multiple
clusters).
- Some word-smithing in the step-by-step description.

Is the updated version fine for you?
--
Michael

Attachments:

v5-0001-Improve-pg_rewind-explanation-and-warnings.patchtext/x-diff; charset=us-asciiDownload
From 30d0e80e8e777c9b1c3f34aa281f9623e61ea17c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 28 Apr 2020 13:29:26 +0900
Subject: [PATCH v5] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Instead using the analogy to back backups helps explain the state,
as well as the pros and cons of using the utility.

The How It Works section now:
- Includes details about how the backup_label file is created.
- Includes details about how the pg_control file is updated.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 90 +++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 07c49e4719..9525e09c98 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind, the state of the target data directory is
+   analogous to a base backup of the source data directory. Unlike taking
+   a new base backup or using a tool like <application>rsync</application>,
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. Only changed blocks from existing
+   relation files are copied; all other files, including new relation files,
+   configuration files, and WAL segments, are copied in full. As such the
+   rewind operation is significantly faster than other approaches when the
+   database is large and only a small fraction of blocks differ between the
+   clusters.
   </para>
 
   <para>
@@ -77,16 +79,18 @@ PostgreSQL documentation
   </para>
 
   <para>
-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</application> was run, and therefore could not be copied by the
-   <application>pg_rewind</application> session, it must be made available when the
-   target server is started. This can be done by creating a
-   <filename>recovery.signal</filename> file in the target data directory
-   and configuring suitable <xref linkend="guc-restore-command"/>
-   in <filename>postgresql.conf</filename>.
+   After running <application>pg_rewind</application>, WAL replay needs to
+   complete for the data directory to be in a consistent state. When the
+   target server is started again it will enter archive recovery and replay
+   all WAL generated in the source server from the last checkpoint before
+   the point of divergence. If some of the WAL was no longer available in the
+   source server when <application>pg_rewind</application> was run, and
+   therefore could not be copied by the <application>pg_rewind</application>
+   session, it must be made available when the target server is started.
+   This can be done by creating a <filename>recovery.signal</filename> file
+   in the target data directory and by configuring a suitable
+   <xref linkend="guc-restore-command"/> in
+   <filename>postgresql.conf</filename>.
   </para>
 
   <para>
@@ -105,6 +109,15 @@ PostgreSQL documentation
     recovered.  In such a case, taking a new fresh backup is recommended.
    </para>
 
+   <para>
+    As <application>pg_rewind</application> copies configuration files
+    entirely from the source, it may be required to correct the configuration
+    used for recovery before restarting the target server, especially the
+    the target is reintroduced as a standby of the source. If you restart
+    the server after the rewind operation has finished but without configuring
+    recovery, the target may again diverge from the primary.
+   </para>
+
    <para>
     <application>pg_rewind</application> will fail immediately if it finds
     files it cannot write directly to.  This can happen for example when
@@ -342,34 +355,45 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       Copy all those changed blocks from the source cluster to
       the target cluster, either using direct file system access
       (<option>--source-pgdata</option>) or SQL (<option>--source-server</option>).
+      Relation files are now in a state equivalent to the moment of the last
+      completed checkpoint prior to the point at which the WAL timelines of the
+      source and target diverged plus the current state on the source of any
+      blocks changed on the target after that divergence.
      </para>
     </step>
     <step>
      <para>
-      Copy all other files such as <filename>pg_xact</filename> and
-      configuration files from the source cluster to the target cluster
-      (everything except the relation files). Similarly to base backups,
-      the contents of the directories <filename>pg_dynshmem/</filename>,
+      Copy all other files, including new relation files, WAL segments,
+      <filename>pg_xact</filename>, and configuration files from the source
+      cluster to the target cluster. Similarly to base backups, the contents
+      of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files
       <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
-      <filename>postmaster.opts</filename> and
-      <filename>postmaster.pid</filename>.
+      <filename>postmaster.opts</filename>, and
+      <filename>postmaster.pid</filename>, as well as any file or directory
+      beginning with <filename>pgsql_tmp</filename>, are omitted.
      </para>
     </step>
     <step>
      <para>
-      Apply the WAL from the source cluster, starting from the checkpoint
-      created at failover. (Strictly speaking, <application>pg_rewind</application>
-      doesn't apply the WAL, it just creates a backup label file that
-      makes <productname>PostgreSQL</productname> start by replaying all WAL from
-      that checkpoint forward.)
+      Create a <filename>backup_label</filename> file to begin WAL replay at
+      the checkpoint created at failover and configure the
+      <filename>pg_control</filename> file with a minimum consistency LSN
+      defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
+      when rewinding from a live source and using the last checkpoint LSN
+      when rewinding from a stopped source.
+     </para>
+    </step>
+    <step>
+     <para>
+      When starting the target, <productname>PostgreSQL</productname> replays
+      all the required WAL, resulting in a data directory in a consistent
+      state.
      </para>
     </step>
    </procedure>
-- 
2.26.2

#14James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#13)
Re: pg_rewind docs correction

On Tue, Apr 28, 2020 at 12:31 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Mar 09, 2020 at 09:26:17AM -0400, James Coleman wrote:

-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and
<filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files

This is just reorganizing an existing list, why?

The grammar seemed a bit awkward to me, so while I was already reworking
this paragraph I tried to clean that up a bit.

Thanks for the new patch, and sorry for the delay.

Okay, I saw what you were coming at here, with one sentence for
directories, and one for files.

Still ongoing, correct? I guess I mentally think of them as being only one
month, but I guess that's not actually true. Regardless I'm not sure what
policy is for patches that have been in flight in hackers for a while but
just missed being added to the CF app.

This is a documentation patch, so improving this part of the docs now
is fine by me, particularly as this is an improvement. Here are more
notes from me:
- I have removed the "As with a base backup" at the beginning of the
second paragraph you modified. The first paragraph modified already
references a base backup, so one reference is enough IMO.
- WAL replay does not happen from the WAL position where WAL diverged,
but from the last checkpoint before WAL diverged.
- Did some tweaks about the new part for configuration files, as it
may actually not be necessary to update the configuration for recovery
to complete (depending on the settings of the source, the target may
just require the creation of a standby.signal file in its data
directory particularly with a common archive location for multiple
clusters).
- Some word-smithing in the step-by-step description.

Is the updated version fine for you?

In your revised patched the follow paragraph:

+   <para>
+    As <application>pg_rewind</application> copies configuration files
+    entirely from the source, it may be required to correct the configuration
+    used for recovery before restarting the target server, especially the
+    the target is reintroduced as a standby of the source. If you restart
+    the server after the rewind operation has finished but without configuring
+    recovery, the target may again diverge from the primary.
+   </para>

I think is missing a word. Instead of "especially the the target"
should be "especially if the target".

In this block:

+      Create a <filename>backup_label</filename> file to begin WAL replay at
+      the checkpoint created at failover and configure the
+      <filename>pg_control</filename> file with a minimum consistency LSN
+      defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
+      when rewinding from a live source and using the last checkpoint LSN
+      when rewinding from a stopped source.
+     </para>

Perhaps change "and using the last checkpoint LSN" to "or the last
checkpoint LSN". Alternatively you could make the grammar parallel by
changing to "and defined as the last checkpoint LSN", but that seems
wordy, and the "defined as [item or item]" is already a good grammar
construction.

Other than those two small things, your proposed revision looks good to me.

Thanks,
James

#15Michael Paquier
michael@paquier.xyz
In reply to: James Coleman (#14)
1 attachment(s)
Re: pg_rewind docs correction

On Tue, Apr 28, 2020 at 12:13:38PM -0400, James Coleman wrote:

I think is missing a word. Instead of "especially the the target"
should be "especially if the target".

Thanks, fixed.

In this block:

+      Create a <filename>backup_label</filename> file to begin WAL replay at
+      the checkpoint created at failover and configure the
+      <filename>pg_control</filename> file with a minimum consistency LSN
+      defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
+      when rewinding from a live source and using the last checkpoint LSN
+      when rewinding from a stopped source.
+     </para>

Perhaps change "and using the last checkpoint LSN" to "or the last
checkpoint LSN". Alternatively you could make the grammar parallel by
changing to "and defined as the last checkpoint LSN", but that seems
wordy, and the "defined as [item or item]" is already a good grammar
construction.

Using your first suggestion of "or the last checkpoint LSN" sounds
more natural as of this morning, so updated the patch with that.

I am letting that aside for a couple of days to see if others have
more comments, and will likely commit it after an extra lookup.
--
Michael

Attachments:

v6-0001-Improve-pg_rewind-explanation-and-warnings.patchtext/x-diff; charset=us-asciiDownload
From 48b9e85fe9ae414a0f42881558ab2fd920ba0c60 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 29 Apr 2020 09:11:28 +0900
Subject: [PATCH v6] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Instead using the analogy to back backups helps explain the state,
as well as the pros and cons of using the utility.

The How It Works section now:
- Includes details about how the backup_label file is created.
- Includes details about how the pg_control file is updated.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 90 +++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 07c49e4719..391e5db2e2 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind, the state of the target data directory is
+   analogous to a base backup of the source data directory. Unlike taking
+   a new base backup or using a tool like <application>rsync</application>,
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. Only changed blocks from existing
+   relation files are copied; all other files, including new relation files,
+   configuration files, and WAL segments, are copied in full. As such the
+   rewind operation is significantly faster than other approaches when the
+   database is large and only a small fraction of blocks differ between the
+   clusters.
   </para>
 
   <para>
@@ -77,16 +79,18 @@ PostgreSQL documentation
   </para>
 
   <para>
-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</application> was run, and therefore could not be copied by the
-   <application>pg_rewind</application> session, it must be made available when the
-   target server is started. This can be done by creating a
-   <filename>recovery.signal</filename> file in the target data directory
-   and configuring suitable <xref linkend="guc-restore-command"/>
-   in <filename>postgresql.conf</filename>.
+   After running <application>pg_rewind</application>, WAL replay needs to
+   complete for the data directory to be in a consistent state. When the
+   target server is started again it will enter archive recovery and replay
+   all WAL generated in the source server from the last checkpoint before
+   the point of divergence. If some of the WAL was no longer available in the
+   source server when <application>pg_rewind</application> was run, and
+   therefore could not be copied by the <application>pg_rewind</application>
+   session, it must be made available when the target server is started.
+   This can be done by creating a <filename>recovery.signal</filename> file
+   in the target data directory and by configuring a suitable
+   <xref linkend="guc-restore-command"/> in
+   <filename>postgresql.conf</filename>.
   </para>
 
   <para>
@@ -105,6 +109,15 @@ PostgreSQL documentation
     recovered.  In such a case, taking a new fresh backup is recommended.
    </para>
 
+   <para>
+    As <application>pg_rewind</application> copies configuration files
+    entirely from the source, it may be required to correct the configuration
+    used for recovery before restarting the target server, especially if
+    the target is reintroduced as a standby of the source. If you restart
+    the server after the rewind operation has finished but without configuring
+    recovery, the target may again diverge from the primary.
+   </para>
+
    <para>
     <application>pg_rewind</application> will fail immediately if it finds
     files it cannot write directly to.  This can happen for example when
@@ -342,34 +355,45 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       Copy all those changed blocks from the source cluster to
       the target cluster, either using direct file system access
       (<option>--source-pgdata</option>) or SQL (<option>--source-server</option>).
+      Relation files are now in a state equivalent to the moment of the last
+      completed checkpoint prior to the point at which the WAL timelines of the
+      source and target diverged plus the current state on the source of any
+      blocks changed on the target after that divergence.
      </para>
     </step>
     <step>
      <para>
-      Copy all other files such as <filename>pg_xact</filename> and
-      configuration files from the source cluster to the target cluster
-      (everything except the relation files). Similarly to base backups,
-      the contents of the directories <filename>pg_dynshmem/</filename>,
+      Copy all other files, including new relation files, WAL segments,
+      <filename>pg_xact</filename>, and configuration files from the source
+      cluster to the target cluster. Similarly to base backups, the contents
+      of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files
       <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
-      <filename>postmaster.opts</filename> and
-      <filename>postmaster.pid</filename>.
+      <filename>postmaster.opts</filename>, and
+      <filename>postmaster.pid</filename>, as well as any file or directory
+      beginning with <filename>pgsql_tmp</filename>, are omitted.
      </para>
     </step>
     <step>
      <para>
-      Apply the WAL from the source cluster, starting from the checkpoint
-      created at failover. (Strictly speaking, <application>pg_rewind</application>
-      doesn't apply the WAL, it just creates a backup label file that
-      makes <productname>PostgreSQL</productname> start by replaying all WAL from
-      that checkpoint forward.)
+      Create a <filename>backup_label</filename> file to begin WAL replay at
+      the checkpoint created at failover and configure the
+      <filename>pg_control</filename> file with a minimum consistency LSN
+      defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
+      when rewinding from a live source or the last checkpoint LSN when
+      rewinding from a stopped source.
+     </para>
+    </step>
+    <step>
+     <para>
+      When starting the target, <productname>PostgreSQL</productname> replays
+      all the required WAL, resulting in a data directory in a consistent
+      state.
      </para>
     </step>
    </procedure>
-- 
2.26.2

#16Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#15)
Re: pg_rewind docs correction

On Wed, Apr 29, 2020 at 09:15:06AM +0900, Michael Paquier wrote:

I am letting that aside for a couple of days to see if others have
more comments, and will likely commit it after an extra lookup.

And applied after an extra lookup. Thanks for the discussion, James.
--
Michael

#17James Coleman
jtc331@gmail.com
In reply to: Michael Paquier (#16)
Re: pg_rewind docs correction

On Fri, May 1, 2020 at 4:46 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Apr 29, 2020 at 09:15:06AM +0900, Michael Paquier wrote:

I am letting that aside for a couple of days to see if others have
more comments, and will likely commit it after an extra lookup.

And applied after an extra lookup. Thanks for the discussion, James.

Yep. Thanks for pushing to make sure it was as correct as possible
while improving it.

James