Corner-case bug in pg_rewind
Hi
Take the following cluster with:
- node1 (initial primary)
- node2 (standby)
- node3 (standby)
The following activity takes place (greatly simplified from a real-world situation):
1. node1 is shut down.
2. node3 is promoted
3. node2 is attached to node3.
4. node1 is attached to node3
5. node1 is then promoted (creating a split brain situation with
node1 and node3 as primaries)
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach
to node1 as a standby. pg_rewind claims:
pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
pg_rewind: no rewind required
8. Based on that assurance, node2 is restarted with replication configuration
pointing to node1 - but it is unable to attach, with node2's log reporting
something like:
new timeline 3 forked off current database system timeline 2
before current recovery point X/XXXXXXX
The cause is that pg_rewind is assuming that if the node's last checkpoint
matches the divergence point, no rewind is needed:
if (chkptendrec == divergerec)
rewind_needed = false;
but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.
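For illustration, here's a minimal standalone sketch of the corrected decision
(the stub type and wrapper function exist only for this example; the actual
change is in the attached patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real PostgreSQL type */

/*
 * A rewind is also needed when minRecoveryPoint shows that WAL was
 * replayed past the end of the last checkpoint, even if that checkpoint
 * ends exactly at the divergence point.
 */
static bool
needs_rewind(XLogRecPtr chkptendrec, XLogRecPtr divergerec,
             XLogRecPtr minRecoveryPoint)
{
    if (minRecoveryPoint > chkptendrec)
        return true;    /* records exist beyond the last checkpoint */
    if (chkptendrec == divergerec)
        return false;   /* histories diverged exactly at the checkpoint end */
    return true;
}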
Attached patch addresses this. It includes a test, which doesn't make use of
the RewindTest module, as that hard-codes a primary and a standby, while here
three nodes are needed (I can't come up with a situation where this can be
reproduced with only two nodes). The test sets "wal_keep_size" so would need
modification for Pg12 and earlier.
Regards
Ian Barwick
--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
pg_rewind-minRecoveryPoint.v1.patch (text/x-patch)
commit ec0465014825628ec9b868703444214ac4738c53
Author: Ian Barwick <ian@2ndquadrant.com>
Date: Fri Sep 11 10:13:17 2020 +0900
pg_rewind: catch corner-case situation
It's possible that a standby, after diverging from the source node,
is shut down without a shutdown checkpoint record, and the divergence
point matches a shutdown checkpoint from a previous shutdown. In this
case the presence of WAL records beyond the shutdown checkpoint (as
indicated by minRecoveryPoint) needs to be detected in order to
determine whether a rewind is needed.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..393c8ebbcd 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -342,13 +342,20 @@ main(int argc, char **argv)
targetNentries - 1,
restore_command);
+ /*
+ * If the minimum recovery ending location is beyond the end of
+ * the last checkpoint, that means there are WAL records beyond
+ * the divergence point and a rewind is needed.
+ */
+ if (ControlFile_target.minRecoveryPoint > chkptendrec)
+ rewind_needed = true;
/*
* If the histories diverged exactly at the end of the shutdown
* checkpoint record on the target, there are no WAL records in
* the target that don't belong in the source's history, and no
* rewind is needed.
*/
- if (chkptendrec == divergerec)
+ else if (chkptendrec == divergerec)
rewind_needed = false;
else
rewind_needed = true;
diff --git a/src/bin/pg_rewind/t/007_min_recovery_point.pl b/src/bin/pg_rewind/t/007_min_recovery_point.pl
new file mode 100644
index 0000000000..ac842fd0ed
--- /dev/null
+++ b/src/bin/pg_rewind/t/007_min_recovery_point.pl
@@ -0,0 +1,118 @@
+#
+# Test situation where a target data directory contains
+# WAL records beyond both the last checkpoint and the divergence
+# point.
+#
+# This test does not make use of RewindTest as it requires three
+# nodes.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+my $node_1 = get_new_node('node_1');
+$node_1->init(allows_streaming => 1);
+$node_1->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_1->start;
+
+# Add an arbitrary table
+$node_1->safe_psql('postgres',
+ 'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_1->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_2 = get_new_node('node_2');
+$node_2->init_from_backup($node_1, $backup_name,
+ has_streaming => 1);
+
+$node_2->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_2->start;
+
+# Create streaming standby from backup
+my $node_3 = get_new_node('node_3');
+$node_3->init_from_backup($node_1, $backup_name,
+ has_streaming => 1);
+
+$node_3->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_3->start;
+
+# Stop node_1
+
+$node_1->stop('fast');
+
+# Promote node_3
+$node_3->promote;
+
+# node_1 rejoins node_3
+
+my $node_3_connstr = $node_3->connstr;
+
+$node_1->append_conf(
+ 'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_1->set_standby_mode();
+$node_1->start();
+
+# node_2 follows node_3
+
+$node_2->append_conf(
+ 'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_2->restart();
+
+# Promote node_1
+
+$node_1->promote;
+
+# We now have a split-brain with two primaries. Insert a row on both to
+# demonstratively create a split brain; this is not strictly necessary
+# for this test, but creates an easily identifiable WAL record and
+# enables us to verify that node_2 has the required changes to
+# reproduce the situation we're handling.
+
+$node_1->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (0)');
+$node_3->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (1)');
+
+$node_2->poll_query_until('postgres',
+ q|SELECT COUNT(*) > 0 FROM public.foo|, 't');
+
+# At this point node_2 will shut down without a shutdown checkpoint,
+# but with WAL entries beyond the preceding shutdown checkpoint.
+$node_2->stop('fast');
+$node_3->stop('fast');
+
+
+my $node_2_pgdata = $node_2->data_dir;
+my $node_1_connstr = $node_1->connstr;
+
+command_checks_all(
+ [
+ 'pg_rewind',
+ "--source-server=$node_1_connstr",
+ "--target-pgdata=$node_2_pgdata",
+ '--dry-run'
+ ],
+ 0,
+ [],
+ [qr|rewinding from last common checkpoint at|],
+ 'pg_rewind detects rewind needed');
+
On 11/09/2020 09:42, Ian Barwick wrote:
Take the following cluster with:
- node1 (initial primary)
- node2 (standby)
- node3 (standby)
The following activity takes place (greatly simplified from a real-world situation):
1. node1 is shut down.
2. node3 is promoted
3. node2 is attached to node3.
4. node1 is attached to node3
5. node1 is then promoted (creating a split brain situation with
node1 and node3 as primaries)
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach
to node1 as a standby. pg_rewind claims:
pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
pg_rewind: no rewind required
8. Based on that assurance, node2 is restarted with replication configuration
pointing to node1 - but it is unable to attach, with node2's log reporting
something like:
new timeline 3 forked off current database system timeline 2
before current recovery point X/XXXXXXX
The cause is that pg_rewind is assuming that if the node's last checkpoint
matches the divergence point, no rewind is needed:
if (chkptendrec == divergerec)
    rewind_needed = false;
but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.
Yep, I think you're right.
Attached patch addresses this. It includes a test, which doesn't make use of
the RewindTest module, as that hard-codes a primary and a standby, while here
three nodes are needed (I can't come up with a situation where this can be
reproduced with only two nodes). The test sets "wal_keep_size" so would need
modification for Pg12 and earlier.
I think we also need to change the extractPageMap() call:
/*
* Read the target WAL from last checkpoint before the point of fork, to
* extract all the pages that were modified on the target cluster after
* the fork. We can stop reading after reaching the final shutdown record.
* XXX: If we supported rewinding a server that was not shut down cleanly,
* we would need to replay until the end of WAL here.
*/
if (showprogress)
pg_log_info("reading WAL in target");
extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
ControlFile_target.checkPoint, restore_command);
filemap_finalize();
so that it scans all the way up to minRecoveryPoint, instead of stopping
at ControlFile_target.checkPoint. Otherwise, we will fail to rewind
changes that happened after the last checkpoint. And we need to do that
regardless of the "no rewind required" bug, any time we rewind a server
that's in DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the
existing tests have caught that. Am I missing something?
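Roughly, I mean something like this standalone sketch (the stub type and
helper name are illustrative only; in pg_rewind.c the value would be computed
in main() and passed to extractPageMap() as the endpoint):

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real PostgreSQL type */

/*
 * End of WAL on the target: the end of the last shutdown checkpoint
 * record, or minRecoveryPoint if the server was shut down while in
 * recovery.  extractPageMap() should read up to this point rather than
 * stopping at ControlFile_target.checkPoint.
 */
static XLogRecPtr
target_end_of_wal(XLogRecPtr chkptendrec, XLogRecPtr minRecoveryPoint)
{
    return (minRecoveryPoint > chkptendrec) ? minRecoveryPoint : chkptendrec;
}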
- Heikki
I put some effort into reviewing your patch, and it seems legitimate to me.
I think some minor things could be improved, i.e.:
1. The comment:
------
347 * Check for the possibility that the target is in fact a direct
348 * ancestor of the source. In that case, there is no divergent history
349 * in the target that needs rewinding.
------
would be better reformulated, since the block's contents mostly cover the
opposite case, where the target is NOT a direct ancestor of the source.
Maybe: "We need a rewind when .... and don't need one only if the target is
a direct ancestor of the source." I think it would be more understandable
either with a comment describing all the cases in the block, or with no
comment before the block at all and comments for each case on the if/else
branches inside it instead.
2. When I do the test with no patching of pg_rewind.c I get the output:
-----
# Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from
last common checkpoint at)/'
# at t/007_min_recovery_point.pl line 107.
# 'pg_rewind: servers diverged at WAL location 0/3000180
on timeline 2
# pg_rewind: no rewind required
# '
# doesn't match '(?^:rewinding from last common checkpoint at)'
# Looks like you failed 1 test of 2.
t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests
Test Summary Report
-------------------
t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1
-------
Maybe it can just give "failed" without so many details?
Also, I think Heikki's notion could be fulfilled.
Apart from this, I consider the patch clean and clear, the tests pass, and
I'd recommend it for commit after the minor improvements described.
Thank you!
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On Tue, 10 Nov 2020 at 18:07, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
I put some effort into reviewing your patch, and it seems legitimate to me.
Thanks for the review and feedback.
I think some minor things could be improved, i.e.:
1. The comment:
------
347 * Check for the possibility that the target is in fact a direct
348 * ancestor of the source. In that case, there is no divergent history
349 * in the target that needs rewinding.
------
would be better reformulated, since the block's contents mostly cover the
opposite case, where the target is NOT a direct ancestor of the source.
Maybe: "We need a rewind when .... and don't need one only if the target is
a direct ancestor of the source." I think it would be more understandable
either with a comment describing all the cases in the block, or with no
comment before the block at all and comments for each case on the if/else
branches inside it instead.
The comment you cite is not part of this patch. I suspect there might be some
scope for improving the comments in general, but that would need to be done
separately.
2. When I do the test with no patching of pg_rewind.c I get the output:
-----
# Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from last common checkpoint at)/'
# at t/007_min_recovery_point.pl line 107.
# 'pg_rewind: servers diverged at WAL location 0/3000180 on timeline 2
# pg_rewind: no rewind required
# '
# doesn't match '(?^:rewinding from last common checkpoint at)'
# Looks like you failed 1 test of 2.
t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests
Test Summary Report
-------------------
t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1
-------
Maybe it can just give "failed" without so many details?
That's just the way the tests work - an individual test has no influence
over that output.
Also, I think Heikki's notion could be fulfilled.
I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.
Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.
Regards
Ian Barwick
--
EnterpriseDB: https://www.enterprisedb.com
1. The comment:
------
347 * Check for the possibility that the target is in fact a direct
348 * ancestor of the source. In that case, there is no divergent history
349 * in the target that needs rewinding.
------
would be better reformulated, since the block's contents mostly cover the
opposite case, where the target is NOT a direct ancestor of the source.
Maybe: "We need a rewind when .... and don't need one only if the target is
a direct ancestor of the source." I think it would be more understandable
either with a comment describing all the cases in the block, or with no
comment before the block at all and comments for each case on the if/else
branches inside it instead.
The comment you cite is not part of this patch. I suspect there might be some
scope for improving the comments in general, but that would need to be done
separately.
Sure, but it is the comment describing the whole block the patch introduces
changes into. If it could be made more relevant, and you are planning to
revise it anyway, it would be great to change it as well. But I don't insist.
Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.
Thank you!
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On 16.11.2020 05:49, Ian Lawrence Barwick wrote:
Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.
Regards
Ian Barwick
--
EnterpriseDB: https://www.enterprisedb.com
Status update for a commitfest entry.
The patch is Waiting on Author for some time. As this is a bug fix, I am
moving it to the next CF.
Ian, are you planning to continue working on it?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Status update for a commitfest entry.
The patch is Waiting on Author for some time. As this is a bug fix, I am
moving it to the next CF.
Ian, are you planning to continue working on it?
As a reviewer, I consider the patch useful and good overall. The comments I
left were purely cosmetic. It's a pity that this bugfix has been delayed,
and has become outdated, for such a small reason. It would be nice to
complete this fix in the next CF.
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On 01/12/2020 16:52, Pavel Borisov wrote:
Status update for a commitfest entry.
The patch is Waiting on Author for some time. As this is a bug fix,
I am
moving it to the next CF.
Ian, are you planning to continue working on it?
As a reviewer, I consider the patch useful and good overall. The comments I
left were purely cosmetic. It's a pity that this bugfix has been delayed,
and has become outdated, for such a small reason. It would be nice to
complete this fix in the next CF.
Yeah, we really should fix this..
On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
Also, I think Heikki's notion could be fulfilled.
I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.
Attached are two patches. The first patch is your original patch,
unmodified (except for a cosmetic rename of the test file). The second
patch builds on that, demonstrating and fixing the issue I mentioned. It
took me a while to create a repro for it, it's easily masked by
incidental full-page writes or because rows created by XIDs that are not
marked as committed on the other timeline are invisible, but succeeded
at last.
- Heikki
Attachments:
v2-0001-pg_rewind-catch-corner-case-situation.patch (text/x-patch)
From 35cca347ab625d6cbf2cd8a0b7a4a57ee64a5699 Mon Sep 17 00:00:00 2001
From: Ian Barwick <ian@2ndquadrant.com>
Date: Fri, 11 Sep 2020 10:13:17 +0900
Subject: [PATCH v2 1/2] pg_rewind: catch corner-case situation
It's possible that a standby, after diverging from the source node,
is shut down without a shutdown checkpoint record, and the divergence
point matches a shutdown checkpoint from a previous shutdown. In this
case the presence of WAL records beyond the shutdown checkpoint (as
indicated by minRecoveryPoint) needs to be detected in order to
determine whether a rewind is needed.
---
src/bin/pg_rewind/pg_rewind.c | 9 +-
src/bin/pg_rewind/t/000_min_recovery_point.pl | 118 ++++++++++++++++++
2 files changed, 126 insertions(+), 1 deletion(-)
create mode 100644 src/bin/pg_rewind/t/000_min_recovery_point.pl
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 2bbf8e7438..f2f815042f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -362,13 +362,20 @@ main(int argc, char **argv)
targetNentries - 1,
restore_command);
+ /*
+ * If the minimum recovery ending location is beyond the end of
+ * the last checkpoint, that means there are WAL records beyond
+ * the divergence point and a rewind is needed.
+ */
+ if (ControlFile_target.minRecoveryPoint > chkptendrec)
+ rewind_needed = true;
/*
* If the histories diverged exactly at the end of the shutdown
* checkpoint record on the target, there are no WAL records in
* the target that don't belong in the source's history, and no
* rewind is needed.
*/
- if (chkptendrec == divergerec)
+ else if (chkptendrec == divergerec)
rewind_needed = false;
else
rewind_needed = true;
diff --git a/src/bin/pg_rewind/t/000_min_recovery_point.pl b/src/bin/pg_rewind/t/000_min_recovery_point.pl
new file mode 100644
index 0000000000..ac842fd0ed
--- /dev/null
+++ b/src/bin/pg_rewind/t/000_min_recovery_point.pl
@@ -0,0 +1,118 @@
+#
+# Test situation where a target data directory contains
+# WAL records beyond both the last checkpoint and the divergence
+# point.
+#
+# This test does not make use of RewindTest as it requires three
+# nodes.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+
+my $node_1 = get_new_node('node_1');
+$node_1->init(allows_streaming => 1);
+$node_1->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_1->start;
+
+# Add an arbitrary table
+$node_1->safe_psql('postgres',
+ 'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_1->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_2 = get_new_node('node_2');
+$node_2->init_from_backup($node_1, $backup_name,
+ has_streaming => 1);
+
+$node_2->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_2->start;
+
+# Create streaming standby from backup
+my $node_3 = get_new_node('node_3');
+$node_3->init_from_backup($node_1, $backup_name,
+ has_streaming => 1);
+
+$node_3->append_conf(
+ 'postgresql.conf', qq(
+wal_keep_size=16
+));
+
+$node_3->start;
+
+# Stop node_1
+
+$node_1->stop('fast');
+
+# Promote node_3
+$node_3->promote;
+
+# node_1 rejoins node_3
+
+my $node_3_connstr = $node_3->connstr;
+
+$node_1->append_conf(
+ 'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_1->set_standby_mode();
+$node_1->start();
+
+# node_2 follows node_3
+
+$node_2->append_conf(
+ 'postgresql.conf', qq(
+primary_conninfo='$node_3_connstr'
+));
+$node_2->restart();
+
+# Promote node_1
+
+$node_1->promote;
+
+# We now have a split-brain with two primaries. Insert a row on both to
+# demonstratively create a split brain; this is not strictly necessary
+# for this test, but creates an easily identifiable WAL record and
+# enables us to verify that node_2 has the required changes to
+# reproduce the situation we're handling.
+
+$node_1->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (0)');
+$node_3->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (1)');
+
+$node_2->poll_query_until('postgres',
+ q|SELECT COUNT(*) > 0 FROM public.foo|, 't');
+
+# At this point node_2 will shut down without a shutdown checkpoint,
+# but with WAL entries beyond the preceding shutdown checkpoint.
+$node_2->stop('fast');
+$node_3->stop('fast');
+
+
+my $node_2_pgdata = $node_2->data_dir;
+my $node_1_connstr = $node_1->connstr;
+
+command_checks_all(
+ [
+ 'pg_rewind',
+ "--source-server=$node_1_connstr",
+ "--target-pgdata=$node_2_pgdata",
+ '--dry-run'
+ ],
+ 0,
+ [],
+ [qr|rewinding from last common checkpoint at|],
+ 'pg_rewind detects rewind needed');
+
--
2.20.1
v2-0002-Correctly-scan-all-WAL-in-target-if-it-was-shut-d.patch (text/x-patch)
From de487455adae3610310101760739af178e80e446 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 2 Dec 2020 13:04:37 +0200
Subject: [PATCH v2 2/2] Correctly scan all WAL in target, if it was shut down
in recovery.
If minRecoveryPoint is beyond the last checkpoint, we need to scan the
WAL all the way up to minRecoveryPoint.
---
src/bin/pg_rewind/parsexlog.c | 11 ++-
src/bin/pg_rewind/pg_rewind.c | 63 ++++++------
src/bin/pg_rewind/t/000_min_recovery_point.pl | 98 ++++++++++++++-----
3 files changed, 120 insertions(+), 52 deletions(-)
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index eae1797f94..9275cba51b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -55,6 +55,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader,
* Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
* index 'tliIndex' in target timeline history, until 'endpoint'. Make note of
* the data blocks touched by the WAL records, and return them in a page map.
+ *
+ * 'endpoint' is the end of the last record to read. The record starting at
+ * 'endpoint' is the first one that is not read.
*/
void
extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
@@ -93,7 +96,13 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
extractPageInfo(xlogreader);
- } while (xlogreader->ReadRecPtr != endpoint);
+ } while (xlogreader->EndRecPtr < endpoint);
+
+ /*
+ * If 'endpoint' didn't point exactly at a record boundary, the caller
+ * messed up.
+ */
+ Assert(xlogreader->EndRecPtr == endpoint);
XLogReaderFree(xlogreader);
if (xlogreadfd != -1)
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index f2f815042f..9791ebac72 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -129,6 +129,7 @@ main(int argc, char **argv)
XLogRecPtr chkptrec;
TimeLineID chkpttli;
XLogRecPtr chkptredo;
+ XLogRecPtr target_wal_endrec;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -335,50 +336,56 @@ main(int argc, char **argv)
{
pg_log_info("source and target cluster are on the same timeline");
rewind_needed = false;
+ target_wal_endrec = 0;
}
else
{
+ XLogRecPtr chkptendrec;
+
findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
(uint32) (divergerec >> 32), (uint32) divergerec,
targetHistory[lastcommontliIndex].tli);
+ /*
+ * Determine the end-of-WAL on the target.
+ *
+ * The WAL ends at the last shutdown checkpoint, or at minRecoveryPoint if
+ * it was a standby. (If we supported rewinding a server that was not shut
+ * down cleanly, we would need to replay until we reach the first invalid
+ * record, like crash recovery does.)
+ */
+
+ /* read the checkpoint record on the target to see where it ends. */
+ chkptendrec = readOneRecord(datadir_target,
+ ControlFile_target.checkPoint,
+ targetNentries - 1,
+ restore_command);
+
+ if (ControlFile_target.minRecoveryPoint > chkptendrec)
+ {
+ target_wal_endrec = ControlFile_target.minRecoveryPoint;
+ }
+ else
+ {
+ target_wal_endrec = chkptendrec;
+ }
+
/*
* Check for the possibility that the target is in fact a direct
* ancestor of the source. In that case, there is no divergent history
* in the target that needs rewinding.
*/
- if (ControlFile_target.checkPoint >= divergerec)
+ if (target_wal_endrec > divergerec)
{
rewind_needed = true;
}
else
{
- XLogRecPtr chkptendrec;
-
- /* Read the checkpoint record on the target to see where it ends. */
- chkptendrec = readOneRecord(datadir_target,
- ControlFile_target.checkPoint,
- targetNentries - 1,
- restore_command);
+ /* the last common checkpoint record must be part of the target WAL */
+ Assert(target_wal_endrec == divergerec);
- /*
- * If the minimum recovery ending location is beyond the end of
- * the last checkpoint, that means there are WAL records beyond
- * the divergence point and a rewind is needed.
- */
- if (ControlFile_target.minRecoveryPoint > chkptendrec)
- rewind_needed = true;
- /*
- * If the histories diverged exactly at the end of the shutdown
- * checkpoint record on the target, there are no WAL records in
- * the target that don't belong in the source's history, and no
- * rewind is needed.
- */
- else if (chkptendrec == divergerec)
- rewind_needed = false;
- else
- rewind_needed = true;
+ rewind_needed = false;
}
}
@@ -414,14 +421,12 @@ main(int argc, char **argv)
/*
* Read the target WAL from last checkpoint before the point of fork, to
* extract all the pages that were modified on the target cluster after
- * the fork. We can stop reading after reaching the final shutdown record.
- * XXX: If we supported rewinding a server that was not shut down cleanly,
- * we would need to replay until the end of WAL here.
+ * the fork.
*/
if (showprogress)
pg_log_info("reading WAL in target");
extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
- ControlFile_target.checkPoint, restore_command);
+ target_wal_endrec, restore_command);
/*
* We have collected all information we need from both systems. Decide
diff --git a/src/bin/pg_rewind/t/000_min_recovery_point.pl b/src/bin/pg_rewind/t/000_min_recovery_point.pl
index ac842fd0ed..84b0e947c1 100644
--- a/src/bin/pg_rewind/t/000_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/000_min_recovery_point.pl
@@ -1,7 +1,28 @@
#
# Test situation where a target data directory contains
# WAL records beyond both the last checkpoint and the divergence
-# point.
+# point:
+#
+# Target WAL (TLI 2):
+#
+# backup ... Checkpoint A ... INSERT 'rewind this'
+# (TLI 1 -> 2)
+#
+# ^ last common ^ minRecoveryPoint
+# checkpoint
+#
+# Source WAL (TLI 3):
+#
+# backup ... Checkpoint A ... Checkpoint B ... INSERT 'keep this'
+# (TLI 1 -> 2) (TLI 2 -> 3)
+#
+#
+# The last common checkpoint is Checkpoint A. But there is WAL on TLI 2
+# after the last common checkpoint that needs to be rewound. We used to
+# have a bug where minRecoveryPoint was ignored, and pg_rewind concluded
+# that the target doesn't need rewinding in this scenario, because the
+# last checkpoint on the target TLI was an ancestor of the source TLI.
+#
#
# This test does not make use of RewindTest as it requires three
# nodes.
@@ -10,20 +31,26 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 2;
+use Test::More tests => 3;
+
+use File::Copy;
+
+my $tmp_folder = TestLib::tempdir;
my $node_1 = get_new_node('node_1');
$node_1->init(allows_streaming => 1);
$node_1->append_conf(
'postgresql.conf', qq(
-wal_keep_size=16
+wal_keep_size='100 MB'
));
$node_1->start;
-# Add an arbitrary table
-$node_1->safe_psql('postgres',
- 'CREATE TABLE public.foo (id INT)');
+# Create a couple of test tables
+$node_1->safe_psql('postgres', 'CREATE TABLE public.foo (t TEXT)');
+$node_1->safe_psql('postgres', 'CREATE TABLE public.bar (t TEXT)');
+$node_1->safe_psql('postgres', "INSERT INTO public.bar VALUES ('in both')");
+
# Take backup
my $backup_name = 'my_backup';
@@ -84,35 +111,62 @@ $node_2->restart();
$node_1->promote;
# We now have a split-brain with two primaries. Insert a row on both to
-# demonstratively create a split brain; this is not strictly necessary
-# for this test, but creates an easily identifiable WAL record and
-# enables us to verify that node_2 has the required changes to
-# reproduce the situation we're handling.
-
-$node_1->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (0)');
-$node_3->safe_psql('postgres', 'INSERT INTO public.foo (id) VALUES (1)');
-
+# demonstratively create a split brain. After the rewind, we should only
+# see the insert on 1, as the insert on node 3 is rewound away.
+$node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('keep this')");
+
+# Insert more rows in node 1, to bump up the XID counter. Otherwise, if
+# rewind doesn't correctly rewind the changes made on the other node,
+# we might fail to notice if the inserts are invisible because the XIDs
+# are not marked as committed.
+$node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('and this')");
+$node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('and this too')");
+
+# Also insert a row in 'bar' on node 3. It is unmodified in node 1, so it won't get
+# overwritten by replaying the WAL from node 1.
+$node_3->safe_psql('postgres', "INSERT INTO public.bar (t) VALUES ('rewind this')");
+
+# Wait for node 2 to catch up
$node_2->poll_query_until('postgres',
- q|SELECT COUNT(*) > 0 FROM public.foo|, 't');
+ q|SELECT COUNT(*) > 1 FROM public.bar|, 't');
# At this point node_2 will shut down without a shutdown checkpoint,
# but with WAL entries beyond the preceding shutdown checkpoint.
$node_2->stop('fast');
$node_3->stop('fast');
-
my $node_2_pgdata = $node_2->data_dir;
my $node_1_connstr = $node_1->connstr;
-command_checks_all(
+# Keep a temporary postgresql.conf or it would be overwritten during the rewind.
+copy(
+ "$node_2_pgdata/postgresql.conf",
+ "$tmp_folder/node_2-postgresql.conf.tmp");
+
+command_ok(
[
'pg_rewind',
"--source-server=$node_1_connstr",
"--target-pgdata=$node_2_pgdata",
- '--dry-run'
+ '--write-recovery-conf'
],
- 0,
- [],
- [qr|rewinding from last common checkpoint at|],
- 'pg_rewind detects rewind needed');
+ 'pg_rewind detects rewind needed');
+
+# Now move back postgresql.conf with old settings
+move(
+ "$tmp_folder/node_2-postgresql.conf.tmp",
+ "$node_2_pgdata/postgresql.conf");
+
+$node_2->start;
+# Check contents of the test tables after rewind. The rows inserted in node 3
+# before rewind should've been overwritten with the data from node 1.
+my $result;
+$result = $node_2->safe_psql('postgres', 'checkpoint');
+$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.foo');
+is($result, qq(keep this
+and this
+and this too), 'table foo after rewind');
+
+$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
+is($result, qq(in both), 'table bar after rewind');
--
2.20.1
On 02/12/2020 20:13, Heikki Linnakangas wrote:
On 01/12/2020 16:52, Pavel Borisov wrote:
Status update for a commitfest entry.
The patch is Waiting on Author for some time. As this is a bug fix,
I am
moving it to the next CF.
Ian, are you planning to continue working on it?
As a reviewer, I consider the patch useful and good overall. The comments I
left were purely cosmetic. It's a pity that this bugfix has been delayed, and
has become outdated, for such a small reason. It would be nice to complete
this fix in the next CF.
Yeah, we really should fix this..
On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
Also, I think Heikki's notion could be fulfilled.
I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.
Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.
Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.
Regards
Ian Barwick
--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/12/2020 15:26, Ian Barwick wrote:
On 02/12/2020 20:13, Heikki Linnakangas wrote:
Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.
Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.
Ok, pushed and backpatched this now.
Thanks!
- Heikki
Ok, pushed and backpatched this now.
Very nice!
Thanks to you all!
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On 03/12/2020 16:49, Pavel Borisov wrote:
Ok, pushed and backpatched this now.
Very nice!
Thanks to you all!
Thanks for the review, Pavel! I just realized that I forgot to credit
you in the commit message. I'm sorry.
- Heikki
On Thu, 3 Dec 2020 at 19:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/12/2020 16:49, Pavel Borisov wrote:
Ok, pushed and backpatched this now.
Very nice!
Thanks to you all!
Thanks for the review, Pavel! I just realized that I forgot to credit
you in the commit message. I'm sorry.
Don't worry, Heikki. No problem.
--
Best regards,
Pavel Borisov
Postgres Professional: http://postgrespro.com
On 03/12/2020 16:10, Heikki Linnakangas wrote:
On 02/12/2020 15:26, Ian Barwick wrote:
On 02/12/2020 20:13, Heikki Linnakangas wrote:
Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.
Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.
Ok, pushed and backpatched this now.
The buildfarm is reporting sporadic failures in the new regression test.
I suspect it's because of timing issues, where a server is promoted or
shut down before some data has been replicated. I'll fix that tomorrow
morning.
- Heikki
On 04/12/2020 00:16, Heikki Linnakangas wrote:
On 03/12/2020 16:10, Heikki Linnakangas wrote:
On 02/12/2020 15:26, Ian Barwick wrote:
On 02/12/2020 20:13, Heikki Linnakangas wrote:
Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.
Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.
Ok, pushed and backpatched this now.
The buildfarm is reporting sporadic failures in the new regression test.
I suspect it's because of timing issues, where a server is promoted or
shut down before some data has been replicated. I'll fix that tomorrow
morning.
Fixed, I hope. It took me a while to backpatch, because small
differences were needed in almost all versions, as some helpful TAP
test helpers, like waiting for a standby to catch up, are not available
in the back-branches.
There was one curious difference between versions 9.6 and 10. In v10,
you can perform a "clean switchover" like this:
1. Shut down primary (A) with "pg_ctl -m fast".
2. Promote the standby (B) with "pg_ctl promote".
3. Reconfigure the old primary (A) as a standby, by creating
recovery.conf that points to the promoted server, and start it up.
But on 9.6, that leads to an error on the repurposed primary server (A):
LOG: primary server contains no more WAL on requested timeline 1
LOG: new timeline 2 forked off current database system timeline 1
before current recovery point 0/30000A0
It's not clear to me why that is. It seems that the primary generates
some WAL at shutdown that doesn't get replicated before the shutdown
completes. Or the standby doesn't replay that WAL before it's promoted.
But we have supported "clean switchover" since 9.1, see commit
985bd7d497. When you shut down the primary, it should wait until all the
WAL has been replicated, including the shutdown checkpoint.
Perhaps I was just doing it wrong in the test. Or maybe there's a
genuine bug there that was fixed in v10. I worked around it in the
test by re-initializing the old primary from backup instead of just
reconfiguring it as a standby; that's good enough for this
particular test, so I'm not planning to dig deeper into it myself.
- Heikki