pg_rewind race condition just after promotion

Started by Heikki Linnakangasabout 5 years ago9 messages

hlinnaka@iki.fi

about 5 years ago

1 attachment(s)

There's a race condition between the checkpoint at promotion and
pg_rewind. When a server is promoted, the startup process writes an
end-of-recovery checkpoint that includes the new TLI, and the server is
immediate opened for business. The startup process requests the
checkpointer process to perform a checkpoint, but it can take a few
seconds or more to complete. If you run pg_rewind, using the just
promoted server as the source, pg_rewind will think that the server is
still on the old timeline, because it only looks at TLI in the control
file's copy of the checkpoint record. That's not updated until the
checkpoint is finished.

This isn't a new issue. Stephen Frost first reported it back 2015 [1]/messages/by-id/20150428180253.GU30322@tamriel.snowman.net.
Back then, it was deemed just a small annoyance, and we just worked
around it in the tests by issuing a checkpoint command after promotion,
to wait for the checkpoint to finish. I just ran into it again today,
with the new pg_rewind test, and silenced it in the similar way.

I think we should fix this properly. I'm not sure if it can lead to a
broken cluster, but at least it can cause pg_rewind to fail
unnecessarily and in a user-unfriendly way. But this is actually pretty
simple to fix. pg_rewind looks at the control file to find out the
timeline the server is on. When promotion happens, the startup process
updates minRecoveryPoint and minRecoveryPointTLI fields in the control
file. We just need to read it from there. Patch attached.

I think we should also backpatch this. Back in 2015, we decided that we
can live with this, but it's always been a bit bogus, and seems simple
enough to fix.

Thoughts?

[1]: /messages/by-id/20150428180253.GU30322@tamriel.snowman.net
/messages/by-id/20150428180253.GU30322@tamriel.snowman.net

- Heikki

Attachments:

0001-pg_rewind-Fix-determining-TLI-when-server-was-just-p.patchtext/x-patch; charset=UTF-8; name=0001-pg_rewind-Fix-determining-TLI-when-server-was-just-p.patchDownload

From 274c59ed0e6573dd387a03e36eda15cb41658447 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 7 Dec 2020 19:59:50 +0200
Subject: [PATCH 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at the minRecoveryPointTLI in
the control file in addition to the ThisTimeLineID on the checkpoint.

Discussion: TODO
---
 src/bin/pg_rewind/pg_rewind.c                 | 55 +++++++++++--------
 src/bin/pg_rewind/t/007_standby_source.pl     |  1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  9 ---
 src/bin/pg_rewind/t/RewindTest.pm             |  8 ---
 4 files changed, 33 insertions(+), 40 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index d89c08f81d..8eb9b378dc 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -44,6 +44,7 @@ static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source, int *nentries);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
@@ -67,9 +68,11 @@ bool		dry_run = false;
 bool		do_sync = true;
 bool		restore_wal = false;
 
-/* Target history */
+/* Source target timeline histories */
 TimeLineHistoryEntry *targetHistory;
 int			targetNentries;
+static TimeLineHistoryEntry *sourceHistory;
+static int	sourceNentries;
 
 /* Progress counters */
 uint64		fetch_size;
@@ -129,6 +132,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -325,14 +330,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline
+	 * is in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+					 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+					 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -342,6 +361,10 @@ main(int argc, char **argv)
 	{
 		XLogRecPtr	chkptendrec;
 
+		/* Retrieve timelines for both source and target */
+		sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries);
+		targetHistory = getTimelineHistory(target_tli, false, &targetNentries);
+
 		findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
 		pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
 					(uint32) (divergerec >> 32), (uint32) divergerec,
@@ -651,7 +674,8 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+			endtli = Max(ControlFile_source_after.checkPointCopy.ThisTimeLineID,
+						 ControlFile_source_after.minRecoveryPointTLI);
 		}
 	}
 	else
@@ -802,12 +826,9 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
  * either source or target.
  */
 static TimeLineHistoryEntry *
-getTimelineHistory(ControlFileData *controlFile, int *nentries)
+getTimelineHistory(TimeLineID tli, bool is_source, int *nentries)
 {
 	TimeLineHistoryEntry *history;
-	TimeLineID	tli;
-
-	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
@@ -828,12 +849,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		TLHistoryFilePath(path, tli);
 
 		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file");
+			histfile = slurpFile(datadir_target, path, NULL);
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);
@@ -843,12 +862,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 	{
 		int			i;
 
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			pg_log_debug("Source timeline history:");
-		else if (controlFile == &ControlFile_target)
-			pg_log_debug("Target timeline history:");
 		else
-			Assert(false);
+			pg_log_debug("Target timeline history:");
 
 		/*
 		 * Print the target timeline history.
@@ -881,15 +898,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 static void
 findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 7a597bf12b..82b9fc39d7 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -80,7 +80,6 @@ $node_b->wait_for_catchup('node_c', 'write', $lsn);
 # A (primary) <--- B (standby)      C (primary)
 
 $node_c->promote;
-$node_c->safe_psql('postgres', "checkpoint");
 
 
 # Insert a row in A. This causes A/B and C to have "diverged", so that it's
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e3d94b3346..60d74a7e1a 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -75,13 +75,6 @@ $node_1->wait_for_catchup('node_3', 'replay', $lsn);
 #
 $node_1->stop('fast');
 $node_3->promote;
-# Force a checkpoint after the promotion. pg_rewind looks at the control
-# file to determine what timeline the server is on, and that isn't updated
-# immediately at promotion, but only at the next checkpoint. When running
-# pg_rewind in remote mode, it's possible that we complete the test steps
-# after promotion so quickly that when pg_rewind runs, the standby has not
-# performed a checkpoint after promotion yet.
-$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 my $node_3_connstr = $node_3->connstr;
@@ -106,8 +99,6 @@ $lsn = $node_3->lsn('insert');
 $node_3->wait_for_catchup('node_1', 'replay', $lsn);
 
 $node_1->promote;
-# Force a checkpoint after promotion, like earlier.
-$node_1->safe_psql('postgres', "checkpoint");
 
 #
 # We now have a split-brain with two primaries. Insert a row on both to
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 41ed7d4b3b..e2041e6fc0 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -206,14 +206,6 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
-
 	return;
 }
 
-- 
2.20.1

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 5 years ago

In reply to: Heikki Linnakangas (#1)

Re: pg_rewind race condition just after promotion

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

There's a race condition between the checkpoint at promotion and
pg_rewind. When a server is promoted, the startup process writes an
end-of-recovery checkpoint that includes the new TLI, and the server
is immediate opened for business. The startup process requests the
checkpointer process to perform a checkpoint, but it can take a few
seconds or more to complete. If you run pg_rewind, using the just
promoted server as the source, pg_rewind will think that the server is
still on the old timeline, because it only looks at TLI in the control
file's copy of the checkpoint record. That's not updated until the
checkpoint is finished.

This isn't a new issue. Stephen Frost first reported it back 2015
[1]. Back then, it was deemed just a small annoyance, and we just
worked around it in the tests by issuing a checkpoint command after
promotion, to wait for the checkpoint to finish. I just ran into it
again today, with the new pg_rewind test, and silenced it in the
similar way.

I (or we) faced that and avoided that by checking for history file on
the primary.

I think we should fix this properly. I'm not sure if it can lead to a
broken cluster, but at least it can cause pg_rewind to fail
unnecessarily and in a user-unfriendly way. But this is actually
pretty simple to fix. pg_rewind looks at the control file to find out
the timeline the server is on. When promotion happens, the startup
process updates minRecoveryPoint and minRecoveryPointTLI fields in the
control file. We just need to read it from there. Patch attached.

Looks fine to me. A bit concerned about making sourceHistory
needlessly file-local but on the other hand unifying sourceHistory and
targetHistory looks better.

For the test part, that change doesn't necessariry catch the failure
of the current version, but I *believe* the prevous code is the result
of an actual failure in the past so the test probablistically (or
dependently on platforms?) hits the failure if it happned.

I think we should also backpatch this. Back in 2015, we decided that
we can live with this, but it's always been a bit bogus, and seems
simple enough to fix.

I don't think this changes any successful behavior and it just saves
the failure case so +1 for back-patching.

Thoughts?

[1]
/messages/by-id/20150428180253.GU30322@tamriel.snowman.net

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Heikki Linnakangas

hlinnaka@iki.fi

about 5 years ago

In reply to: Kyotaro Horiguchi (#2)

1 attachment(s)

Re: pg_rewind race condition just after promotion

On 08/12/2020 06:45, Kyotaro Horiguchi wrote:

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

I think we should fix this properly. I'm not sure if it can lead to a
broken cluster, but at least it can cause pg_rewind to fail
unnecessarily and in a user-unfriendly way. But this is actually
pretty simple to fix. pg_rewind looks at the control file to find out
the timeline the server is on. When promotion happens, the startup
process updates minRecoveryPoint and minRecoveryPointTLI fields in the
control file. We just need to read it from there. Patch attached.

Looks fine to me. A bit concerned about making sourceHistory
needlessly file-local but on the other hand unifying sourceHistory and
targetHistory looks better.

Looking closer, findCommonAncestorTimeline() was freeing sourceHistory,
which was pretty horrible when it's a file-local variable. I changed it
so that both the source and target histories are passed to
findCommonAncestorTimeline() as arguments. That seems more clear.

For the test part, that change doesn't necessariry catch the failure
of the current version, but I *believe* the prevous code is the result
of an actual failure in the past so the test probablistically (or
dependently on platforms?) hits the failure if it happned.

Right. I think the current test coverage is good enough. We've been
bitten by this a few times by now, when we've forgotten to add the
manual checkpoint commands to new tests, and the buildfarm has caught it
pretty quickly.

I think we should also backpatch this. Back in 2015, we decided that
we can live with this, but it's always been a bit bogus, and seems
simple enough to fix.

I don't think this changes any successful behavior and it just saves
the failure case so +1 for back-patching.

Thanks for the review! New patch version attached.

- Heikki

Attachments:

v2-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchtext/x-patch; charset=UTF-8; name=v2-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchDownload

From 5894d560ca4a4c97b805cfb2c396cedd05f5a19c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 9 Dec 2020 15:28:59 +0200
Subject: [PATCH v2 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at the minRecoveryPointTLI in
the control file in addition to the ThisTimeLineID on the checkpoint.

This has been a known issue since forever, and we had worked around it
in the regression tests by issuing a checkpoint after each promotion,
before running pg_rewind. But that was always quite hacky, so better
to fix this properly. This doesn't add any new tests for this, but
removes the previously-added workarounds from the existing tests, so
that they should occasionally hit this codepath again.

Backpatch to all supported versions.

Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi
---
 src/bin/pg_rewind/pg_rewind.c                 | 96 ++++++++++++-------
 src/bin/pg_rewind/t/007_standby_source.pl     |  1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  9 --
 src/bin/pg_rewind/t/RewindTest.pm             |  8 --
 4 files changed, 60 insertions(+), 54 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index d89c08f81d..8cb3bdac48 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -44,7 +44,13 @@ static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
-static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source,
+												int *nentries);
+static void findCommonAncestorTimeline(TimeLineHistoryEntry *sourceHistory,
+									   int sourceNentries,
+									   TimeLineHistoryEntry *targetHistory,
+									   int targetNentries,
+									   XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
 
@@ -129,6 +135,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -325,14 +333,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline is
+	 * in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+					 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+					 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -341,12 +363,29 @@ main(int argc, char **argv)
 	else
 	{
 		XLogRecPtr	chkptendrec;
+		TimeLineHistoryEntry *sourceHistory;
+		int			sourceNentries;
 
-		findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
+		/*
+		 * Retrieve timelines for both source and target, and find the point
+		 * where they diverged.
+		 */
+		sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries);
+		targetHistory = getTimelineHistory(target_tli, false, &targetNentries);
+
+		findCommonAncestorTimeline(sourceHistory, sourceNentries,
+								   targetHistory, targetNentries,
+								   &divergerec, &lastcommontliIndex);
 		pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
 					(uint32) (divergerec >> 32), (uint32) divergerec,
 					targetHistory[lastcommontliIndex].tli);
 
+		/*
+		 * Don't need the source history anymore. The target history is still
+		 * needed by the routines in parsexlog.c, when we read the target WAL.
+		 */
+		pfree(sourceHistory);
+
 		/*
 		 * Determine the end-of-WAL on the target.
 		 *
@@ -651,7 +690,8 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+			endtli = Max(ControlFile_source_after.checkPointCopy.ThisTimeLineID,
+						 ControlFile_source_after.minRecoveryPointTLI);
 		}
 	}
 	else
@@ -798,16 +838,12 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve timeline history for the source or target system.
  */
 static TimeLineHistoryEntry *
-getTimelineHistory(ControlFileData *controlFile, int *nentries)
+getTimelineHistory(TimeLineID tli, bool is_source, int *nentries)
 {
 	TimeLineHistoryEntry *history;
-	TimeLineID	tli;
-
-	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
@@ -828,12 +864,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		TLHistoryFilePath(path, tli);
 
 		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file");
+			histfile = slurpFile(datadir_target, path, NULL);
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);
@@ -843,12 +877,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 	{
 		int			i;
 
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			pg_log_debug("Source timeline history:");
-		else if (controlFile == &ControlFile_target)
-			pg_log_debug("Target timeline history:");
 		else
-			Assert(false);
+			pg_log_debug("Target timeline history:");
 
 		/*
 		 * Print the target timeline history.
@@ -869,27 +901,21 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 
 /*
  * Determine the TLI of the last common timeline in the timeline history of the
- * two clusters. targetHistory is filled with target timeline history and
- * targetNentries is number of items in targetHistory. *tliIndex is set to the
- * index of last common timeline in targetHistory array, and *recptr is set to
- * the position where the timeline history diverged (ie. the first WAL record
- * that's not the same in both clusters).
- *
- * Control files of both clusters must be read into ControlFile_target/source
- * before calling this routine.
+ * two clusters. *tliIndex is set to the index of last common timeline in
+ * targetHistory array, and *recptr is set to the position where the timeline
+ * history diverged (ie. the first WAL record that's not the same in both
+ * clusters).
  */
 static void
-findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
+findCommonAncestorTimeline(TimeLineHistoryEntry *sourceHistory,
+						   int sourceNentries,
+						   TimeLineHistoryEntry *targetHistory,
+						   int targetNentries,
+						   XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -911,8 +937,6 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 		i--;
 		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
 		*tliIndex = i;
-
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 7a597bf12b..82b9fc39d7 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -80,7 +80,6 @@ $node_b->wait_for_catchup('node_c', 'write', $lsn);
 # A (primary) <--- B (standby)      C (primary)
 
 $node_c->promote;
-$node_c->safe_psql('postgres', "checkpoint");
 
 
 # Insert a row in A. This causes A/B and C to have "diverged", so that it's
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e3d94b3346..60d74a7e1a 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -75,13 +75,6 @@ $node_1->wait_for_catchup('node_3', 'replay', $lsn);
 #
 $node_1->stop('fast');
 $node_3->promote;
-# Force a checkpoint after the promotion. pg_rewind looks at the control
-# file to determine what timeline the server is on, and that isn't updated
-# immediately at promotion, but only at the next checkpoint. When running
-# pg_rewind in remote mode, it's possible that we complete the test steps
-# after promotion so quickly that when pg_rewind runs, the standby has not
-# performed a checkpoint after promotion yet.
-$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 my $node_3_connstr = $node_3->connstr;
@@ -106,8 +99,6 @@ $lsn = $node_3->lsn('insert');
 $node_3->wait_for_catchup('node_1', 'replay', $lsn);
 
 $node_1->promote;
-# Force a checkpoint after promotion, like earlier.
-$node_1->safe_psql('postgres', "checkpoint");
 
 #
 # We now have a split-brain with two primaries. Insert a row on both to
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 41ed7d4b3b..e2041e6fc0 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -206,14 +206,6 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
-
 	return;
 }
 
-- 
2.20.1

Ibrar Ahmed

ibrar.ahmad@gmail.com

almost 5 years ago

In reply to: Heikki Linnakangas (#3)

1 attachment(s)

Re: pg_rewind race condition just after promotion

On Wed, Dec 9, 2020 at 6:35 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 08/12/2020 06:45, Kyotaro Horiguchi wrote:

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi>

wrote in

I think we should fix this properly. I'm not sure if it can lead to a
broken cluster, but at least it can cause pg_rewind to fail
unnecessarily and in a user-unfriendly way. But this is actually
pretty simple to fix. pg_rewind looks at the control file to find out
the timeline the server is on. When promotion happens, the startup
process updates minRecoveryPoint and minRecoveryPointTLI fields in the
control file. We just need to read it from there. Patch attached.

Looks fine to me. A bit concerned about making sourceHistory
needlessly file-local but on the other hand unifying sourceHistory and
targetHistory looks better.

Looking closer, findCommonAncestorTimeline() was freeing sourceHistory,
which was pretty horrible when it's a file-local variable. I changed it
so that both the source and target histories are passed to
findCommonAncestorTimeline() as arguments. That seems more clear.

For the test part, that change doesn't necessariry catch the failure
of the current version, but I *believe* the prevous code is the result
of an actual failure in the past so the test probablistically (or
dependently on platforms?) hits the failure if it happned.

Right. I think the current test coverage is good enough. We've been
bitten by this a few times by now, when we've forgotten to add the
manual checkpoint commands to new tests, and the buildfarm has caught it
pretty quickly.

I think we should also backpatch this. Back in 2015, we decided that
we can live with this, but it's always been a bit bogus, and seems
simple enough to fix.

I don't think this changes any successful behavior and it just saves
the failure case so +1 for back-patching.

Thanks for the review! New patch version attached.

- Heikki

The patch does not apply successfully

http://cfbot.cputube.org/patch_32_2864.log
1 out of 10 hunks FAILED -- saving rejects to file
src/bin/pg_rewind/pg_rewind.c.rej

There is a minor issue therefore I rebase the patch. Please take a look at
that.

--
Ibrar Ahmed

Attachments:

v3-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchapplication/octet-stream; name=v3-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchDownload

From 5894d560ca4a4c97b805cfb2c396cedd05f5a19c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 9 Dec 2020 15:28:59 +0200
Subject: [PATCH v2 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at the minRecoveryPointTLI in
the control file in addition to the ThisTimeLineID on the checkpoint.

This has been a known issue since forever, and we had worked around it
in the regression tests by issuing a checkpoint after each promotion,
before running pg_rewind. But that was always quite hacky, so better
to fix this properly. This doesn't add any new tests for this, but
removes the previously-added workarounds from the existing tests, so
that they should occasionally hit this codepath again.

Backpatch to all supported versions.

Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi
---
 src/bin/pg_rewind/pg_rewind.c                 | 96 ++++++++++++-------
 src/bin/pg_rewind/t/007_standby_source.pl     |  1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  9 --
 src/bin/pg_rewind/t/RewindTest.pm             |  8 --
 4 files changed, 60 insertions(+), 54 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 9df08ab2b0..c5f900a954 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -44,7 +44,13 @@ static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
-static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source,
+												int *nentries);
+static void findCommonAncestorTimeline(TimeLineHistoryEntry *sourceHistory,
+									   int sourceNentries,
+									   TimeLineHistoryEntry *targetHistory,
+									   int targetNentries,
+									   XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
 
@@ -129,6 +135,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -325,14 +333,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline is
+	 * in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+					 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+					 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -341,11 +363,29 @@ main(int argc, char **argv)
 	else
 	{
 		XLogRecPtr	chkptendrec;
+		TimeLineHistoryEntry *sourceHistory;
+		int         sourceNentries;
+
+		/*
+		 * Retrieve timelines for both source and target, and find the point
+		 * where they diverged.
+		 */
+		sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries);
+		targetHistory = getTimelineHistory(target_tli, false, &targetNentries);
+
+		findCommonAncestorTimeline(sourceHistory, sourceNentries,
+									targetHistory, targetNentries,
+									&divergerec, &lastcommontliIndex);
 
-		findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
 		pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
 					LSN_FORMAT_ARGS(divergerec),
 					targetHistory[lastcommontliIndex].tli);
+		/*
+		 * Don't need the source history anymore. The target history is still
+		 * needed by the routines in parsexlog.c, when we read the target WAL.
+		 */
+		pfree(sourceHistory);
+
 
 		/*
 		 * Determine the end-of-WAL on the target.
@@ -650,7 +690,8 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+			endtli = Max(ControlFile_source_after.checkPointCopy.ThisTimeLineID,
+						 ControlFile_source_after.minRecoveryPointTLI);
 		}
 	}
 	else
@@ -797,16 +838,12 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve timeline history for the source or target system.
  */
 static TimeLineHistoryEntry *
-getTimelineHistory(ControlFileData *controlFile, int *nentries)
+getTimelineHistory(TimeLineID tli, bool is_source, int *nentries)
 {
 	TimeLineHistoryEntry *history;
-	TimeLineID	tli;
-
-	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
@@ -827,12 +864,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		TLHistoryFilePath(path, tli);
 
 		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file");
+			histfile = slurpFile(datadir_target, path, NULL);
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);
@@ -842,12 +877,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 	{
 		int			i;
 
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			pg_log_debug("Source timeline history:");
-		else if (controlFile == &ControlFile_target)
-			pg_log_debug("Target timeline history:");
 		else
-			Assert(false);
+			pg_log_debug("Target timeline history:");
 
 		/*
 		 * Print the target timeline history.
@@ -868,27 +901,21 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 
 /*
  * Determine the TLI of the last common timeline in the timeline history of the
- * two clusters. targetHistory is filled with target timeline history and
- * targetNentries is number of items in targetHistory. *tliIndex is set to the
- * index of last common timeline in targetHistory array, and *recptr is set to
- * the position where the timeline history diverged (ie. the first WAL record
- * that's not the same in both clusters).
- *
- * Control files of both clusters must be read into ControlFile_target/source
- * before calling this routine.
+ * two clusters. *tliIndex is set to the index of last common timeline in
+ * targetHistory array, and *recptr is set to the position where the timeline
+ * history diverged (ie. the first WAL record that's not the same in both
+ * clusters).
  */
 static void
-findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
+findCommonAncestorTimeline(TimeLineHistoryEntry *sourceHistory,
+						   int sourceNentries,
+						   TimeLineHistoryEntry *targetHistory,
+						   int targetNentries,
+						   XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -910,8 +937,6 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 		i--;
 		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
 		*tliIndex = i;
-
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 7a597bf12b..82b9fc39d7 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -80,7 +80,6 @@ $node_b->wait_for_catchup('node_c', 'write', $lsn);
 # A (primary) <--- B (standby)      C (primary)
 
 $node_c->promote;
-$node_c->safe_psql('postgres', "checkpoint");
 
 
 # Insert a row in A. This causes A/B and C to have "diverged", so that it's
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e3d94b3346..60d74a7e1a 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -75,13 +75,6 @@ $node_1->wait_for_catchup('node_3', 'replay', $lsn);
 #
 $node_1->stop('fast');
 $node_3->promote;
-# Force a checkpoint after the promotion. pg_rewind looks at the control
-# file to determine what timeline the server is on, and that isn't updated
-# immediately at promotion, but only at the next checkpoint. When running
-# pg_rewind in remote mode, it's possible that we complete the test steps
-# after promotion so quickly that when pg_rewind runs, the standby has not
-# performed a checkpoint after promotion yet.
-$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 my $node_3_connstr = $node_3->connstr;
@@ -106,8 +99,6 @@ $lsn = $node_3->lsn('insert');
 $node_3->wait_for_catchup('node_1', 'replay', $lsn);
 
 $node_1->promote;
-# Force a checkpoint after promotion, like earlier.
-$node_1->safe_psql('postgres', "checkpoint");
 
 #
 # We now have a split-brain with two primaries. Insert a row on both to
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 41ed7d4b3b..e2041e6fc0 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -206,14 +206,6 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
-
 	return;
 }

Aleksander Alekseev

aleksander@timescale.com

over 4 years ago

In reply to: Ibrar Ahmed (#4)

Re: pg_rewind race condition just after promotion

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

Daniel Gustafsson

daniel@yesql.se

about 4 years ago

In reply to: Aleksander Alekseev (#5)

Re: pg_rewind race condition just after promotion

On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander@timescale.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

Heikki, do you have plans to address this patch during this CF?

--
Daniel Gustafsson https://vmware.com/

Ian Lawrence Barwick

barwick@gmail.com

about 3 years ago

In reply to: Daniel Gustafsson (#6)

Re: pg_rewind race condition just after promotion

2021年11月9日(火) 20:31 Daniel Gustafsson <daniel@yesql.se>:

On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander@timescale.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

Heikki, do you have plans to address this patch during this CF?

Friendly reminder ping one year on; I haven't looked at this patch in
detail but going by the thread contents it seems it should be marked
"Ready for Committer"? Moved to the next CF anyway.

Regards

Ian Barwick

Heikki Linnakangas

hlinnaka@iki.fi

almost 3 years ago

In reply to: Ian Lawrence Barwick (#7)

1 attachment(s)

Re: pg_rewind race condition just after promotion

On 11/12/2022 02:01, Ian Lawrence Barwick wrote:

2021年11月9日(火) 20:31 Daniel Gustafsson <daniel@yesql.se>:

On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander@timescale.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

Heikki, do you have plans to address this patch during this CF?

Friendly reminder ping one year on; I haven't looked at this patch in
detail but going by the thread contents it seems it should be marked
"Ready for Committer"? Moved to the next CF anyway.

Here's an updated version of the patch.

I renamed the arguments to findCommonAncestorTimeline() so that the
'targetHistory' argument doesn't shadow the global 'targetHistory'
variable. No other changes, and this still looks good to me, so I'll
wait for the cfbot to run on this and commit in the next few days.

- Heikki

Attachments:

v4-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchtext/x-patch; charset=UTF-8; name=v4-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patchDownload

From 046529d0ec039888b10ef65ff7d8228bba89666c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 22 Feb 2023 15:58:17 +0200
Subject: [PATCH v4 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at minRecoveryPointTLI in the
control file in addition to the ThisTimeLineID on the checkpoint.

This has been a known issue since forever, and we had worked around it
in the regression tests by issuing a checkpoint after each promotion,
before running pg_rewind. But that was always quite hacky, so better
to fix this properly. This doesn't add any new tests for this, but
removes the previously-added workarounds from the existing tests, so
that they should occasionally hit this codepath again.

Backpatch to all supported versions.

Reviewed-by: Kyotaro Horiguchi, Ibrar Ahmed, Aleksander Alekseev
Discussion: https://www.postgresql.org/message-id/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi
---
 src/bin/pg_rewind/pg_rewind.c                 | 105 +++++++++++-------
 src/bin/pg_rewind/t/007_standby_source.pl     |   1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |   9 --
 src/bin/pg_rewind/t/RewindTest.pm             |   8 --
 4 files changed, 64 insertions(+), 59 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 858d8d9f2f..f7f3b8227f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -45,7 +45,13 @@ static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
-static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source,
+												int *nentries);
+static void findCommonAncestorTimeline(TimeLineHistoryEntry *a_history,
+									   int a_nentries,
+									   TimeLineHistoryEntry *b_history,
+									   int b_nentries,
+									   XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
 
@@ -134,6 +140,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -332,14 +340,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline is
+	 * in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+					 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+					 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -348,12 +370,31 @@ main(int argc, char **argv)
 	else
 	{
 		XLogRecPtr	chkptendrec;
+		TimeLineHistoryEntry *sourceHistory;
+		int			sourceNentries;
+
+		/*
+		 * Retrieve timelines for both source and target, and find the point
+		 * where they diverged.
+		 */
+		sourceHistory = getTimelineHistory(source_tli, true, &sourceNentries);
+		targetHistory = getTimelineHistory(target_tli, false, &targetNentries);
+
+		findCommonAncestorTimeline(sourceHistory, sourceNentries,
+								   targetHistory, targetNentries,
+								   &divergerec, &lastcommontliIndex);
 
-		findCommonAncestorTimeline(&divergerec, &lastcommontliIndex);
 		pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
 					LSN_FORMAT_ARGS(divergerec),
 					targetHistory[lastcommontliIndex].tli);
 
+		/*
+		 * Don't need the source history anymore. The target history is still
+		 * needed by the routines in parsexlog.c, when we read the target WAL.
+		 */
+		pfree(sourceHistory);
+
+
 		/*
 		 * Determine the end-of-WAL on the target.
 		 *
@@ -654,7 +695,8 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+			endtli = Max(ControlFile_source_after.checkPointCopy.ThisTimeLineID,
+						 ControlFile_source_after.minRecoveryPointTLI);
 		}
 	}
 	else
@@ -796,16 +838,12 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve timeline history for the source or target system.
  */
 static TimeLineHistoryEntry *
-getTimelineHistory(ControlFileData *controlFile, int *nentries)
+getTimelineHistory(TimeLineID tli, bool is_source, int *nentries)
 {
 	TimeLineHistoryEntry *history;
-	TimeLineID	tli;
-
-	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
@@ -826,12 +864,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		TLHistoryFilePath(path, tli);
 
 		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
 		else
-			pg_fatal("invalid control file");
+			histfile = slurpFile(datadir_target, path, NULL);
 
 		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
 		pg_free(histfile);
@@ -841,12 +877,10 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 	{
 		int			i;
 
-		if (controlFile == &ControlFile_source)
+		if (is_source)
 			pg_log_debug("Source timeline history:");
-		else if (controlFile == &ControlFile_target)
-			pg_log_debug("Target timeline history:");
 		else
-			Assert(false);
+			pg_log_debug("Target timeline history:");
 
 		/*
 		 * Print the target timeline history.
@@ -866,28 +900,19 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 }
 
 /*
- * Determine the TLI of the last common timeline in the timeline history of the
- * two clusters. targetHistory is filled with target timeline history and
- * targetNentries is number of items in targetHistory. *tliIndex is set to the
- * index of last common timeline in targetHistory array, and *recptr is set to
- * the position where the timeline history diverged (ie. the first WAL record
- * that's not the same in both clusters).
- *
- * Control files of both clusters must be read into ControlFile_target/source
- * before calling this routine.
+ * Determine the TLI of the last common timeline in the timeline history of
+ * two clusters. *tliIndex is set to the index of last common timeline in
+ * the arrays, and *recptr is set to the position where the timeline history
+ * diverged (ie. the first WAL record that's not the same in both clusters).
  */
 static void
-findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
+findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries,
+						   TimeLineHistoryEntry *b_history, int b_nentries,
+						   XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -896,21 +921,19 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 	 * recovery processes. Hence check the start position of the new timeline
 	 * as well and move down by one extra timeline entry if they do not match.
 	 */
-	n = Min(sourceNentries, targetNentries);
+	n = Min(a_nentries, b_nentries);
 	for (i = 0; i < n; i++)
 	{
-		if (sourceHistory[i].tli != targetHistory[i].tli ||
-			sourceHistory[i].begin != targetHistory[i].begin)
+		if (a_history[i].tli != b_history[i].tli ||
+			a_history[i].begin != b_history[i].begin)
 			break;
 	}
 
 	if (i > 0)
 	{
 		i--;
-		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
+		*recptr = MinXLogRecPtr(a_history[i].end, b_history[i].end);
 		*tliIndex = i;
-
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 35fe27888e..3f813929a6 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -83,7 +83,6 @@ $node_b->wait_for_catchup('node_c', 'write', $lsn);
 # A (primary) <--- B (standby)      C (primary)
 
 $node_c->promote;
-$node_c->safe_psql('postgres', "checkpoint");
 
 
 # Insert a row in A. This causes A/B and C to have "diverged", so that it's
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index af262f3338..c753a64fdb 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -76,13 +76,6 @@ $node_1->wait_for_catchup('node_3');
 #
 $node_1->stop('fast');
 $node_3->promote;
-# Force a checkpoint after the promotion. pg_rewind looks at the control
-# file to determine what timeline the server is on, and that isn't updated
-# immediately at promotion, but only at the next checkpoint. When running
-# pg_rewind in remote mode, it's possible that we complete the test steps
-# after promotion so quickly that when pg_rewind runs, the standby has not
-# performed a checkpoint after promotion yet.
-$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 my $node_3_connstr = $node_3->connstr;
@@ -108,8 +101,6 @@ $node_2->restart();
 $node_3->wait_for_catchup('node_1');
 
 $node_1->promote;
-# Force a checkpoint after promotion, like earlier.
-$node_1->safe_psql('postgres', "checkpoint");
 
 #
 # We now have a split-brain with two primaries. Insert a row on both to
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index aa16a1b3f1..373f6dfbf7 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -198,14 +198,6 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
-
 	return;
 }
 
-- 
2.30.2

Heikki Linnakangas

hlinnaka@iki.fi

almost 3 years ago

In reply to: Heikki Linnakangas (#8)

Re: pg_rewind race condition just after promotion

On 22/02/2023 16:00, Heikki Linnakangas wrote:

On 11/12/2022 02:01, Ian Lawrence Barwick wrote:

2021年11月9日(火) 20:31 Daniel Gustafsson <daniel@yesql.se>:

On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander@timescale.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

Heikki, do you have plans to address this patch during this CF?

Friendly reminder ping one year on; I haven't looked at this patch in
detail but going by the thread contents it seems it should be marked
"Ready for Committer"? Moved to the next CF anyway.

Here's an updated version of the patch.

I renamed the arguments to findCommonAncestorTimeline() so that the
'targetHistory' argument doesn't shadow the global 'targetHistory'
variable. No other changes, and this still looks good to me, so I'll
wait for the cfbot to run on this and commit in the next few days.

Pushed. I decided not to backpatch this, after all. We haven't really
been treating this as a bug so far, and the patch didn't apply cleanly
to v13 and before.

- Heikki