pg_rewind: warn when checkpoint hasn't happened after promotion

Started by James Colemanover 3 years ago21 messages
#1James Coleman
jtc331@gmail.com
1 attachment(s)

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

Thanks,
James Coleman

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Attachments:

v1-0001-pg_rewind-warn-if-we-haven-t-checkpointed-after-p.patchapplication/octet-stream; name=v1-0001-pg_rewind-warn-if-we-haven-t-checkpointed-after-p.patchDownload
From 47ae21af8fd2d685987b10c3195e4897c1f81976 Mon Sep 17 00:00:00 2001
From: jcoleman <jtc331@gmail.com>
Date: Fri, 3 Jun 2022 20:32:15 -0400
Subject: [PATCH v1] pg_rewind: warn if we haven't checkpointed after promote

Because pg_rewind looks at the last checkpoint to determine the timeline
on the source server it can incorrectly report that the timelines are
the same if the source is a recently promoted standby that hasn't yet
completed a new checkpoint.

This is already documented, but we shouldn't tell the user a rewind
isn't necessary when it clearly is.
---
 src/bin/pg_rewind/pg_rewind.c                 | 13 +++-
 .../t/010_no_checkpoint_after_promotion.pl    | 74 +++++++++++++++++++
 src/bin/pg_rewind/t/RewindTest.pm             | 22 ++++--
 3 files changed, 101 insertions(+), 8 deletions(-)
 create mode 100644 src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..2dc396c4fd 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -337,8 +337,19 @@ main(int argc, char **argv)
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
+	 *
+	 * We must avoid assuming that a checkpoint has happened on the source or
+	 * we'll possibly conclude the timelines are the same when the source has
+	 * recently promoted.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
+	if (ControlFile_source.minRecoveryPointTLI != 0 &&
+			ControlFile_source.checkPointCopy.ThisTimeLineID !=
+			ControlFile_source.minRecoveryPointTLI)
+	{
+		pg_log_info("source cluster hasn't checkpointed since its last timeline switch");
+		exit(1);
+	}
+	else if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
 		ControlFile_source.checkPointCopy.ThisTimeLineID)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
diff --git a/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl b/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl
new file mode 100644
index 0000000000..53188b2514
--- /dev/null
+++ b/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl
@@ -0,0 +1,74 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use RewindTest;
+
+RewindTest::setup_cluster("remote");
+RewindTest::start_primary();
+
+# Create a test table and insert a row in primary.
+primary_psql("CREATE TABLE tbl1 (d text)");
+primary_psql("INSERT INTO tbl1 VALUES ('in primary')");
+
+# This test table will be used to test truncation, i.e. the table
+# is extended in the old primary after promotion
+primary_psql("CREATE TABLE trunc_tbl (d text)");
+primary_psql("INSERT INTO trunc_tbl VALUES ('in primary')");
+
+# This test table will be used to test the "copy-tail" case, i.e. the
+# table is truncated in the old primary after promotion
+primary_psql("CREATE TABLE tail_tbl (id integer, d text)");
+primary_psql("INSERT INTO tail_tbl VALUES (0, 'in primary')");
+
+# This test table is dropped in the old primary after promotion.
+primary_psql("CREATE TABLE drop_tbl (d text)");
+primary_psql("INSERT INTO drop_tbl VALUES ('in primary')");
+
+primary_psql("CHECKPOINT");
+
+RewindTest::create_standby("remote");
+
+# Insert additional data on primary that will be replicated to standby
+primary_psql("INSERT INTO tbl1 values ('in primary, before promotion')");
+primary_psql(
+  "INSERT INTO trunc_tbl values ('in primary, before promotion')");
+primary_psql(
+  "INSERT INTO tail_tbl SELECT g, 'in primary, before promotion: ' || g FROM generate_series(1, 10000) g"
+);
+
+primary_psql('CHECKPOINT');
+
+RewindTest::promote_standby('skip_checkpoint' => 1);
+
+# Insert a row in the old primary. This causes the primary and standby
+# to have "diverged", it's no longer possible to just apply the
+# standy's logs over primary directory - you need to rewind.
+primary_psql("INSERT INTO tbl1 VALUES ('in primary, after promotion')");
+
+# Also insert a new row in the standby, which won't be present in the
+# old primary.
+standby_psql("INSERT INTO tbl1 VALUES ('in standby, after promotion')");
+
+$node_primary->stop;
+
+my $primary_pgdata = $node_primary->data_dir;
+my $standby_connstr = $node_standby->connstr('postgres');
+command_fails_like(
+  [
+    'pg_rewind',       '--dry-run',
+    "--source-server", $standby_connstr,
+    '--target-pgdata', $primary_pgdata,
+    '--no-sync',       '--no-ensure-shutdown'
+  ],
+  qr/source cluster hasn\'t checkpointed since its last timeline switch/,
+  'pg_rewind no checkpoint after promotion');
+
+done_testing();
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 98b66b01f8..89fddc1656 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -188,6 +188,11 @@ primary_conninfo='$connstr_primary'
 
 sub promote_standby
 {
+	my (%options) = (
+		'skip_checkpoint' => 0,
+		@_
+	);
+
 	#### Now run the test-specific parts to run after standby has been started
 	# up standby
 
@@ -198,13 +203,16 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
+	unless ($options{'skip_checkpoint'})
+	{
+		# Force a checkpoint after the promotion. pg_rewind looks at the control
+		# file to determine what timeline the server is on, and that isn't updated
+		# immediately at promotion, but only at the next checkpoint. When running
+		# pg_rewind in remote mode, it's possible that we complete the test steps
+		# after promotion so quickly that when pg_rewind runs, the standby has not
+		# performed a checkpoint after promotion yet.
+		standby_psql("checkpoint");
+	}
 
 	return;
 }
-- 
2.32.0 (Apple Git-132)

#2Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: James Coleman (#1)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

Regards,
Bharath Rupireddy.

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#2)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

...

Attached is a patch that detects this condition and reports it as an
error to the user.

I have some random thoughts on this.

There could be a problem in the case of gracefully shutdowned
old-primary, so I think it is worth doing something if it can be in a
simple way.

However, I don't think we can simply rely on minRecoveryPoint to
detect that situation, since it won't be reset on a standby. A standby
also still can be the upstream of a cascading standby. So, as
discussed in the thread for the comment [2]/messages/by-id/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com, what we can do here would be
simply waiting for the timelineID to advance, maybe having a timeout.

In a case of single-step replication set, a checkpoint request to the
primary makes the end-of-recovery checkpoint fast. It won't work as
expected in cascading replicas, but it might be acceptable.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

At the time of the discussion [2]/messages/by-id/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com for the it was the hinderance that
that requires superuser privileges. Now that has been narrowed down
to the pg_checkpointer privileges.

If we know that the timeline IDs are different, we don't need to wait
for a checkpoint.

It seems to me that the exit status is significant. pg_rewind exits
with 1 when an invalid option is given. I don't think it is great if
we report this state by the same code.

I don't think we always want to request a non-spreading checkpoint.

[2]: /messages/by-id/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4James Coleman
jtc331@gmail.com
In reply to: Bharath Rupireddy (#2)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

That's what I had suggested as a "further improvement" option in the
last paragraph :)

But I think agreement on this more basic solution would still be good
(even if I add the automatic checkpointing in this thread); given we
currently explicitly mis-inform the user of pg_rewind, I think this is
a bug that should be considered for backpatching, and the simpler
"fail if detected" patch is probably the only thing we could
backpatch.

Thanks for taking a look,
James Coleman

#5James Coleman
jtc331@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

...

Attached is a patch that detects this condition and reports it as an
error to the user.

I have some random thoughts on this.

There could be a problem in the case of gracefully shutdowned
old-primary, so I think it is worth doing something if it can be in a
simple way.

However, I don't think we can simply rely on minRecoveryPoint to
detect that situation, since it won't be reset on a standby. A standby
also still can be the upstream of a cascading standby. So, as
discussed in the thread for the comment [2], what we can do here would be
simply waiting for the timelineID to advance, maybe having a timeout.

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replication from primary
- [needs rewind] A

and you try to rewind A from C as the source?

In a case of single-step replication set, a checkpoint request to the
primary makes the end-of-recovery checkpoint fast. It won't work as
expected in cascading replicas, but it might be acceptable.

"Won't work as expected" because there's no way to guarantee
replication is caught up or even advancing?

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

At the time of the discussion [2] for the it was the hinderance that
that requires superuser privileges. Now that has been narrowed down
to the pg_checkpointer privileges.

If we know that the timeline IDs are different, we don't need to wait
for a checkpoint.

Correct.

It seems to me that the exit status is significant. pg_rewind exits
with 1 when an invalid option is given. I don't think it is great if
we report this state by the same code.

I'm happy to change that; I only chose "1" as a placeholder for
"non-zero exit status".

I don't think we always want to request a non-spreading checkpoint.

I'm not familiar with the terminology "non-spreading checkpoint".

[2] /messages/by-id/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com

I read through that thread, and one interesting idea stuck out to me:
making "tiimeline IDs are the same" an error exit status. On the one
hand that makes a certain amount of sense because it's unexpected. But
on the other hand there are entirely legitimate situations where upon
failover the timeline IDs happen to match (e.g., for use it happens
some percentage of the time naturally as we are using sync replication
and failovers often involve STONITHing the original primary, so it's
entirely possible that the promoted replica begins with exactly the
same WAL ending LSN from the primary before it stopped).

Thanks,
James Coleman

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: James Coleman (#5)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in

On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Sat, Jun 4, 2022 at 6:29 PM James Coleman <jtc331@gmail.com> wrote:

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

...

Attached is a patch that detects this condition and reports it as an
error to the user.

I have some random thoughts on this.

There could be a problem in the case of gracefully shutdowned
old-primary, so I think it is worth doing something if it can be in a
simple way.

However, I don't think we can simply rely on minRecoveryPoint to
detect that situation, since it won't be reset on a standby. A standby
also still can be the upstream of a cascading standby. So, as
discussed in the thread for the comment [2], what we can do here would be
simply waiting for the timelineID to advance, maybe having a timeout.

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replication from primary
- [needs rewind] A

and you try to rewind A from C as the source?

Yes. I think it is a legit use case. That being said, like other
points, it might be acceptable.

In a case of single-step replication set, a checkpoint request to the
primary makes the end-of-recovery checkpoint fast. It won't work as
expected in cascading replicas, but it might be acceptable.

"Won't work as expected" because there's no way to guarantee
replication is caught up or even advancing?

Maybe no. I meant that restartpoints don't run more frequently than
the intervals of checkpoint_timeout even if checkpint records come
more frequently.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

Thanks. I had a quick look over the issue and patch - just a thought -
can't we let pg_rewind issue a checkpoint on the new primary instead
of erroring out, maybe optionally? It might sound too much, but helps
pg_rewind to be self-reliant i.e. avoiding external actor to detect
the error and issue checkpoint the new primary to be able to
successfully run pg_rewind on the pld primary and repair it to use it
as a new standby.

At the time of the discussion [2] for the it was the hinderance that
that requires superuser privileges. Now that has been narrowed down
to the pg_checkpointer privileges.

If we know that the timeline IDs are different, we don't need to wait
for a checkpoint.

Correct.

It seems to me that the exit status is significant. pg_rewind exits
with 1 when an invalid option is given. I don't think it is great if
we report this state by the same code.

I'm happy to change that; I only chose "1" as a placeholder for
"non-zero exit status".

I don't think we always want to request a non-spreading checkpoint.

I'm not familiar with the terminology "non-spreading checkpoint".

Does "immediate checkpoint" works? That is, a checkpoint that runs at
full-speed (i.e. with no delays between writes).

[2] /messages/by-id/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb+Bu=h1m=rLdn=g@mail.gmail.com

I read through that thread, and one interesting idea stuck out to me:
making "tiimeline IDs are the same" an error exit status. On the one
hand that makes a certain amount of sense because it's unexpected. But
on the other hand there are entirely legitimate situations where upon
failover the timeline IDs happen to match (e.g., for use it happens
some percentage of the time naturally as we are using sync replication
and failovers often involve STONITHing the original primary, so it's
entirely possible that the promoted replica begins with exactly the
same WAL ending LSN from the primary before it stopped).

Yes that is true for most cases unless old primary written some
records that had not sent to the standby before its death. So if we
don't inspect WAL records (on the target cluster), we should always
run rewinding even in the STONITH-killed (or immediate-shutdown)
cases.

One possible way to detect promotion reliably is to look into timeline
history files. It is written immediately at promotion even on
standbys.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#6)
1 attachment(s)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

One possible way to detect promotion reliably is to look into timeline
history files. It is written immediately at promotion even on
standbys.

The attached seems to work. It uses timeline history files to identify
the source timeline. With this change pg_waldump no longer need to
wait for end-of-recovery to finish.

(It lacks doc part and test.. But I'm not sure how we can test this
behavior.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pg_rewind_fix.diff.txttext/plain; charset=us-asciiDownload
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 6cb288f099..2a407da1e4 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
 	int			fd;
 	char	   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
 	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
 	if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+	{
+		if (noerror && errno == ENOENT)
+			return NULL;
+
 		pg_fatal("could not open file \"%s\" for reading: %m",
 				 fullpath);
+	}
 
 	if (fstat(fd, &statbuf) < 0)
 		pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 54a853bd42..92e19042cb 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+	bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..92067d4f2c 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 									off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+				 bool noerror)
 {
 	PGconn	   *conn = ((libpq_source *) source)->conn;
 	PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
 	const char *paramValues[1];
 
 	paramValues[0] = path;
+
+	/*
+	 * check the existence of the file. We don't do this separately from
+	 * pg_read_binary_file so that server doesn't emit an error
+	 */
+	if (noerror)
+	{
+		res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+						   1, NULL, paramValues, NULL, NULL, 1);
+
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_fatal("could not stat remote file \"%s\": %s",
+					 path, PQresultErrorMessage(res));
+		}
+
+		/* sanity check the result set */
+		if (PQntuples(res) != 1)
+			pg_fatal("unexpected result set while stating remote file \"%s\"",
+					 path);
+
+		/* Return NULL if the file was not found */
+		if (PQgetisnull(res, 0, 0))
+			return NULL;
+
+		PQclear(res);
+	}
+
 	res = PQexecParams(conn, "SELECT pg_read_binary_file($1)",
 					   1, NULL, paramValues, NULL, NULL, 1);
 
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..fc2e1e9f11 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -28,7 +28,7 @@ typedef struct
 static void local_traverse_files(rewind_source *source,
 								 process_file_callback_t callback);
 static char *local_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static void local_queue_fetch_file(rewind_source *source, const char *path,
 								   size_t len);
 static void local_queue_fetch_range(rewind_source *source, const char *path,
@@ -63,9 +63,11 @@ local_traverse_files(rewind_source *source, process_file_callback_t callback)
 }
 
 static char *
-local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+local_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+	bool noerror)
 {
-	return slurpFile(((local_source *) source)->datadir, path, filesize);
+	return slurpFile(((local_source *) source)->datadir, path, filesize,
+					 noerror);
 }
 
 /*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..f9c7853f08 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -43,6 +43,8 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
+static TimeLineHistoryEntry *getTimelineHistory(ControlFileData *controlFile,
+												int *nentries);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -141,6 +143,7 @@ main(int argc, char **argv)
 	bool		rewind_needed;
 	bool		writerecoveryconf = false;
 	filemap_t  *filemap;
+	TimeLineID	source_tli;
 
 	pg_logging_init(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
@@ -311,7 +314,7 @@ main(int argc, char **argv)
 	 * need to make sure by themselves that the target cluster is in a clean
 	 * state.
 	 */
-	buffer = slurpFile(datadir_target, "global/pg_control", &size);
+	buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_target, buffer, size);
 	pg_free(buffer);
 
@@ -321,25 +324,47 @@ main(int argc, char **argv)
 	{
 		ensureCleanShutdown(argv[0]);
 
-		buffer = slurpFile(datadir_target, "global/pg_control", &size);
+		buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 		digestControlFile(&ControlFile_target, buffer, size);
 		pg_free(buffer);
 	}
 
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source, buffer, size);
 	pg_free(buffer);
 
 	sanityChecks();
 
+	/*
+	 * There may be a case where the source has been promoted but the
+	 * end-of-recovery checkpoint has not completed. In this case the soruce
+	 * control file is has a bit older content for this purpose.  Look into
+	 * timeline history file, which is refreshed up-to-date.
+	 */
+	source_tli = ControlFile_source.checkPointCopy.ThisTimeLineID;
+	if (ControlFile_target.checkPointCopy.ThisTimeLineID == source_tli)
+	{
+		int nentries;
+		TimeLineHistoryEntry *hist;
+
+		hist = getTimelineHistory(&ControlFile_source, &nentries);
+
+		/* last line of history file is the newest timeline */
+		if (nentries > 0 && hist[nentries - 1].tli > source_tli)
+		{
+			pg_log_info("source's actual timeline ID (%d) is newer than control file (%d)", hist[nentries - 1].tli, source_tli);
+			source_tli = hist[nentries - 1].tli;
+		}
+		pg_free(hist);
+	}
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (ControlFile_target.checkPointCopy.ThisTimeLineID == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -581,7 +606,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	 * Fetch the control file from the source last. This ensures that the
 	 * minRecoveryPoint is up-to-date.
 	 */
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source_after, buffer, size);
 	pg_free(buffer);
 
@@ -630,6 +655,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	 */
 	if (connstr_source)
 	{
+		int nentries;
+		TimeLineHistoryEntry *hist;
+		int i;
+
 		/*
 		 * The source is a live server. Like in an online backup, it's
 		 * important that we recover all the WAL that was generated while we
@@ -655,6 +684,29 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 
 			endrec = source->get_current_wal_insert_lsn(source);
 			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+
+			/*
+			 * Find the timeline ID corresponding to endrec on the source.
+			 *
+			 * In most cases we can rely on control file, but that is not the
+			 * case after promotion until end-of-recovery checkpoint completes.
+			 * Identify the timeline ID the hard way since we don't have a
+			 * easer way to detect that case.  In case where we failed to do
+			 * that, fall back to the control file's value.
+			 */
+			hist = getTimelineHistory(&ControlFile_source, &nentries);
+			if (hist[nentries - 1].tli != endtli)
+			{
+				for (i = 0; i < nentries; i++)
+				{
+					if ((hist[i].begin == 0 || hist[i].begin <= endrec) &&
+						(hist[i].end == 0   || endrec < hist[i].end))
+					{
+						endtli = hist[i].tli;
+						break;
+					}
+				}
+			}
 		}
 	}
 	else
@@ -804,9 +856,32 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 {
 	TimeLineHistoryEntry *history;
 	TimeLineID	tli;
+	TimeLineID	probe_tli;
 
 	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
+	Assert(tli > 0);
+	for (probe_tli = tli + 1 ;; probe_tli++)
+	{
+		char		path[MAXPGPATH];
+		char	   *histfile;
+
+		TLHistoryFilePath(path, probe_tli);
+
+		/* Get history file from appropriate source */
+		if (controlFile == &ControlFile_source)
+			histfile = source->fetch_file(source, path, NULL, true);
+		else if (controlFile == &ControlFile_target)
+			histfile = slurpFile(datadir_target, path, NULL, true);
+
+		if (!histfile)
+			break;
+
+		pg_free(histfile);
+	}
+
+	tli = probe_tli - 1;
+
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
 	 * and fake an entry with infinite start and end positions.
@@ -827,9 +902,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 
 		/* Get history file from appropriate source */
 		if (controlFile == &ControlFile_source)
-			histfile = source->fetch_file(source, path, NULL);
+			histfile = source->fetch_file(source, path, NULL, false);
 		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
+			histfile = slurpFile(datadir_target, path, NULL, false);
 		else
 			pg_fatal("invalid control file");
 
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 1310e86e75..6975848668 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -35,7 +35,7 @@ typedef struct rewind_source
 	 * handy for text files.
 	 */
 	char	   *(*fetch_file) (struct rewind_source *, const char *path,
-							   size_t *filesize);
+							   size_t *filesize, bool noerror);
 
 	/*
 	 * Request to fetch (part of) a file in the source system, specified by an
#8Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#6)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:

At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replication from primary
- [needs rewind] A

and you try to rewind A from C as the source?

Yes. I think it is a legit use case. That being said, like other
points, it might be acceptable.

This configuration is a case supported by pg_rewind, meaning that your
patch to check after minRecoveryPointTLI would be confusing when using
a standby as a source because the checkpoint needs to apply on its
primary to allow the TLI of the standby to go up. If you want to
provide to the user more context, a more meaningful way may be to rely
on an extra check for ControlFileData.state, I guess, as a promoted
cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is
cleared by the first post-promotion checkpoint, with
DB_IN_ARCHIVE_RECOVERY for a cascading standby.
--
Michael

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#8)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Tue, 7 Jun 2022 16:16:09 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Tue, Jun 07, 2022 at 12:39:38PM +0900, Kyotaro Horiguchi wrote:

At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman <jtc331@gmail.com> wrote in

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replication from primary
- [needs rewind] A

and you try to rewind A from C as the source?

Yes. I think it is a legit use case. That being said, like other
points, it might be acceptable.

This configuration is a case supported by pg_rewind, meaning that your
patch to check after minRecoveryPointTLI would be confusing when using
a standby as a source because the checkpoint needs to apply on its
primary to allow the TLI of the standby to go up. If you want to

Yeah, that what I meant.

provide to the user more context, a more meaningful way may be to rely
on an extra check for ControlFileData.state, I guess, as a promoted
cluster is marked as DB_IN_PRODUCTION before recoveryMinPoint is
cleared by the first post-promotion checkpoint, with
DB_IN_ARCHIVE_RECOVERY for a cascading standby.

Right. However, IIUC, checkpoint LSN/TLI is not updated at the
time. The point of the minRecoveryPoint check is to confirm that we
can read the timeline ID of the promoted source cluster from
checkPointCopy.ThisTimeLineID. But we cannot do that yet at the time
the cluster state moves to DB_IN_PRODUCTION. And a standby is in
DB_IN_ARCHIVE_RECOVERY since before the upstream promotes. It also
doesn't signal the reliability of checkPointCopy.ThisTimeLineID..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10vignesh ravichandran
admin@viggy28.dev
In reply to: James Coleman (#1)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

I think this is a good improvement and also like the option (on pg_rewind) to potentially send checkpoints to the source.

Personal anecdote. I was using stolon and frequently failing over. For sometime the rewind was failing that it wasn't required. Only learnt that it's the checkpoint on the source which was missing. 

References https://github.com/sorintlab/stolon/issues/601
And the fix https://github.com/sorintlab/stolon/pull/644
 https://github.com/sorintlab/stolon/issues/601

---- On Sat, 04 Jun 2022 05:59:12 -0700 James Coleman <mailto:jtc331@gmail.com> wrote ----

A few weeks back I sent a bug report [1] directly to the -bugs mailing
list, and I haven't seen any activity on it (maybe this is because I
emailed directly instead of using the form?), but I got some time to
take a look and concluded that a first-level fix is pretty simple.

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Specifically, this happens when the control file on the newly promoted
server looks like:

Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
...
Min recovery ending loc's timeline: 5

Attached is a patch that detects this condition and reports it as an
error to the user.

In the spirit of the new-ish "ensure shutdown" functionality I could
imagine extending this to automatically issue a checkpoint when this
situation is detected. I haven't started to code that up, however,
wanting to first get buy-in on that.

Thanks,
James Coleman

1: /messages/by-id/CAAaqYe8b2DBbooTprY4v=BiZEd9qBqVLq+FD9j617eQFjk1KvQ@mail.gmail.com

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#7)
1 attachment(s)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

One possible way to detect promotion reliably is to look into timeline
history files. It is written immediately at promotion even on
standbys.

The attached seems to work. It uses timeline history files to identify
the source timeline. With this change pg_waldump no longer need to
wait for end-of-recovery to finish.

(It lacks doc part and test.. But I'm not sure how we can test this
behavior.)

This is a revised version.

Revised getTimelineHistory()'s logic (refactored, and changed so that
it doesn't pick-up the wrong history files).

perform_rewind always identify endtli based on source's timeline
history.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pg_rewind_fix_2.diff.txttext/plain; charset=us-asciiDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..7f752b2ed0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -334,15 +334,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
 </programlisting>
   </para>
 
-  <para>
-   When executing <application>pg_rewind</application> using an online
-   cluster as source which has been recently promoted, it is necessary
-   to execute a <command>CHECKPOINT</command> after promotion such that its
-   control file reflects up-to-date timeline information, which is used by
-   <application>pg_rewind</application> to check if the target cluster
-   can be rewound using the designated source cluster.
-  </para>
-
   <refsect2>
    <title>How It Works</title>
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 6cb288f099..2a407da1e4 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
 	int			fd;
 	char	   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
 	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
 	if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+	{
+		if (noerror && errno == ENOENT)
+			return NULL;
+
 		pg_fatal("could not open file \"%s\" for reading: %m",
 				 fullpath);
+	}
 
 	if (fstat(fd, &statbuf) < 0)
 		pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 54a853bd42..92e19042cb 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+	bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..751c96e6e4 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 									off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+				 bool noerror)
 {
 	PGconn	   *conn = ((libpq_source *) source)->conn;
 	PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
 	const char *paramValues[1];
 
 	paramValues[0] = path;
+
+	/*
+	 * check the existence of the file. We do this before executing
+	 * pg_read_binary_file so that server doesn't emit an error
+	 */
+	if (noerror)
+	{
+		res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+						   1, NULL, paramValues, NULL, NULL, 1);
+
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_fatal("could not stat remote file \"%s\": %s",
+					 path, PQresultErrorMessage(res));
+		}
+
+		/* sanity check the result set */
+		if (PQntuples(res) != 1)
+			pg_fatal("unexpected result set while stating remote file \"%s\"",
+					 path);
+
+		/* Return NULL if the file was not found */
+		if (PQgetisnull(res, 0, 0))
+			return NULL;
+
+		PQclear(res);
+	}
+
 	res = PQexecParams(conn, "SELECT pg_read_binary_file($1)",
 					   1, NULL, paramValues, NULL, NULL, 1);
 
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..fc2e1e9f11 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -28,7 +28,7 @@ typedef struct
 static void local_traverse_files(rewind_source *source,
 								 process_file_callback_t callback);
 static char *local_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static void local_queue_fetch_file(rewind_source *source, const char *path,
 								   size_t len);
 static void local_queue_fetch_range(rewind_source *source, const char *path,
@@ -63,9 +63,11 @@ local_traverse_files(rewind_source *source, process_file_callback_t callback)
 }
 
 static char *
-local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+local_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+	bool noerror)
 {
-	return slurpFile(((local_source *) source)->datadir, path, filesize);
+	return slurpFile(((local_source *) source)->datadir, path, filesize,
+					 noerror);
 }
 
 /*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..220e6fbdc4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -43,6 +43,8 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
+static TimeLineHistoryEntry *getTimelineHistory(ControlFileData *controlFile,
+												int *nentries);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -70,9 +72,13 @@ bool		do_sync = true;
 bool		restore_wal = false;
 
 /* Target history */
-TimeLineHistoryEntry *targetHistory;
+TimeLineHistoryEntry *targetHistory = NULL;
 int			targetNentries;
 
+/* Source history */
+TimeLineHistoryEntry *sourceHistory = NULL;
+int			sourceNentries;
+
 /* Progress counters */
 uint64		fetch_size;
 uint64		fetch_done;
@@ -141,6 +147,7 @@ main(int argc, char **argv)
 	bool		rewind_needed;
 	bool		writerecoveryconf = false;
 	filemap_t  *filemap;
+	TimeLineID	source_tli;
 
 	pg_logging_init(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
@@ -311,7 +318,7 @@ main(int argc, char **argv)
 	 * need to make sure by themselves that the target cluster is in a clean
 	 * state.
 	 */
-	buffer = slurpFile(datadir_target, "global/pg_control", &size);
+	buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_target, buffer, size);
 	pg_free(buffer);
 
@@ -321,25 +328,43 @@ main(int argc, char **argv)
 	{
 		ensureCleanShutdown(argv[0]);
 
-		buffer = slurpFile(datadir_target, "global/pg_control", &size);
+		buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 		digestControlFile(&ControlFile_target, buffer, size);
 		pg_free(buffer);
 	}
 
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source, buffer, size);
 	pg_free(buffer);
 
 	sanityChecks();
 
+	/* Retrieve timelines for both source and target */
+	sourceHistory =	getTimelineHistory(&ControlFile_source, &sourceNentries);
+	targetHistory =	getTimelineHistory(&ControlFile_target, &targetNentries);
+
+	/*
+	 * If the source just has been promoted but the end-of-recovery checkpoint
+	 * has not completed, the soruce control file has a bit older content for
+	 * identifying the source's timeline.  Instead, look into timeline history,
+	 * which is always refreshed up-to-date.
+	 */
+	source_tli = ControlFile_source.checkPointCopy.ThisTimeLineID;
+
+	if (sourceHistory[sourceNentries - 1].tli > source_tli)
+	{
+		pg_log_info("source's actual timeline ID (%d) is newer than control file (%d)",
+					sourceHistory[sourceNentries - 1].tli, source_tli);
+		source_tli = sourceHistory[sourceNentries - 1].tli;
+	}
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (ControlFile_target.checkPointCopy.ThisTimeLineID == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -581,7 +606,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	 * Fetch the control file from the source last. This ensures that the
 	 * minRecoveryPoint is up-to-date.
 	 */
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source_after, buffer, size);
 	pg_free(buffer);
 
@@ -650,11 +675,34 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 			 * Source is a production, non-standby, server. We must replay to
 			 * the last WAL insert location.
 			 */
+			int i;
+
 			if (ControlFile_source_after.state != DB_IN_PRODUCTION)
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+
+			/*
+			 * Find the timeline ID corresponding to endrec on the source.
+			 *
+			 * In most cases we can use the TLI in the source control file, but
+			 * that is not the case after promotion until end-of-recovery
+			 * checkpoint completes, where the control file is a bit old for
+			 * this purpose.  We find the TLI for endrec from sourceHistory.
+			 */
+			for (i = 0; i < sourceNentries; i++)
+			{
+				if ((sourceHistory[i].begin == 0 ||
+					 sourceHistory[i].begin <= endrec) &&
+					(sourceHistory[i].end == 0 ||
+					 endrec < sourceHistory[i].end))
+				{
+					endtli = sourceHistory[i].tli;
+					break;
+				}
+			}
+			if (i == sourceNentries)
+				pg_fatal("source server's current insert LSN was not found in history file");
 		}
 	}
 	else
@@ -796,46 +844,83 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve the latest timeline history for given control file which should
+ * behold either source or target.
+ *
+ * This works on the assumption that the timeline IDs of existing history files
+ * are contiguous up to the latest history file. This is true for
+ * recently-promoted servers.  See findNewestTimeLine() for this assumption.
  */
 static TimeLineHistoryEntry *
 getTimelineHistory(ControlFileData *controlFile, int *nentries)
 {
-	TimeLineHistoryEntry *history;
+	TimeLineHistoryEntry *history = NULL;
 	TimeLineID	tli;
+	TimeLineID	probe_tli;
 
 	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
+	Assert(tli > 0);
+
+	/* Probe history files */
+	for (probe_tli = tli ;; probe_tli++)
+	{
+		char		path[MAXPGPATH];
+		char	   *histfile;
+		TimeLineHistoryEntry *tmphistory;
+		int			nent;
+		int			i;
+
+		if (probe_tli < 2)
+			continue;
+
+		TLHistoryFilePath(path, probe_tli);
+
+		/* Get history file from appropriate source */
+		if (controlFile == &ControlFile_source)
+			histfile = source->fetch_file(source, path, NULL, true);
+		else if (controlFile == &ControlFile_target)
+			histfile = slurpFile(datadir_target, path, NULL, true);
+		else
+			pg_fatal("invalid control file");
+
+		/* no such history file, exit */
+		if (!histfile)
+			break;
+
+		/* preserve the history if we're part of it */
+		tmphistory = rewind_parseTimeLineHistory(histfile, probe_tli, &nent);
+		pg_free(histfile);
+
+		for (i = 0 ; i < nent ; i++)
+		{
+			if (tmphistory[i].tli == tli)
+			{
+				if (history)
+					pg_free(history);
+
+				history = tmphistory;
+				*nentries = nent;
+				break;
+			}
+		}
+		if (tmphistory != history)
+			pg_free(tmphistory);
+	}
+
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
 	 * and fake an entry with infinite start and end positions.
 	 */
-	if (tli == 1)
+	if (!history)
 	{
+		Assert (tli == 1);
+
 		history = (TimeLineHistoryEntry *) pg_malloc(sizeof(TimeLineHistoryEntry));
 		history->tli = tli;
 		history->begin = history->end = InvalidXLogRecPtr;
 		*nentries = 1;
 	}
-	else
-	{
-		char		path[MAXPGPATH];
-		char	   *histfile;
-
-		TLHistoryFilePath(path, tli);
-
-		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
-			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
-		else
-			pg_fatal("invalid control file");
-
-		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
-		pg_free(histfile);
-	}
 
 	if (debug)
 	{
@@ -879,15 +964,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 static void
 findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -910,7 +989,6 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
 		*tliIndex = i;
 
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 1310e86e75..6975848668 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -35,7 +35,7 @@ typedef struct rewind_source
 	 * handy for text files.
 	 */
 	char	   *(*fetch_file) (struct rewind_source *, const char *path,
-							   size_t *filesize);
+							   size_t *filesize, bool noerror);
 
 	/*
 	 * Request to fetch (part of) a file in the source system, specified by an
#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#11)
1 attachment(s)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Wed, 08 Jun 2022 18:15:09 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 07 Jun 2022 16:05:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 07 Jun 2022 12:39:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

One possible way to detect promotion reliably is to look into timeline
history files. It is written immediately at promotion even on
standbys.

The attached seems to work. It uses timeline history files to identify
the source timeline. With this change pg_waldump no longer need to
wait for end-of-recovery to finish.

(It lacks doc part and test.. But I'm not sure how we can test this
behavior.)

This is a revised version.

Revised getTimelineHistory()'s logic (refactored, and changed so that
it doesn't pick-up the wrong history files).

perform_rewind always identify endtli based on source's timeline
history.

No need to "search" history file to identify it. The latest timeline
must be that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pg_rewind_fix_3.diff.txttext/plain; charset=us-asciiDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..7f752b2ed0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -334,15 +334,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
 </programlisting>
   </para>
 
-  <para>
-   When executing <application>pg_rewind</application> using an online
-   cluster as source which has been recently promoted, it is necessary
-   to execute a <command>CHECKPOINT</command> after promotion such that its
-   control file reflects up-to-date timeline information, which is used by
-   <application>pg_rewind</application> to check if the target cluster
-   can be rewound using the designated source cluster.
-  </para>
-
   <refsect2>
    <title>How It Works</title>
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 6cb288f099..2a407da1e4 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
 	int			fd;
 	char	   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
 	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
 	if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+	{
+		if (noerror && errno == ENOENT)
+			return NULL;
+
 		pg_fatal("could not open file \"%s\" for reading: %m",
 				 fullpath);
+	}
 
 	if (fstat(fd, &statbuf) < 0)
 		pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 54a853bd42..92e19042cb 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+	bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..751c96e6e4 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 									off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+				 bool noerror)
 {
 	PGconn	   *conn = ((libpq_source *) source)->conn;
 	PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
 	const char *paramValues[1];
 
 	paramValues[0] = path;
+
+	/*
+	 * check the existence of the file. We do this before executing
+	 * pg_read_binary_file so that server doesn't emit an error
+	 */
+	if (noerror)
+	{
+		res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+						   1, NULL, paramValues, NULL, NULL, 1);
+
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_fatal("could not stat remote file \"%s\": %s",
+					 path, PQresultErrorMessage(res));
+		}
+
+		/* sanity check the result set */
+		if (PQntuples(res) != 1)
+			pg_fatal("unexpected result set while stating remote file \"%s\"",
+					 path);
+
+		/* Return NULL if the file was not found */
+		if (PQgetisnull(res, 0, 0))
+			return NULL;
+
+		PQclear(res);
+	}
+
 	res = PQexecParams(conn, "SELECT pg_read_binary_file($1)",
 					   1, NULL, paramValues, NULL, NULL, 1);
 
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..fc2e1e9f11 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -28,7 +28,7 @@ typedef struct
 static void local_traverse_files(rewind_source *source,
 								 process_file_callback_t callback);
 static char *local_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static void local_queue_fetch_file(rewind_source *source, const char *path,
 								   size_t len);
 static void local_queue_fetch_range(rewind_source *source, const char *path,
@@ -63,9 +63,11 @@ local_traverse_files(rewind_source *source, process_file_callback_t callback)
 }
 
 static char *
-local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+local_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+	bool noerror)
 {
-	return slurpFile(((local_source *) source)->datadir, path, filesize);
+	return slurpFile(((local_source *) source)->datadir, path, filesize,
+					 noerror);
 }
 
 /*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..221e97c5d8 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -43,6 +43,8 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
+static TimeLineHistoryEntry *getTimelineHistory(ControlFileData *controlFile,
+												int *nentries);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -70,9 +72,13 @@ bool		do_sync = true;
 bool		restore_wal = false;
 
 /* Target history */
-TimeLineHistoryEntry *targetHistory;
+TimeLineHistoryEntry *targetHistory = NULL;
 int			targetNentries;
 
+/* Source history */
+TimeLineHistoryEntry *sourceHistory = NULL;
+int			sourceNentries;
+
 /* Progress counters */
 uint64		fetch_size;
 uint64		fetch_done;
@@ -141,6 +147,7 @@ main(int argc, char **argv)
 	bool		rewind_needed;
 	bool		writerecoveryconf = false;
 	filemap_t  *filemap;
+	TimeLineID	source_tli;
 
 	pg_logging_init(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
@@ -311,7 +318,7 @@ main(int argc, char **argv)
 	 * need to make sure by themselves that the target cluster is in a clean
 	 * state.
 	 */
-	buffer = slurpFile(datadir_target, "global/pg_control", &size);
+	buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_target, buffer, size);
 	pg_free(buffer);
 
@@ -321,25 +328,43 @@ main(int argc, char **argv)
 	{
 		ensureCleanShutdown(argv[0]);
 
-		buffer = slurpFile(datadir_target, "global/pg_control", &size);
+		buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 		digestControlFile(&ControlFile_target, buffer, size);
 		pg_free(buffer);
 	}
 
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source, buffer, size);
 	pg_free(buffer);
 
 	sanityChecks();
 
+	/* Retrieve timelines for both source and target */
+	sourceHistory =	getTimelineHistory(&ControlFile_source, &sourceNentries);
+	targetHistory =	getTimelineHistory(&ControlFile_target, &targetNentries);
+
+	/*
+	 * If the source just has been promoted but the end-of-recovery checkpoint
+	 * has not completed, the soruce control file has a bit older content for
+	 * identifying the source's timeline.  Instead, look into timeline history,
+	 * which is always refreshed up-to-date.
+	 */
+	source_tli = ControlFile_source.checkPointCopy.ThisTimeLineID;
+
+	if (sourceHistory[sourceNentries - 1].tli > source_tli)
+	{
+		pg_log_info("source's actual timeline ID (%d) is newer than control file (%d)",
+					sourceHistory[sourceNentries - 1].tli, source_tli);
+		source_tli = sourceHistory[sourceNentries - 1].tli;
+	}
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (ControlFile_target.checkPointCopy.ThisTimeLineID == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -581,7 +606,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	 * Fetch the control file from the source last. This ensures that the
 	 * minRecoveryPoint is up-to-date.
 	 */
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source_after, buffer, size);
 	pg_free(buffer);
 
@@ -654,7 +679,22 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+
+			/*
+			 * Find the timeline ID corresponding to endrec on the source.
+			 *
+			 * In most cases we can use the TLI in the source control file, but
+			 * that is not the case after promotion until end-of-recovery
+			 * checkpoint completes, where the control file is a bit old for
+			 * this purpose.  It should be the latest timeline in the source's
+			 * history file.
+			 */
+			if (!((sourceHistory[sourceNentries - 1].begin == 0 ||
+				   sourceHistory[sourceNentries - 1].begin <= endrec) &&
+				  sourceHistory[sourceNentries - 1].end == 0))
+				pg_fatal("source server's current insert LSN was not on the latest timeline in history file");
+
+			endtli = sourceHistory[sourceNentries - 1].tli;
 		}
 	}
 	else
@@ -796,46 +836,83 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve the latest timeline history for given control file which should
+ * behold either source or target.
+ *
+ * This works on the assumption that the timeline IDs of existing history files
+ * are contiguous up to the latest history file. This is true for
+ * recently-promoted servers.  See findNewestTimeLine() for this assumption.
  */
 static TimeLineHistoryEntry *
 getTimelineHistory(ControlFileData *controlFile, int *nentries)
 {
-	TimeLineHistoryEntry *history;
+	TimeLineHistoryEntry *history = NULL;
 	TimeLineID	tli;
+	TimeLineID	probe_tli;
 
 	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
+	Assert(tli > 0);
+
+	/* Probe history files */
+	for (probe_tli = tli ;; probe_tli++)
+	{
+		char		path[MAXPGPATH];
+		char	   *histfile;
+		TimeLineHistoryEntry *tmphistory;
+		int			nent;
+		int			i;
+
+		if (probe_tli < 2)
+			continue;
+
+		TLHistoryFilePath(path, probe_tli);
+
+		/* Get history file from appropriate source */
+		if (controlFile == &ControlFile_source)
+			histfile = source->fetch_file(source, path, NULL, true);
+		else if (controlFile == &ControlFile_target)
+			histfile = slurpFile(datadir_target, path, NULL, true);
+		else
+			pg_fatal("invalid control file");
+
+		/* no such history file, exit */
+		if (!histfile)
+			break;
+
+		/* preserve the history if we're part of it */
+		tmphistory = rewind_parseTimeLineHistory(histfile, probe_tli, &nent);
+		pg_free(histfile);
+
+		for (i = 0 ; i < nent ; i++)
+		{
+			if (tmphistory[i].tli == tli)
+			{
+				if (history)
+					pg_free(history);
+
+				history = tmphistory;
+				*nentries = nent;
+				break;
+			}
+		}
+		if (tmphistory != history)
+			pg_free(tmphistory);
+	}
+
 	/*
 	 * Timeline 1 does not have a history file, so there is no need to check
 	 * and fake an entry with infinite start and end positions.
 	 */
-	if (tli == 1)
+	if (!history)
 	{
+		Assert (tli == 1);
+
 		history = (TimeLineHistoryEntry *) pg_malloc(sizeof(TimeLineHistoryEntry));
 		history->tli = tli;
 		history->begin = history->end = InvalidXLogRecPtr;
 		*nentries = 1;
 	}
-	else
-	{
-		char		path[MAXPGPATH];
-		char	   *histfile;
-
-		TLHistoryFilePath(path, tli);
-
-		/* Get history file from appropriate source */
-		if (controlFile == &ControlFile_source)
-			histfile = source->fetch_file(source, path, NULL);
-		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
-		else
-			pg_fatal("invalid control file");
-
-		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
-		pg_free(histfile);
-	}
 
 	if (debug)
 	{
@@ -879,15 +956,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 static void
 findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -910,7 +981,6 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
 		*tliIndex = i;
 
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 1310e86e75..6975848668 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -35,7 +35,7 @@ typedef struct rewind_source
 	 * handy for text files.
 	 */
 	char	   *(*fetch_file) (struct rewind_source *, const char *path,
-							   size_t *filesize);
+							   size_t *filesize, bool noerror);
 
 	/*
 	 * Request to fetch (part of) a file in the source system, specified by an
#13Robert Haas
robertmhaas@gmail.com
In reply to: James Coleman (#1)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14James Coleman
jtc331@gmail.com
In reply to: Robert Haas (#13)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

I think (someone can correct me if I'm wrong) that in theory the
mechanisms would support the source and target being on the same
timeline, but in practice that presents problems since you'd not have
an LSN you could detect as the divergence point. If we allowed passing
"rewind to" point LSN value, then that (again, as far as I understand
it) would work, but it's a different use case. Specifically I wouldn't
want that option to need to be used for this particular case since in
my example there is in fact a real divergence point that we should be
detecting automatically.

Thanks,
James Coleman

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

Robert Haas <robertmhaas@gmail.com> writes:

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

That seems like a fairly bad idea. For example, if you've already
archived some WAL segments past the rewind target, there will shortly
be two versions of truth about what that part of the WAL space contains,
and your archiver will either spit up or do probably-the-wrong-thing.

regards, tom lane

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Tue, Jul 5, 2022 at 2:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

That seems like a fairly bad idea. For example, if you've already
archived some WAL segments past the rewind target, there will shortly
be two versions of truth about what that part of the WAL space contains,
and your archiver will either spit up or do probably-the-wrong-thing.

Well, only if you void the warranty. If you rewind the ex-primary to
the LSN where the new primary is replaying and tell it to start
replaying from there and follow the new primary's subsequent switch
onto a new timeline, there's no split-brain problem.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: James Coleman (#14)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

At Tue, 5 Jul 2022 14:46:13 -0400, James Coleman <jtc331@gmail.com> wrote in

On Tue, Jul 5, 2022 at 2:39 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jun 4, 2022 at 8:59 AM James Coleman <jtc331@gmail.com> wrote:

A quick background refresher: after promoting a standby rewinding the
former primary requires that a checkpoint have been completed on the
new primary after promotion. This is correctly documented. However
pg_rewind incorrectly reports to the user that a rewind isn't
necessary because the source and target are on the same timeline.

Is there anything intrinsic to the mechanism of operation of pg_rewind
that requires a timeline change, or could we just rewind within the
same timeline to an earlier LSN? In other words, maybe we could just
remove this limitation of pg_rewind, and then perhaps it wouldn't be
necessary to determine what the new timeline is.

I think (someone can correct me if I'm wrong) that in theory the
mechanisms would support the source and target being on the same
timeline, but in practice that presents problems since you'd not have
an LSN you could detect as the divergence point. If we allowed passing
"rewind to" point LSN value, then that (again, as far as I understand
it) would work, but it's a different use case. Specifically I wouldn't
want that option to need to be used for this particular case since in
my example there is in fact a real divergence point that we should be
detecting automatically.

The point of pg_rewind is finding diverging point then finding all
blocks modified in the dead history (from the diverging point) and
"replace" them with those of the live history. In that sense, to be
exact, pg_rewind does not "rewind" a cluster. If no diverging point,
the last LSN of the cluster getting behind (as target cluster?) is
that and just no need to replace a block at all because no WAL exists
(on the cluster being behind) after the last LSN.

The issue here is pg_rewind looks into control file to determine the
soruce timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks are already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18Noname
kuroda.keisuke@nttcom.co.jp
In reply to: Kyotaro Horiguchi (#17)
1 attachment(s)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

Hi, hackers

The issue here is pg_rewind looks into control file to determine the
soruce timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks are already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.

I think this change is a good one because if I want
pg_rewind to run automatically after a promotion,
I don't have to wait for the checkpoint to complete.

The attached patch is Horiguchi-san's patch with
additional tests. The tests are based on James's tests,
"010_no_checkpoint_after_promotion.pl" tests that
pg_rewind is successfully executed without running
checkpoint after promote.

Best Regards,
Keisuke Kuroda
NTT COMWARE

Attachments:

pg_rewind_fix_4.diff.patchtext/x-diff; name=pg_rewind_fix_4.diff.patchDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..7f752b2ed0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -334,15 +334,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
 </programlisting>
   </para>
 
-  <para>
-   When executing <application>pg_rewind</application> using an online
-   cluster as source which has been recently promoted, it is necessary
-   to execute a <command>CHECKPOINT</command> after promotion such that its
-   control file reflects up-to-date timeline information, which is used by
-   <application>pg_rewind</application> to check if the target cluster
-   can be rewound using the designated source cluster.
-  </para>
-
   <refsect2>
    <title>How It Works</title>
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index db190bcba7..fe98f47519 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -309,9 +309,11 @@ sync_target_dir(void)
  * buffer is actually *filesize + 1. That's handy when reading a text file.
  * This function can be used to read binary files as well, you can just
  * ignore the zero-terminator in that case.
+ *
+ * If noerror is true, returns NULL when the file is not found.
  */
 char *
-slurpFile(const char *datadir, const char *path, size_t *filesize)
+slurpFile(const char *datadir, const char *path, size_t *filesize, bool noerror)
 {
 	int			fd;
 	char	   *buffer;
@@ -323,8 +325,13 @@ slurpFile(const char *datadir, const char *path, size_t *filesize)
 	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
 
 	if ((fd = open(fullpath, O_RDONLY | PG_BINARY, 0)) == -1)
+	{
+		if (noerror && errno == ENOENT)
+			return NULL;
+
 		pg_fatal("could not open file \"%s\" for reading: %m",
 				 fullpath);
+	}
 
 	if (fstat(fd, &statbuf) < 0)
 		pg_fatal("could not open file \"%s\" for reading: %m",
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index e6277c4631..559787a072 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -21,7 +21,8 @@ extern void create_target(file_entry_t *entry);
 extern void remove_target(file_entry_t *entry);
 extern void sync_target_dir(void);
 
-extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
+extern char *slurpFile(const char *datadir, const char *path, size_t *filesize,
+       bool noerror);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
 extern void traverse_datadir(const char *datadir, process_file_callback_t callback);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..751c96e6e4 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -68,7 +68,7 @@ static void libpq_queue_fetch_range(rewind_source *source, const char *path,
 									off_t off, size_t len);
 static void libpq_finish_fetch(rewind_source *source);
 static char *libpq_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static XLogRecPtr libpq_get_current_wal_insert_lsn(rewind_source *source);
 static void libpq_destroy(rewind_source *source);
 
@@ -620,9 +620,12 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 
 /*
  * Fetch a single file as a malloc'd buffer.
+ *
+ * If noerror is true, returns NULL if pg_read_binary_file() failed.
  */
 static char *
-libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+				 bool noerror)
 {
 	PGconn	   *conn = ((libpq_source *) source)->conn;
 	PGresult   *res;
@@ -631,6 +634,34 @@ libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize)
 	const char *paramValues[1];
 
 	paramValues[0] = path;
+
+	/*
+	 * check the existence of the file. We do this before executing
+	 * pg_read_binary_file so that server doesn't emit an error
+	 */
+	if (noerror)
+	{
+		res = PQexecParams(conn, "SELECT pg_stat_file($1, true)",
+						   1, NULL, paramValues, NULL, NULL, 1);
+
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_fatal("could not stat remote file \"%s\": %s",
+					 path, PQresultErrorMessage(res));
+		}
+
+		/* sanity check the result set */
+		if (PQntuples(res) != 1)
+			pg_fatal("unexpected result set while stating remote file \"%s\"",
+					 path);
+
+		/* Return NULL if the file was not found */
+		if (PQgetisnull(res, 0, 0))
+			return NULL;
+
+		PQclear(res);
+	}
+
 	res = PQexecParams(conn, "SELECT pg_read_binary_file($1)",
 					   1, NULL, paramValues, NULL, NULL, 1);
 
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 2e50485c39..fc2e1e9f11 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -28,7 +28,7 @@ typedef struct
 static void local_traverse_files(rewind_source *source,
 								 process_file_callback_t callback);
 static char *local_fetch_file(rewind_source *source, const char *path,
-							  size_t *filesize);
+							  size_t *filesize, bool noerror);
 static void local_queue_fetch_file(rewind_source *source, const char *path,
 								   size_t len);
 static void local_queue_fetch_range(rewind_source *source, const char *path,
@@ -63,9 +63,11 @@ local_traverse_files(rewind_source *source, process_file_callback_t callback)
 }
 
 static char *
-local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
+local_fetch_file(rewind_source *source, const char *path, size_t *filesize,
+	bool noerror)
 {
-	return slurpFile(((local_source *) source)->datadir, path, filesize);
+	return slurpFile(((local_source *) source)->datadir, path, filesize,
+					 noerror);
 }
 
 /*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3cd77c09b1..75a7df0b1f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -43,6 +43,8 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
+static TimeLineHistoryEntry *getTimelineHistory(ControlFileData *controlFile,
+												int *nentries);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -70,9 +72,13 @@ bool		do_sync = true;
 bool		restore_wal = false;
 
 /* Target history */
-TimeLineHistoryEntry *targetHistory;
+TimeLineHistoryEntry *targetHistory = NULL;
 int			targetNentries;
 
+/* Source history */
+TimeLineHistoryEntry *sourceHistory = NULL;
+int			sourceNentries;
+
 /* Progress counters */
 uint64		fetch_size;
 uint64		fetch_done;
@@ -141,6 +147,7 @@ main(int argc, char **argv)
 	bool		rewind_needed;
 	bool		writerecoveryconf = false;
 	filemap_t  *filemap;
+	TimeLineID	source_tli;
 
 	pg_logging_init(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
@@ -311,7 +318,7 @@ main(int argc, char **argv)
 	 * need to make sure by themselves that the target cluster is in a clean
 	 * state.
 	 */
-	buffer = slurpFile(datadir_target, "global/pg_control", &size);
+	buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_target, buffer, size);
 	pg_free(buffer);
 
@@ -321,25 +328,43 @@ main(int argc, char **argv)
 	{
 		ensureCleanShutdown(argv[0]);
 
-		buffer = slurpFile(datadir_target, "global/pg_control", &size);
+		buffer = slurpFile(datadir_target, "global/pg_control", &size, false);
 		digestControlFile(&ControlFile_target, buffer, size);
 		pg_free(buffer);
 	}
 
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source, buffer, size);
 	pg_free(buffer);
 
 	sanityChecks();
 
+	/* Retrieve timelines for both source and target */
+	sourceHistory =	getTimelineHistory(&ControlFile_source, &sourceNentries);
+	targetHistory =	getTimelineHistory(&ControlFile_target, &targetNentries);
+
+	/*
+	 * If the source just has been promoted but the end-of-recovery checkpoint
+	 * has not completed, the soruce control file has a bit older content for
+	 * identifying the source's timeline.  Instead, look into timeline history,
+	 * which is always refreshed up-to-date.
+	 */
+	source_tli = ControlFile_source.checkPointCopy.ThisTimeLineID;
+
+	if (sourceHistory[sourceNentries - 1].tli > source_tli)
+	{
+		pg_log_info("source's actual timeline ID (%d) is newer than control file (%d)",
+					sourceHistory[sourceNentries - 1].tli, source_tli);
+		source_tli = sourceHistory[sourceNentries - 1].tli;
+	}
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (ControlFile_target.checkPointCopy.ThisTimeLineID == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline");
 		rewind_needed = false;
@@ -581,7 +606,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	 * Fetch the control file from the source last. This ensures that the
 	 * minRecoveryPoint is up-to-date.
 	 */
-	buffer = source->fetch_file(source, "global/pg_control", &size);
+	buffer = source->fetch_file(source, "global/pg_control", &size, false);
 	digestControlFile(&ControlFile_source_after, buffer, size);
 	pg_free(buffer);
 
@@ -654,7 +679,22 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 				pg_fatal("source system was in unexpected state at end of rewind");
 
 			endrec = source->get_current_wal_insert_lsn(source);
-			endtli = ControlFile_source_after.checkPointCopy.ThisTimeLineID;
+
+			/*
+			 * Find the timeline ID corresponding to endrec on the source.
+			 *
+			 * In most cases we can use the TLI in the source control file, but
+			 * that is not the case after promotion until end-of-recovery
+			 * checkpoint completes, where the control file is a bit old for
+			 * this purpose.  It should be the latest timeline in the source's
+			 * history file.
+			 */
+			if (!((sourceHistory[sourceNentries - 1].begin == 0 ||
+				   sourceHistory[sourceNentries - 1].begin <= endrec) &&
+				  sourceHistory[sourceNentries - 1].end == 0))
+				pg_fatal("source server's current insert LSN was not on the latest timeline in history file");
+
+			endtli = sourceHistory[sourceNentries - 1].tli;
 		}
 	}
 	else
@@ -796,45 +836,82 @@ MinXLogRecPtr(XLogRecPtr a, XLogRecPtr b)
 }
 
 /*
- * Retrieve timeline history for given control file which should behold
- * either source or target.
+ * Retrieve the latest timeline history for given control file which should
+ * behold either source or target.
+ *
+ * This works on the assumption that the timeline IDs of existing history files
+ * are contiguous up to the latest history file. This is true for
+ * recently-promoted servers.  See findNewestTimeLine() for this assumption.
  */
 static TimeLineHistoryEntry *
 getTimelineHistory(ControlFileData *controlFile, int *nentries)
 {
-	TimeLineHistoryEntry *history;
+	TimeLineHistoryEntry *history = NULL;
 	TimeLineID	tli;
+	TimeLineID	probe_tli;
 
 	tli = controlFile->checkPointCopy.ThisTimeLineID;
 
-	/*
-	 * Timeline 1 does not have a history file, so there is no need to check
-	 * and fake an entry with infinite start and end positions.
-	 */
-	if (tli == 1)
-	{
-		history = (TimeLineHistoryEntry *) pg_malloc(sizeof(TimeLineHistoryEntry));
-		history->tli = tli;
-		history->begin = history->end = InvalidXLogRecPtr;
-		*nentries = 1;
-	}
-	else
+	Assert(tli > 0);
+
+	/* Probe history files */
+	for (probe_tli = tli ;; probe_tli++)
 	{
 		char		path[MAXPGPATH];
 		char	   *histfile;
+		TimeLineHistoryEntry *tmphistory;
+		int			nent;
+		int			i;
+
+		if (probe_tli < 2)
+			continue;
 
-		TLHistoryFilePath(path, tli);
+		TLHistoryFilePath(path, probe_tli);
 
 		/* Get history file from appropriate source */
 		if (controlFile == &ControlFile_source)
-			histfile = source->fetch_file(source, path, NULL);
+			histfile = source->fetch_file(source, path, NULL, true);
 		else if (controlFile == &ControlFile_target)
-			histfile = slurpFile(datadir_target, path, NULL);
+			histfile = slurpFile(datadir_target, path, NULL, true);
 		else
 			pg_fatal("invalid control file");
 
-		history = rewind_parseTimeLineHistory(histfile, tli, nentries);
+		/* no such history file, exit */
+		if (!histfile)
+			break;
+
+		/* preserve the history if we're part of it */
+		tmphistory = rewind_parseTimeLineHistory(histfile, probe_tli, &nent);
 		pg_free(histfile);
+
+		for (i = 0 ; i < nent ; i++)
+		{
+			if (tmphistory[i].tli == tli)
+			{
+				if (history)
+					pg_free(history);
+
+				history = tmphistory;
+				*nentries = nent;
+				break;
+			}
+		}
+		if (tmphistory != history)
+			pg_free(tmphistory);
+	}
+
+	/*
+	 * Timeline 1 does not have a history file, so there is no need to check
+	 * and fake an entry with infinite start and end positions.
+	 */
+	if (!history)
+	{
+		Assert (tli == 1);
+
+		history = (TimeLineHistoryEntry *) pg_malloc(sizeof(TimeLineHistoryEntry));
+		history->tli = tli;
+		history->begin = history->end = InvalidXLogRecPtr;
+		*nentries = 1;
 	}
 
 	if (debug)
@@ -879,15 +956,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 static void
 findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 {
-	TimeLineHistoryEntry *sourceHistory;
-	int			sourceNentries;
 	int			i,
 				n;
 
-	/* Retrieve timelines for both source and target */
-	sourceHistory = getTimelineHistory(&ControlFile_source, &sourceNentries);
-	targetHistory = getTimelineHistory(&ControlFile_target, &targetNentries);
-
 	/*
 	 * Trace the history forward, until we hit the timeline diverge. It may
 	 * still be possible that the source and target nodes used the same
@@ -910,7 +981,6 @@ findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex)
 		*recptr = MinXLogRecPtr(sourceHistory[i].end, targetHistory[i].end);
 		*tliIndex = i;
 
-		pg_free(sourceHistory);
 		return;
 	}
 	else
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 1310e86e75..6975848668 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -35,7 +35,7 @@ typedef struct rewind_source
 	 * handy for text files.
 	 */
 	char	   *(*fetch_file) (struct rewind_source *, const char *path,
-							   size_t *filesize);
+							   size_t *filesize, bool noerror);
 
 	/*
 	 * Request to fetch (part of) a file in the source system, specified by an
diff --git a/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl b/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl
new file mode 100644
index 0000000000..a8512c1db5
--- /dev/null
+++ b/src/bin/pg_rewind/t/010_no_checkpoint_after_promotion.pl
@@ -0,0 +1,76 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use RewindTest;
+
+RewindTest::setup_cluster("remote");
+RewindTest::start_primary();
+
+# Create a test table and insert a row in primary.
+primary_psql("CREATE TABLE tbl1 (d text)");
+primary_psql("INSERT INTO tbl1 VALUES ('in primary')");
+
+# This test table will be used to test truncation, i.e. the table
+# is extended in the old primary after promotion
+primary_psql("CREATE TABLE trunc_tbl (d text)");
+primary_psql("INSERT INTO trunc_tbl VALUES ('in primary')");
+
+# This test table will be used to test the "copy-tail" case, i.e. the
+# table is truncated in the old primary after promotion
+primary_psql("CREATE TABLE tail_tbl (id integer, d text)");
+primary_psql("INSERT INTO tail_tbl VALUES (0, 'in primary')");
+
+# This test table is dropped in the old primary after promotion.
+primary_psql("CREATE TABLE drop_tbl (d text)");
+primary_psql("INSERT INTO drop_tbl VALUES ('in primary')");
+
+primary_psql("CHECKPOINT");
+
+RewindTest::create_standby("remote");
+
+# Insert additional data on primary that will be replicated to standby
+primary_psql("INSERT INTO tbl1 values ('in primary, before promotion')");
+primary_psql(
+  "INSERT INTO trunc_tbl values ('in primary, before promotion')");
+primary_psql(
+  "INSERT INTO tail_tbl SELECT g, 'in primary, before promotion: ' || g FROM generate_series(1, 10000) g"
+);
+
+primary_psql('CHECKPOINT');
+
+RewindTest::promote_standby('skip_checkpoint' => 1);
+
+# Insert a row in the old primary. This causes the primary and standby
+# to have "diverged", it's no longer possible to just apply the
+# standy's logs over primary directory - you need to rewind.
+primary_psql("INSERT INTO tbl1 VALUES ('in primary, after promotion')");
+
+# Also insert a new row in the standby, which won't be present in the
+# old primary.
+standby_psql("INSERT INTO tbl1 VALUES ('in standby, after promotion')");
+
+$node_primary->stop;
+
+my $primary_pgdata = $node_primary->data_dir;
+my $standby_connstr = $node_standby->connstr('postgres');
+command_checks_all(
+  [
+    'pg_rewind',       '--dry-run',
+    "--source-server", $standby_connstr,
+    '--target-pgdata', $primary_pgdata,
+    '--no-sync',       '--no-ensure-shutdown'
+  ],
+  0,
+  [],
+  [qr/pg_rewind: source's actual timeline ID \(2\) is newer than control file \(1\)/],
+  'pg_rewind no checkpoint after promotion');
+
+done_testing();
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 98b66b01f8..89fddc1656 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -188,6 +188,11 @@ primary_conninfo='$connstr_primary'
 
 sub promote_standby
 {
+	my (%options) = (
+		'skip_checkpoint' => 0,
+		@_
+	);
+
 	#### Now run the test-specific parts to run after standby has been started
 	# up standby
 
@@ -198,13 +203,16 @@ sub promote_standby
 	# the primary out-of-sync with the standby.
 	$node_standby->promote;
 
-	# Force a checkpoint after the promotion. pg_rewind looks at the control
-	# file to determine what timeline the server is on, and that isn't updated
-	# immediately at promotion, but only at the next checkpoint. When running
-	# pg_rewind in remote mode, it's possible that we complete the test steps
-	# after promotion so quickly that when pg_rewind runs, the standby has not
-	# performed a checkpoint after promotion yet.
-	standby_psql("checkpoint");
+	unless ($options{'skip_checkpoint'})
+	{
+		# Force a checkpoint after the promotion. pg_rewind looks at the control
+		# file to determine what timeline the server is on, and that isn't updated
+		# immediately at promotion, but only at the next checkpoint. When running
+		# pg_rewind in remote mode, it's possible that we complete the test steps
+		# after promotion so quickly that when pg_rewind runs, the standby has not
+		# performed a checkpoint after promotion yet.
+		standby_psql("checkpoint");
+	}
 
 	return;
 }
#19Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Noname (#18)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:

The issue here is pg_rewind looks into control file to determine the
soruce timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks are already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.

I think this change is a good one because if I want
pg_rewind to run automatically after a promotion,
I don't have to wait for the checkpoint to complete.

The attached patch is Horiguchi-san's patch with
additional tests. The tests are based on James's tests,
"010_no_checkpoint_after_promotion.pl" tests that
pg_rewind is successfully executed without running
checkpoint after promote.

I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I
didn't notice this thread earlier.

I didn't realize that we had a notice about this in the docs. I'll go
and remove that. Thanks!

- Heikki

#20Noname
kuroda.keisuke@nttcom.co.jp
In reply to: Heikki Linnakangas (#19)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

hi Heikki,

Thanks to mail, and thanks also for the commit(0a0500207a)
to fix the document.
I'm glad the problem was solved.

Best Regards,
Keisuke Kuroda
NTT COMWARE

2023-02-27 16:33 に Heikki Linnakangas さんは書きました:

Show quoted text

On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:

I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry
I didn't notice this thread earlier.

I didn't realize that we had a notice about this in the docs. I'll go
and remove that. Thanks!

- Heikki

#21James Coleman
jtc331@gmail.com
In reply to: Heikki Linnakangas (#19)
Re: pg_rewind: warn when checkpoint hasn't happened after promotion

On Mon, Feb 27, 2023 at 2:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 16/11/2022 07:17, kuroda.keisuke@nttcom.co.jp wrote:

The issue here is pg_rewind looks into control file to determine the
soruce timeline, because the control file is not updated until the
first checkpoint ends after promotion finishes, even though file
blocks are already diverged.

Even in that case history file for the new timeline is already
created, so searching for the latest history file works.

I think this change is a good one because if I want
pg_rewind to run automatically after a promotion,
I don't have to wait for the checkpoint to complete.

The attached patch is Horiguchi-san's patch with
additional tests. The tests are based on James's tests,
"010_no_checkpoint_after_promotion.pl" tests that
pg_rewind is successfully executed without running
checkpoint after promote.

I fixed this last week in commit 009eeee746, see thread [1]. I'm sorry I
didn't notice this thread earlier.

I didn't realize that we had a notice about this in the docs. I'll go
and remove that. Thanks!

- Heikki

Thanks; I think the missing [1] (for reference) is:
/messages/by-id/9f568c97-87fe-a716-bd39-65299b8a60f4@iki.fi

James