pg_rewind with cascade standby doesn't work well
Hi there,
I tested pg_rewind's behavior and found a suspicious case.
Consider a scenario like this,
Server A: primary
Server B: replica of A
Server C: replica of B
and somehow A goes down, so B gets promoted.
Server A: down
Server B: new primary
Server C: replica of B
In this case, pg_rewind can be used to reconstruct the cascade; the source
is C and the target is A.
However, running pg_rewind produces the error below:
```
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
```
Although A's timeline should be 1 and C's should be 2 at this point, pg_rewind reports that they are on the same timeline.
This is because pg_rewind currently uses minRecoveryPointTLI and the latest checkpoint's TimelineID to compare the TLIs of the source and target [1].
On C, neither minRecoveryPointTLI nor the latest checkpoint's TimelineID is updated until a checkpoint occurs (even though B's already have been).
So if you run pg_rewind immediately after the promotion, it won't work because C and A appear to be on the same timeline; you have to run CHECKPOINT on C before running pg_rewind.
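For reference, here is a rough sketch of how each cluster's timeline is effectively derived from its control file (this is not the actual pg_rewind source; the helper name is mine):
```c
#include "catalog/pg_control.h"

/*
 * Rough sketch: treat the cluster as being on the newer of the latest
 * checkpoint's TLI and minRecoveryPointTLI.  On C, neither field has
 * advanced past 1 until a checkpoint runs, so A and C both report TLI 1
 * and pg_rewind concludes that no rewind is needed.
 */
static TimeLineID
cluster_timeline(const ControlFileData *cf)
{
	TimeLineID	tli = cf->checkPointCopy.ThisTimeLineID;

	if (!XLogRecPtrIsInvalid(cf->minRecoveryPoint) &&
		cf->minRecoveryPointTLI > tli)
		tli = cf->minRecoveryPointTLI;

	return tli;
}
```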
BTW, running pg_rewind immediately with a cascade standby was already raised as a concern in another discussion [2], but was unfortunately missed.
Anyway, I don't think this behavior is desirable.
To fix this, should we use another variable to compare the TLIs?
Or should we modify the cascade standby's minRecoveryPointTLI somehow?
Masaki Kuwamura
[1]: /messages/by-id/9f568c97-87fe-a716-bd39-65299b8a60f4@iki.fi
[2]: /messages/by-id/aeb5f31a-8de2-40a8-64af-ab659a309d6b@iki.fi
To fix the above-mentioned behavior of pg_rewind, I suggest changing the cascade standby's (i.e. server C's) minRecoveryPointTLI when it receives the new timeline information from the new primary (i.e. server B).
When server B is promoted, it writes an end-of-recovery record by calling CreateEndOfRecoveryRecord() (in xlog.c), which also updates B's minRecoveryPoint and minRecoveryPointTLI:
```
	/*
	 * Update the control file so that crash recovery can follow the timeline
	 * changes to this point.
	 */
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	ControlFile->minRecoveryPoint = recptr;
	ControlFile->minRecoveryPointTLI = xlrec.ThisTimeLineID;
	UpdateControlFile();
	LWLockRelease(ControlFileLock);
```
Since C is a replica of B, the end-of-recovery record is replicated from B to C, and the record is replayed on C by xlog_redo().
The attached patch updates C's minRecoveryPoint and minRecoveryPointTLI at that point by mimicking CreateEndOfRecoveryRecord().
With this patch, you can run pg_rewind with a cascade standby immediately (without waiting for a checkpoint).
Thoughts?
Masaki Kuwamura
Attachments:
v1-0001-pg_rewind-Fix-bug-using-cascade-standby-as-source.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..a559f46ebb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7951,6 +7951,15 @@ xlog_redo(XLogReaderState *record)
ereport(PANIC,
(errmsg("unexpected timeline ID %u (should be %u) in end-of-recovery record",
xlrec.ThisTimeLineID, replayTLI)));
+ /*
+ * Update the control file so that crash recovery can follow the timeline
+ * changes to this point.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->minRecoveryPoint = lsn;
+ ControlFile->minRecoveryPointTLI = xlrec.ThisTimeLineID;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
}
else if (info == XLOG_NOOP)
{
Hi,
The attached patch updates minRecoveryPoint and minRecoveryPointTLI at this point by mimicking CreateEndOfRecoveryRecord().
With this patch, you can run pg_rewind with a cascade standby immediately (without waiting for a checkpoint).
Many thanks for submitting the patch. I added it to the nearest open
commitfest [1].
IMO a test is needed that makes sure no one is going to break this in
the future.
[1]: https://commitfest.postgresql.org/45/4559/
--
Best regards,
Aleksander Alekseev
On Mon, Sep 11, 2023 at 07:04:30PM +0300, Aleksander Alekseev wrote:
Many thanks for submitting the patch. I added it to the nearest open
commitfest [1].
IMO a test is needed that makes sure no one is going to break this in the future.
You definitely need more complex test scenarios for that. If you can
come up with new ways to make the TAP tests of pg_rewind more modular
in handling more complicated node setups, that would be a nice
addition, for example.
@@ -7951,6 +7951,15 @@ xlog_redo(XLogReaderState *record)
ereport(PANIC,
(errmsg("unexpected timeline ID %u (should be %u) in end-of-recovery record",
xlrec.ThisTimeLineID, replayTLI)));
+ /*
+ * Update the control file so that crash recovery can follow the timeline
+ * changes to this point.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->minRecoveryPoint = lsn;
+ ControlFile->minRecoveryPointTLI = xlrec.ThisTimeLineID;
This patch is at least incorrect in its handling of crash recovery,
because these two should *never* be set in this case as we want to
replay up to the end of WAL. For example, see xlog.c or the top of
xlogrecovery.c about the assumptions behind these variables:
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
ControlFile->minRecoveryPointTLI = 0;
If an end-of-recovery record is replayed during crash recovery, these
assumptions are plain broken.
One thing that we could consider is to be more aggressive with
restartpoints when replaying this record for a standby, see a few
lines above the lines added by your patch, for example. And we could
potentially emulate a post-promotion restart point to get a refresh of
the control file as it should, with the correct code paths involved in
the updates of minRecoveryPoint when the checkpointer does the job.
--
Michael
IMO a test is needed that makes sure no one is going to break this in the future.
You definitely need more complex test scenarios for that. If you can come up with new ways to make the TAP tests of pg_rewind more modular in handling more complicated node setups, that would be a nice addition, for example.
I'm sorry for the lack of tests. For now, I started off with a simple test that reproduces the problem I mentioned. The updated WIP patch 0001 includes the new test for pg_rewind.
And also, I'm afraid I'm not sure what kinds of tests I should write to cover this fix. Would you mind giving me some advice?
This patch is at least incorrect in its handling of crash recovery,
because these two should *never* be set in this case as we want to
replay up to the end of WAL. For example, see xlog.c or the top of
xlogrecovery.c about the assumptions behind these variables:
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
ControlFile->minRecoveryPointTLI = 0;
If an end-of-recovery record is replayed during crash recovery, these assumptions are plain broken.
That makes sense! I really appreciate your knowledgeable review.
One thing that we could consider is to be more aggressive with
restartpoints when replaying this record for a standby, see a few
lines above the lines added by your patch, for example. And we could
potentially emulate a post-promotion restart point to get a refresh of
the control file as it should, with the correct code paths involved in
the updates of minRecoveryPoint when the checkpointer does the job.
I'm not confident, but did you mean that we could trigger a restartpoint (i.e., call `RequestCheckpoint()`) instead of what my old patch did?
The updated patch 0001 reflects my understanding.
I also found a bug (maybe). If we call `CreateRestartPoint()` during crash recovery, an assertion fails in ComputeXidHorizons() in procarray.c.
It's orthogonal to the problem I already reported, so you can reproduce it on HEAD with this procedure:
1. Start primary and standby server
2. Modify checkpoint_timeout to 1h on standby
3. Insert 10^10 records and concurrently run CHECKPOINT every second on
primary
4. Do an immediate stop on both standby and primary at the end of the insert
5. Modify checkpoint_timeout to 30 on standby
6. Remove standby.signal on standby
7. Restart standby (it will start crash-recovery)
8. Assertion failure is raised shortly
I think this is because `TruncateSUBTRANS()` in `CreateRestartPoint()` is called even though `StartupSUBTRANS()` hasn't been called yet. In `StartupXLOG()`, we call `StartupSUBTRANS()` only if `(ArchiveRecoveryRequested && EnableHotStandby)`. However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if `(EnableHotStandby)`. I guess this difference causes the bug: the latter can be reached even during crash recovery, while the former isn't.
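To illustrate, the two conditions look roughly like this (paraphrased, not verbatim from xlog.c):
```c
/* In StartupXLOG(): SUBTRANS is initialized only for archive recovery */
if (ArchiveRecoveryRequested && EnableHotStandby)
	StartupSUBTRANS(oldestActiveXID);

/*
 * In CreateRestartPoint(): truncation only checks EnableHotStandby, so it
 * can run during crash recovery, before SUBTRANS has been initialized.
 */
if (EnableHotStandby)
	TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
```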
The attached patch 0002 fixes it. I think we could discuss this bug in another thread if needed.
Best regards!
Masaki Kuwamura
Attachments:
v2-0001-pg_rewind-Fix-bug-using-cascade-standby-as-source.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..083fe760f3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7936,10 +7936,9 @@ xlog_redo(XLogReaderState *record)
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_end_of_recovery));
/*
- * For Hot Standby, we could treat this like a Shutdown Checkpoint,
- * but this case is rarer and harder to test, so the benefit doesn't
- * outweigh the potential extra cost of maintenance.
+ * For Hot Standby, we could treat this like an end-of-recovery checkpoint
*/
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
/*
* We should've already switched to the new TLI before replaying this
diff --git a/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl b/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl
new file mode 100644
index 0000000000..bbd6fff536
--- /dev/null
+++ b/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl
@@ -0,0 +1,157 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+#
+# Test using a standby server that follows new primary as the source.
+#
+# This sets up three nodes: A, B and C. First, A is the primary,
+# B follows A, and C follows B:
+#
+# A (primary) <--- B (standby) <--- C (standby)
+#
+#
+# Then we promote B, and insert some divergent rows in A and B:
+#
+# A (primary) B (primary) <--- C (standby)
+#
+#
+# Finally, we run pg_rewind on A, to point it at C:
+#
+# B (primary) <--- C (standby) <--- A (standby)
+#
+# We were not able to rewind until a checkpoint happened in this scenario
+# due to the bug that a cascade standby (i.e. C) does not follow the new
+# primary's (i.e. B's) minRecoveryPoint and minRecoveryPointTLI.
+#
+# Since we're dealing with three nodes, we cannot use most of the
+# RewindTest functions as is.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use RewindTest;
+
+my $tmp_folder = PostgreSQL::Test::Utils::tempdir;
+
+my $node_a;
+my $node_b;
+my $node_c;
+
+# Set up node A, as primary
+#
+# A (primary)
+
+setup_cluster('a');
+start_primary();
+$node_a = $node_primary;
+
+# Create a test table and insert a row in primary.
+$node_a->safe_psql('postgres', "CREATE TABLE tbl1 (d text)");
+$node_a->safe_psql('postgres', "INSERT INTO tbl1 VALUES ('in A')");
+primary_psql("CHECKPOINT");
+
+# Set up node B and C, as cascaded standbys
+#
+# A (primary) <--- B (standby) <--- C (standby)
+$node_a->backup('my_backup');
+$node_b = PostgreSQL::Test::Cluster->new('node_b');
+$node_b->init_from_backup($node_a, 'my_backup', has_streaming => 1);
+$node_b->set_standby_mode();
+$node_b->start;
+
+$node_b->backup('my_backup');
+$node_c = PostgreSQL::Test::Cluster->new('node_c');
+$node_c->init_from_backup($node_b, 'my_backup', has_streaming => 1);
+$node_c->set_standby_mode();
+$node_c->start;
+
+# Insert additional data on A, and wait for both standbys to catch up.
+$node_a->safe_psql('postgres',
+ "INSERT INTO tbl1 values ('in A, before promotion')");
+$node_a->safe_psql('postgres', 'CHECKPOINT');
+
+my $lsn = $node_a->lsn('write');
+$node_a->wait_for_catchup('node_b', 'write', $lsn);
+$node_b->wait_for_catchup('node_c', 'write', $lsn);
+
+# Promote B
+#
+# A (primary) B (primary) <--- C (standby)
+
+$node_b->promote;
+
+# make sure end-of-recovery record is replicated to C before we continue
+$node_b->wait_for_catchup('node_c');
+
+# Insert a row in A. This causes A/B and C to have "diverged", so that it's
+# no longer possible to just apply the standby's logs over the primary directory
+# - you need to rewind.
+$node_a->safe_psql('postgres',
+ "INSERT INTO tbl1 VALUES ('rewind this')");
+
+#
+# All set up. We're ready to run pg_rewind.
+#
+my $node_a_pgdata = $node_a->data_dir;
+
+# Stop the old primary node and be ready to perform the rewind.
+$node_a->stop('fast');
+
+# Keep a temporary postgresql.conf or it would be overwritten during the rewind.
+copy(
+ "$node_a_pgdata/postgresql.conf",
+ "$tmp_folder/node_a-postgresql.conf.tmp");
+
+{
+ # Temporarily unset PGAPPNAME so that the server doesn't
+ # inherit it. Otherwise this could affect libpqwalreceiver
+ # connections in confusing ways.
+ local %ENV = %ENV;
+ delete $ENV{PGAPPNAME};
+
+ # Do rewind using a remote connection as source, generating
+ # recovery configuration automatically.
+ command_ok(
+ [
+ 'pg_rewind', "--debug",
+ "--source-server", $node_c->connstr('postgres'),
+ "--target-pgdata=$node_a_pgdata", "--no-sync",
+ ],
+ 'pg_rewind remote');
+}
+
+# Now move back postgresql.conf with old settings
+move(
+ "$tmp_folder/node_a-postgresql.conf.tmp",
+ "$node_a_pgdata/postgresql.conf");
+
+# Restart the node.
+$node_a->start;
+
+# set RewindTest::node_primary to point to the rewound node, so that we can
+# use check_query()
+$node_primary = $node_a;
+
+# Run some checks to verify that C has been successfully rewound,
+# and connected back to follow B.
+
+check_query(
+ 'SELECT * FROM tbl1',
+ qq(in A
+in A, before promotion
+),
+ 'table content after rewind');
+
+# clean up
+$node_a->teardown_node;
+$node_b->teardown_node;
+$node_c->teardown_node;
+
+done_testing();
+
v1-0002-Fix-restartpoint-during-crash-recovery.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..98dcfea4c9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7336,7 +7336,7 @@ CreateRestartPoint(int flags)
* in subtrans.c). When hot standby is disabled, though, we mustn't do
* this because StartupSUBTRANS hasn't been called yet.
*/
- if (EnableHotStandby)
+ if (ArchiveRecoveryRequested && EnableHotStandby)
TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
/* Real work is done; log and update stats. */
On Wed, Sep 20, 2023 at 11:46:45AM +0900, Kuwamura Masaki wrote:
I also found a bug (maybe). If we call `CreateRestartPoint()` during
crash-recovery, the assertion fails at ComputeXidHorizon() in procarray.c.
It's inherently orthogonal to the problem I already reported. So you can
reproduce this at HEAD with this procedure.
1. Start primary and standby server
2. Modify checkpoint_timeout to 1h on standby
3. Insert 10^10 records and concurrently run CHECKPOINT every second on
primary
4. Do an immediate stop on both standby and primary at the end of the insert
5. Modify checkpoint_timeout to 30 on standby
6. Remove standby.signal on standby
7. Restart standby (it will start crash-recovery)
8. Assertion failure is raised shortly
I think this is because `TruncateSUBTRANS()` in `CreateRestartPoint()` is
called but `StartupSUBTRANS()` isn't called yet. In `StartupXLOG()`, we
call
`StartupSUBTRANS()` if `(ArchiveRecoveryRequested && EnableHotStandby)`.
However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if
`(EnableHotStandby)`. I guess the difference causes this bug. The latter
possibly be called even crash-recovery while former isn't.
The attached patch 0002 fixes it. I think we could discuss this bug in another thread if needed.
This is a known issue. I guess it's the same as this thread and CF entry:
https://commitfest.postgresql.org/44/4244/
/messages/by-id/ZArVOMifjzE7f8W7@paquier.xyz
--
Michael
On 2023/09/20 12:04, Michael Paquier wrote:
This is a known issue. I guess that the same as this thread and this
CF entry:
https://commitfest.postgresql.org/44/4244/
/messages/by-id/ZArVOMifjzE7f8W7@paquier.xyz
I think this is a separate issue, and we should still use Kuwamura-san's patch
even after the one you posted on the thread gets accepted. BTW, I was able to
reproduce the assertion failure Kuwamura-san reported, even after applying
your latest patch from the thread.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi,
IMO a test is needed that makes sure no one is going to break this in the future.
You definitely need more complex test scenarios for that. If you can come up with new ways to make the TAP tests of pg_rewind more modular in handling more complicated node setups, that would be a nice addition, for example.
I'm sorry for the lack of tests. For now, I started off with a simple test that reproduces the problem I mentioned. The updated WIP patch 0001 includes the new test for pg_rewind.
Many thanks for a quick update.
And also, I'm afraid I'm not sure what kinds of tests I should write to cover this fix. Would you mind giving me some advice?
Personally I would prefer not to increase the scope of work. Your TAP
test added in 0001 seems to be adequate.
BTW, I was able to
reproduce the assertion failure Kuwamura-san reported, even after applying
your latest patch from the thread.
Do you mean that the test fails, or that it passes but there are other steps to reproduce the issue?
--
Best regards,
Aleksander Alekseev
On Tue, Sep 26, 2023 at 06:44:50PM +0300, Aleksander Alekseev wrote:
And also, I'm afraid I'm not sure what kinds of tests I should write to cover this fix. Would you mind giving me some advice?
Personally I would prefer not to increase the scope of work. Your TAP test added in 0001 seems to be adequate.
Yeah, agreed. I'm OK with what you are proposing, basically (the
test could be made a bit cheaper actually).
/*
- * For Hot Standby, we could treat this like a Shutdown Checkpoint,
- * but this case is rarer and harder to test, so the benefit doesn't
- * outweigh the potential extra cost of maintenance.
+ * For Hot Standby, we could treat this like an end-of-recovery checkpoint
*/
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
I don't understand what you want to change here. Archive recovery and
crash recovery are two different things, still this code would be
triggered even if there is no standby.signal, aka the node is not a
standby. Why isn't this stuff conditional?
BTW, I was able to reproduce the assertion failure Kuwamura-san reported, even after applying your latest patch from the thread.
Do you mean that the test fails, or that it passes but there are other steps to reproduce the issue?
I get it as Fujii-san testing the patch from [1], still failing the test from [2]:
[1]: /messages/by-id/ZArVOMifjzE7f8W7@paquier.xyz
[2]: /messages/by-id/CAMyC8qryE7mKyvPvGHCt5GpANAmp8sS_tRbraqXcPBx14viy6g@mail.gmail.com
I would be surprised, actually, because the patch from [1] would cause
step 7 of the test to fail: the patch causes standby.signal or
recovery.signal to be required. Anyway, this specific issue, if any,
had better be discussed on the other thread. I need to address a few
comments there as well and was planning to get back to it. It is
possible that I've missed something on the other thread with the
restrictions I was proposing in the latest version of the patch.
For this thread, let's focus on the pg_rewind case and how we want to
treat these records to improve the cascading case.
--
Michael
Thanks for your review!
On Wed, Sep 27, 2023 at 8:33 Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Sep 26, 2023 at 06:44:50PM +0300, Aleksander Alekseev wrote:
And also, I'm afraid I'm not sure what kinds of tests I should write to cover this fix. Would you mind giving me some advice?
Personally I would prefer not to increase the scope of work. Your TAP test added in 0001 seems to be adequate.
Yeah, agreed. I'm OK with what you are proposing, basically (the test could be made a bit cheaper actually).
I guess you meant that it contains an unnecessary insert and wait.
I fixed this and some incorrect comments caused by copy & paste.
Please see the attached patch.
 /*
- * For Hot Standby, we could treat this like a Shutdown Checkpoint,
- * but this case is rarer and harder to test, so the benefit doesn't
- * outweigh the potential extra cost of maintenance.
+ * For Hot Standby, we could treat this like an end-of-recovery checkpoint
  */
+	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
I don't understand what you want to change here. Archive recovery and crash recovery are two different things, still this code would be triggered even if there is no standby.signal, aka the node is not a standby. Why isn't this stuff conditional?
You are absolutely right. It should only run in standby mode.
Also, according to the documentation [1], a server can be in "Hot Standby" even if it is not in standby mode (i.e. when it is in archive recovery mode).
So I fixed the comment above `RequestCheckpoint()`.
[1]: https://www.postgresql.org/docs/current/hot-standby.html
I hope you will review it again.
Regards,
Masaki Kuwamura
Attachments:
v3-0001-pg_rewind-Fix-bug-using-cascade-standby-as-source.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..e431fbba31 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7936,10 +7936,11 @@ xlog_redo(XLogReaderState *record)
memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_end_of_recovery));
/*
- * For Hot Standby, we could treat this like a Shutdown Checkpoint,
- * but this case is rarer and harder to test, so the benefit doesn't
- * outweigh the potential extra cost of maintenance.
+ * In standby mode, we could treat this like an end-of-recovery
+ * checkpoint.
*/
+ if (StandbyMode)
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
/*
* We should've already switched to the new TLI before replaying this
diff --git a/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl b/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl
new file mode 100644
index 0000000000..3e5c8caab7
--- /dev/null
+++ b/src/bin/pg_rewind/t/010_standby_source_following_new_primary.pl
@@ -0,0 +1,145 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+#
+# Test using a standby server that follows new primary as the source.
+#
+# This sets up three nodes: A, B and C. First, A is the primary,
+# B follows A, and C follows B:
+#
+# A (primary) <--- B (standby) <--- C (standby)
+#
+#
+# Then we promote B, and insert some divergent rows in A and B:
+#
+# A (primary) B (primary) <--- C (standby)
+#
+#
+# Finally, we run pg_rewind on A, to point it at C:
+#
+# B (primary) <--- C (standby) <--- A (standby)
+#
+# We were not able to rewind until a checkpoint happened in this scenario
+# due to the bug that a cascade standby (i.e. C) does not follow the new
+# primary's (i.e. B's) minRecoveryPoint and minRecoveryPointTLI.
+#
+# Since we're dealing with three nodes, we cannot use most of the
+# RewindTest functions as is.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use RewindTest;
+
+my $tmp_folder = PostgreSQL::Test::Utils::tempdir;
+
+my $node_a;
+my $node_b;
+my $node_c;
+
+# Set up node A, as primary
+#
+# A (primary)
+
+setup_cluster('a');
+start_primary();
+$node_a = $node_primary;
+
+# Create a test table and insert a row in primary.
+$node_a->safe_psql('postgres', "CREATE TABLE tbl1 (d text)");
+$node_a->safe_psql('postgres', "INSERT INTO tbl1 VALUES ('before promotion')");
+primary_psql("CHECKPOINT");
+
+# Set up node B and C, as cascaded standbys
+#
+# A (primary) <--- B (standby) <--- C (standby)
+$node_a->backup('my_backup');
+$node_b = PostgreSQL::Test::Cluster->new('node_b');
+$node_b->init_from_backup($node_a, 'my_backup', has_streaming => 1);
+$node_b->set_standby_mode();
+$node_b->start;
+
+$node_b->backup('my_backup');
+$node_c = PostgreSQL::Test::Cluster->new('node_c');
+$node_c->init_from_backup($node_b, 'my_backup', has_streaming => 1);
+$node_c->set_standby_mode();
+$node_c->start;
+
+# Promote B
+#
+# A (primary) B (primary) <--- C (standby)
+
+$node_b->promote;
+
+# make sure end-of-recovery record is replicated to C before we continue
+$node_b->wait_for_catchup('node_c');
+
+# Insert a row in A. This causes A and B/C to have "diverged", so that it's
+# no longer possible to just apply the standby's logs over the primary directory
+# - you need to rewind.
+$node_a->safe_psql('postgres',
+ "INSERT INTO tbl1 VALUES ('rewind this')");
+
+#
+# All set up. We're ready to run pg_rewind.
+#
+my $node_a_pgdata = $node_a->data_dir;
+
+# Stop the old primary node and be ready to perform the rewind.
+$node_a->stop('fast');
+
+# Keep a temporary postgresql.conf or it would be overwritten during the rewind.
+copy(
+ "$node_a_pgdata/postgresql.conf",
+ "$tmp_folder/node_a-postgresql.conf.tmp");
+
+{
+ # Temporarily unset PGAPPNAME so that the server doesn't
+ # inherit it. Otherwise this could affect libpqwalreceiver
+ # connections in confusing ways.
+ local %ENV = %ENV;
+ delete $ENV{PGAPPNAME};
+
+ # Do rewind using a remote connection as source.
+ command_ok(
+ [
+ 'pg_rewind', "--debug",
+ "--source-server", $node_c->connstr('postgres'),
+ "--target-pgdata=$node_a_pgdata", "--no-sync",
+ ],
+ 'pg_rewind remote');
+}
+
+# Now move back postgresql.conf with old settings
+move(
+ "$tmp_folder/node_a-postgresql.conf.tmp",
+ "$node_a_pgdata/postgresql.conf");
+
+# Restart the node.
+$node_a->start;
+
+# set RewindTest::node_primary to point to the rewound node, so that we can
+# use check_query()
+$node_primary = $node_a;
+
+# Verify that A has been successfully rewound.
+
+check_query(
+ 'SELECT * FROM tbl1',
+ qq(before promotion
+),
+ 'table content after rewind');
+
+# clean up
+$node_a->teardown_node;
+$node_b->teardown_node;
+$node_c->teardown_node;
+
+done_testing();
+
Hi,
Thank you for addressing this issue!
The patch needs to be rebased as it doesn’t apply on master anymore, but here are some thoughts on the patch in general without testing:
1. Regarding the approach of forcing a restartpoint on every end-of-recovery record, I wonder if it has any performance implications, since WAL replay will now wait for the restartpoint to finish as opposed to it happening in the background.
2. This change of behaviour should be documented in [1]; there's a paragraph about restartpoints.
3. It looks like some pg_rewind code accommodating the "restartpoint < last common checkpoint" situation could be cleaned up as well. I found this at pg_rewind.c:669 on efcbb76efe, but maybe there's more:
if (ControlFile_source.checkPointCopy.redo < chkptredo) …
There’s also a less invasive option to fix this problem by detecting this situation from pg_rewind and simply calling checkpoint on the standby that I think is worth exploring.
Regards,
Ilya
[1]: https://www.postgresql.org/docs/devel/wal-configuration.html