pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.
If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?
Attached patch fixes this issue by resetting the recovery pause state to
'not paused' when standby promotion is triggered.
Thought?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
reset_wal_pause_state_after_promotion_v1.patchtext/plain; charset=UTF-8; name=reset_wal_pause_state_after_promotion_v1.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f190f..441a9124cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12825,6 +12825,14 @@ SetPromoteIsTriggered(void)
XLogCtl->SharedPromoteIsTriggered = true;
SpinLockRelease(&XLogCtl->info_lck);
+ /*
+ * Mark the recovery pause state as 'not paused' because the paused state
+ * ends and promotion continues if a promotion is triggered while recovery
+ * is paused. Otherwise pg_get_wal_replay_pause_state() can mistakenly
+ * return 'paused' while a promotion is ongoing.
+ */
+ SetRecoveryPause(false);
+
LocalPromoteIsTriggered = true;
}
At Mon, 17 May 2021 23:29:18 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
If a promotion is triggered while recovery is paused, the paused state
ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?Attached patch fixes this issue by resetting the recovery pause state
to
'not paused' when standby promotion is triggered.Thought?
Nice catch!
Once the state enteres "paused" state no more WAL record is expected
to be replayed until exiting the state. I'm not sure but maybe we are
also expecting that the server promotes whthout a record replayed when
triggered while pausing. However, actually there's a chance for a
record to replayed before promotion. Of course it is existing
behavior but I'd like to make sure whether we deliberately allow that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/05/18 9:58, Kyotaro Horiguchi wrote:
At Mon, 17 May 2021 23:29:18 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
If a promotion is triggered while recovery is paused, the paused state
ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?Attached patch fixes this issue by resetting the recovery pause state
to
'not paused' when standby promotion is triggered.Thought?
Nice catch!
Once the state enteres "paused" state no more WAL record is expected
to be replayed until exiting the state. I'm not sure but maybe we are
also expecting that the server promotes whthout a record replayed when
triggered while pausing.
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, May 17, 2021 at 7:59 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?Attached patch fixes this issue by resetting the recovery pause state to
'not paused' when standby promotion is triggered.Thought?
I think, prior to commit 496ee647ecd2917369ffcf1eaa0b2cdca07c8730
(Prefer standby promotion over recovery pause.) this behavior was fine
because the pause was continued but after this commit now we are
giving preference to pause so this is a bug so need to be fixed.
The fix looks fine but I think along with this we should also return
immediately from the pause loop if promotion is requested. Because if
we recheck the recovery pause then someone can pause again and we will
be in loop so better to exit as soon as promotion is requested, see
attached patch. Should be applied along with your patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
Return_immediately_if_promote_requested.patchtext/x-patch; charset=US-ASCII; name=Return_immediately_if_promote_requested.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f1..1d81c23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6358,6 +6358,8 @@ RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue
minValue),
errhint("Restart the server after making the necessary configuration changes.")));
warned_for_promote = true;
+
+ return;
}
/*
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.
Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?
--
Michael
On 2021/05/18 14:53, Dilip Kumar wrote:
On Mon, May 17, 2021 at 7:59 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?Attached patch fixes this issue by resetting the recovery pause state to
'not paused' when standby promotion is triggered.Thought?
I think, prior to commit 496ee647ecd2917369ffcf1eaa0b2cdca07c8730
(Prefer standby promotion over recovery pause.) this behavior was fine
because the pause was continued but after this commit now we are
giving preference to pause so this is a bug so need to be fixed.The fix looks fine but I think along with this we should also return
immediately from the pause loop if promotion is requested. Because if
we recheck the recovery pause then someone can pause again and we will
be in loop so better to exit as soon as promotion is requested, see
attached patch. Should be applied along with your patch.
But this change can cause the recovery to continue with insufficient parameter
settings if a promotion is requested while the server is in the paused state
because of such invalid settings. This behavior seems not safe.
If this my understanding is right, the recovery should abort immediately
(i.e., FATAL error ""recovery aborted because of insufficient parameter settings"
should be thrown) if a promotion is requested in that case, like when
pg_wal_replay_resume() is executed in that case. Thought?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, May 18, 2021 at 1:43 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
The fix looks fine but I think along with this we should also return
immediately from the pause loop if promotion is requested. Because if
we recheck the recovery pause then someone can pause again and we will
be in loop so better to exit as soon as promotion is requested, see
attached patch. Should be applied along with your patch.But this change can cause the recovery to continue with insufficient parameter
settings if a promotion is requested while the server is in the paused state
because of such invalid settings. This behavior seems not safe.
If this my understanding is right, the recovery should abort immediately
(i.e., FATAL error ""recovery aborted because of insufficient parameter settings"
should be thrown) if a promotion is requested in that case, like when
pg_wal_replay_resume() is executed in that case. Thought?
Yeah, you are right, I missed that.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
At Tue, 18 May 2021 12:48:38 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.
Mmm. I was confused with recovery target + pause. Actually promotion
works as so and it is documented. Anyway it is a matter of the next
version.
I forgot to mention the patch itself, but what the patch does looks
fine to me. Disabling pause after setting SharedProteIsTriggered
prevents later re-pausing (from the sql function).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?
You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'
It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2021/05/19 9:53, Kyotaro Horiguchi wrote:
At Tue, 18 May 2021 12:48:38 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Mmm. I was confused with recovery target + pause. Actually promotion
works as so and it is documented. Anyway it is a matter of the next
version.I forgot to mention the patch itself, but what the patch does looks
fine to me. Disabling pause after setting SharedProteIsTriggered
prevents later re-pausing (from the sql function).
Thanks for the review! I pushed the patch.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Wed, May 19, 2021 at 01:46:45PM +0900, Fujii Masao wrote:
You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.
Couldn't you rely on recovery_end_command for number #4? The shared
memory state tracked by SharedRecoveryState is updated after the
end-recovery command is triggered, so pg_get_wal_replay_pause_state()
can be executed at this point. A bit hairy, I agree, but that would
work :)
Still, it would be easy enough to have something for
pg_get_wal_replay_pause_state() called on a standby when there is no
pause (your case #2) and a second case on a standby with a pause
triggered, though (not listed above).
--
Michael
On Wed, May 19, 2021 at 10:16 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.
Even for #2, we can not ensure that whether it will be 'paused' or
'pause requested'.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
At Wed, 19 May 2021 11:19:13 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
On Wed, May 19, 2021 at 10:16 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.Even for #2, we can not ensure that whether it will be 'paused' or
'pause requested'.
We often use poll_query_until() to make sure some desired state is
reached. And, as Michael suggested, the function
pg_get_wal_replay_pause_state() still works at the time of
recovery_end_command. So a bit more detailed steps are:
#0. Equip the server with recovery_end_command that waits for some
trigger then start the server.
#1. Pause the recovery
#2. Wait until pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Wait until pg_get_wal_replay_pause_state() returns 'not paused'
#5. Trigger recovery_end_command to let promotion proceed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/05/19 15:25, Kyotaro Horiguchi wrote:
At Wed, 19 May 2021 11:19:13 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
On Wed, May 19, 2021 at 10:16 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.Even for #2, we can not ensure that whether it will be 'paused' or
'pause requested'.We often use poll_query_until() to make sure some desired state is
reached.
Yes.
And, as Michael suggested, the function
pg_get_wal_replay_pause_state() still works at the time of
recovery_end_command. So a bit more detailed steps are:
IMO this idea is tricky and fragile, so I'm inclined to avoid that if possible.
Attached is the POC patch to add the following tests.
#1. Check that pg_get_wal_replay_pause_state() reports "not paused" at first.
#2. Request to pause archive recovery and wait until it's actually paused.
#3. Request to resume archive recovery and wait until it's actually resumed.
#4. Request to pause archive recovery and wait until it's actually paused.
Then, check that the paused state ends and promotion continues
if a promotion is triggered while recovery is paused.
In #4, pg_get_wal_replay_pause_state() is not executed while promotion
is ongoing. #4 checks that pg_is_in_recovery() returns false and
the promotion finishes expectedly in that case. Isn't this test enough for now?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
recovery_pause_test_v1.patchtext/plain; charset=UTF-8; name=recovery_pause_test_v1.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c675c0886c..8db7e47d13 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
use File::Copy;
# Initialize primary node, doing archives
@@ -75,3 +75,42 @@ ok( !-f "$node_standby2_data/pg_wal/RECOVERYHISTORY",
"RECOVERYHISTORY removed after promotion");
ok( !-f "$node_standby2_data/pg_wal/RECOVERYXLOG",
"RECOVERYXLOG removed after promotion");
+
+# Check that archive recovery can be paused or resumed expectedly.
+my $node_standby3 = get_new_node('standby3');
+$node_standby3->init_from_backup($node_primary, $backup_name,
+ has_restoring => 1);
+$node_standby3->start;
+
+# Archive recovery is not yet paused.
+is($node_standby3->safe_psql('postgres',
+ "SELECT pg_get_wal_replay_pause_state()"),
+ 'not paused', 'pg_get_wal_replay_pause_state() reports not paused');
+
+# Request to pause archive recovery and wait until it's actually paused.
+$node_standby3->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
+$node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(2001,2010))");
+$node_standby3->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'paused'")
+ or die "Timed out while waiting for archive recovery to be paused";
+
+# Request to resume archive recovery and wait until it's actually resumed.
+$node_standby3->safe_psql('postgres', "SELECT pg_wal_replay_resume()");
+$node_standby3->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'not paused'")
+ or die "Timed out while waiting for archive recovery to be resumed";
+
+# Check that the paused state ends and promotion continues if a promotion
+# is triggered while recovery is paused.
+$node_standby3->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
+$node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(2011,2020))");
+$node_standby3->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'paused'")
+ or die "Timed out while waiting for archive recovery to be paused";
+
+$node_standby3->promote;
+$node_standby3->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery()")
+ or die "Timed out while waiting for promotion to finish";
At Wed, 19 May 2021 16:21:58 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/05/19 15:25, Kyotaro Horiguchi wrote:
At Wed, 19 May 2021 11:19:13 +0530, Dilip Kumar
<dilipbalaut@gmail.com> wrote inOn Wed, May 19, 2021 at 10:16 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.Even for #2, we can not ensure that whether it will be 'paused' or
'pause requested'.We often use poll_query_until() to make sure some desired state is
reached.Yes.
And, as Michael suggested, the function
pg_get_wal_replay_pause_state() still works at the time of
recovery_end_command. So a bit more detailed steps are:IMO this idea is tricky and fragile, so I'm inclined to avoid that if
Agreed, the recovery_end_command would be something like the following
avoiding dependency on sh. However, I'm not sure it works as well on
Windows..
recovery_end_command='perl -e "while( -f \'$trigfile\') {sleep 0.1;}"'
possible.
Attached is the POC patch to add the following tests.#1. Check that pg_get_wal_replay_pause_state() reports "not paused" at
#first.
#2. Request to pause archive recovery and wait until it's actually
#paused.
#3. Request to resume archive recovery and wait until it's actually
#resumed.
#4. Request to pause archive recovery and wait until it's actually
#paused.
Then, check that the paused state ends and promotion continues
if a promotion is triggered while recovery is paused.In #4, pg_get_wal_replay_pause_state() is not executed while promotion
is ongoing. #4 checks that pg_is_in_recovery() returns false and
the promotion finishes expectedly in that case. Isn't this test enough
for now?
+1 for adding some tests for pg_wal_replay_pause() but the test seems
like checking only that pg_get_wal_replay_pause_state() returns the
expected state value. Don't we need to check that the recovery is
actually paused and that the promotion happens at expected LSN?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, May 19, 2021 at 11:55 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Wed, 19 May 2021 11:19:13 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
On Wed, May 19, 2021 at 10:16 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:On 2021/05/18 15:46, Michael Paquier wrote:
On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.Sounds like a "promotion immediate" mode. It does not sound difficult
nor expensive to add a small test for that in one of the existing
recovery tests triggerring a promotion. Could you add one based on
pg_get_wal_replay_pause_state()?You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'It seems not easy to do the test #4 stably because
pg_get_wal_replay_pause_state() needs to be executed
before the promotion finishes.Even for #2, we can not ensure that whether it will be 'paused' or
'pause requested'.We often use poll_query_until() to make sure some desired state is
reached. And, as Michael suggested, the function
pg_get_wal_replay_pause_state() still works at the time of
recovery_end_command. So a bit more detailed steps are:
Right, if we are polling for the state change in #2 then that makes sense.
#0. Equip the server with recovery_end_command that waits for some
trigger then start the server.
#1. Pause the recovery
#2. Wait until pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Wait until pg_get_wal_replay_pause_state() returns 'not paused'
#5. Trigger recovery_end_command to let promotion proceed.
+1
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2021/05/19 16:43, Kyotaro Horiguchi wrote:
+1 for adding some tests for pg_wal_replay_pause() but the test seems
like checking only that pg_get_wal_replay_pause_state() returns the
expected state value. Don't we need to check that the recovery is
actually paused and that the promotion happens at expected LSN?
Sounds good. Attached is the updated version of the patch.
I added such checks into the test.
BTW, while reading some recovery regression tests, I found that
013_crash_restart.pl has "use Time::HiRes qw(usleep)" but it seems
not necessary. We can safely remove that? Patch attached.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
recovery_pause_test_v2.patchtext/plain; charset=UTF-8; name=recovery_pause_test_v2.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 7f177afaed..496fa40fe1 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -1,13 +1,13 @@
# Copyright (c) 2021, PostgreSQL Global Development Group
-# Checks for recovery_min_apply_delay
+# Checks for recovery_min_apply_delay and recovery pause
use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 1;
+use Test::More tests => 3;
# Initialize primary node
my $node_primary = get_new_node('primary');
@@ -55,3 +55,58 @@ $node_standby->poll_query_until('postgres',
# the configured apply delay.
ok(time() - $primary_insert_time >= $delay,
"standby applies WAL only after replication delay");
+
+
+# Check that recovery can be paused or resumed expectedly.
+my $node_standby2 = get_new_node('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby2->start;
+
+# Recovery is not yet paused.
+is($node_standby2->safe_psql('postgres',
+ "SELECT pg_get_wal_replay_pause_state()"),
+ 'not paused', 'pg_get_wal_replay_pause_state() reports not paused');
+
+# Request to pause recovery and wait until it's actually paused.
+$node_standby2->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
+$node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(21,30))");
+$node_standby2->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'paused'")
+ or die "Timed out while waiting for recovery to be paused";
+
+# Even if new WAL records are streamed from the primary,
+# recovery in the paused state doesn't replay them.
+my $receive_lsn = $node_standby2->safe_psql('postgres',
+ "SELECT pg_last_wal_receive_lsn()");
+my $replay_lsn = $node_standby2->safe_psql('postgres',
+ "SELECT pg_last_wal_replay_lsn()");
+$node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(31,40))");
+$node_standby2->poll_query_until('postgres',
+ "SELECT '$receive_lsn'::pg_lsn < pg_last_wal_receive_lsn()")
+ or die "Timed out while waiting for new WAL to be streamed";
+is($node_standby2->safe_psql('postgres',
+ "SELECT pg_last_wal_replay_lsn()"),
+ qq($replay_lsn), 'no WAL is replayed in the paused state');
+
+# Request to resume recovery and wait until it's actually resumed.
+$node_standby2->safe_psql('postgres', "SELECT pg_wal_replay_resume()");
+$node_standby2->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'not paused' AND pg_last_wal_replay_lsn() > '$replay_lsn'::pg_lsn")
+ or die "Timed out while waiting for recovery to be resumed";
+
+# Check that the paused state ends and promotion continues if a promotion
+# is triggered while recovery is paused.
+$node_standby2->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
+$node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(41,50))");
+$node_standby2->poll_query_until('postgres',
+ "SELECT pg_get_wal_replay_pause_state() = 'paused'")
+ or die "Timed out while waiting for recovery to be paused";
+
+$node_standby2->promote;
+$node_standby2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery()")
+ or die "Timed out while waiting for promotion to finish";
remove_unnecessary_hires_usleep_v1.patchtext/plain; charset=UTF-8; name=remove_unnecessary_hires_usleep_v1.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 66e43ffbe8..e1c36abe97 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -17,7 +17,6 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
-use Time::HiRes qw(usleep);
plan tests => 18;
On 2021/05/19 19:24, Fujii Masao wrote:
On 2021/05/19 16:43, Kyotaro Horiguchi wrote:
+1 for adding some tests for pg_wal_replay_pause() but the test seems
like checking only that pg_get_wal_replay_pause_state() returns the
expected state value. Don't we need to check that the recovery is
actually paused and that the promotion happens at expected LSN?Sounds good. Attached is the updated version of the patch.
I added such checks into the test.BTW, while reading some recovery regression tests, I found that
013_crash_restart.pl has "use Time::HiRes qw(usleep)" but it seems
not necessary. We can safely remove that? Patch attached.
Barring any objections, I'm thinking to commit these two patches.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Sorry for missing this.
At Mon, 31 May 2021 12:52:54 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/05/19 19:24, Fujii Masao wrote:
On 2021/05/19 16:43, Kyotaro Horiguchi wrote:
+1 for adding some tests for pg_wal_replay_pause() but the test seems
like checking only that pg_get_wal_replay_pause_state() returns the
expected state value. Don't we need to check that the recovery is
actually paused and that the promotion happens at expected LSN?Sounds good. Attached is the updated version of the patch.
I added such checks into the test.
Thanks! Looks fine. The paused-state test may get false-success but it
would be sufficient that it detects the problem in most cases.
BTW, while reading some recovery regression tests, I found that
013_crash_restart.pl has "use Time::HiRes qw(usleep)" but it seems
not necessary. We can safely remove that? Patch attached.
Looks just fine for the removal of HiRes usage. All other use of
HiRes are accompanied by a usleep usage.
Barring any objections, I'm thinking to commit these two patches.
+1.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/05/31 17:18, Kyotaro Horiguchi wrote:
Sorry for missing this.
At Mon, 31 May 2021 12:52:54 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/05/19 19:24, Fujii Masao wrote:
On 2021/05/19 16:43, Kyotaro Horiguchi wrote:
+1 for adding some tests for pg_wal_replay_pause() but the test seems
like checking only that pg_get_wal_replay_pause_state() returns the
expected state value. Don't we need to check that the recovery is
actually paused and that the promotion happens at expected LSN?Sounds good. Attached is the updated version of the patch.
I added such checks into the test.Thanks! Looks fine. The paused-state test may get false-success but it
would be sufficient that it detects the problem in most cases.BTW, while reading some recovery regression tests, I found that
013_crash_restart.pl has "use Time::HiRes qw(usleep)" but it seems
not necessary. We can safely remove that? Patch attached.Looks just fine for the removal of HiRes usage. All other use of
HiRes are accompanied by a usleep usage.Barring any objections, I'm thinking to commit these two patches.
+1.
Thanks for the review! I pushed those two patches.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION