add retry mechanism for achieving recovery target before emitting FATAL error "recovery ended before configured recovery target was reached"
Hi,
The FATAL error "recovery ended before configured recovery target was
reached", introduced by commit dc788668 [1] in PG 14, is causing the
standby to go down after having spent a good amount of time in
recovery. There can be cases where the arrival of the required WAL
(for reaching the recovery target) from the archive location to the
standby takes time, and the standby failing with the FATAL error in
the meantime isn't good.
Instead, how about we make the standby wait for a certain amount of
time (controlled by a GUC) so that it can keep looking for the
required WAL? If it gets the required WAL during the wait time, then
it succeeds in reaching the recovery target (no FATAL error, of
course). If it doesn't, the timeout occurs and the standby fails with
the FATAL error. The value of the new GUC can probably be set to the
average time it takes for the WAL to travel from the primary to the
archive location plus from the archive location to the standby; the
default is 0, i.e. disabled.
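For illustration, with the proposed GUC a PITR configuration could
look like this (the archive path, target time, and timeout value are
made-up examples, not recommendations):

    # postgresql.conf (illustrative values only)
    restore_command = 'cp /mnt/archive/%f "%p"'
    recovery_target_time = '2021-10-20 12:00:00+05:30'
    # keep retrying for up to 10 minutes before failing with the FATAL error
    recovery_target_retry_timeout = 600s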
I'm attaching a WIP patch. I've tested it on my dev system and the
recovery regression tests are passing with it. I will provide a better
version later, probably with a test case.
Thoughts?
[1]: commit dc788668bb269b10a108e87d14fefd1b9301b793
Author: Peter Eisentraut <peter@eisentraut.org>
Date: Wed Jan 29 15:43:32 2020 +0100
Fail if recovery target is not reached
Before, if a recovery target is configured, but the archive ended
before the target was reached, recovery would end and the server would
promote without further notice. That was deemed to be pretty wrong.
With this change, if the recovery target is not reached, it is a fatal
error.
Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: /messages/by-id/993736dd3f1713ec1f63fc3b653839f5@lako.no
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-add-retry-mechanism-with-a-GUC-before-failing-the.patch
From b9e897487cf5a03b4bd0c53e4050f07fea96181a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <brupireddy@microsoft.com>
Date: Thu, 14 Oct 2021 10:53:40 +0000
Subject: [PATCH v1] add retry mechanism with a GUC before failing the standby
with recovery ended before configured recovery fatal error
---
src/backend/access/transam/xlog.c | 73 +++++++++++++++++++++++++++++++
src/backend/utils/misc/guc.c | 12 +++++
src/include/access/xlog.h | 1 +
3 files changed, 86 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ffc9fa5bf1..9b62a68870 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -108,6 +108,7 @@ int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
int max_slot_wal_keep_size_mb = -1;
+int recovery_target_retry_timeout = 30;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -968,6 +969,7 @@ static void WALInsertLockAcquire(void);
static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
+static bool WaitForRecoveryTarget(void);
/*
* Insert an XLOG record represented by an already-constructed chain of data
@@ -7385,6 +7387,34 @@ StartupXLOG(void)
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogreader, LOG, false);
+
+ if (ArchiveRecoveryRequested &&
+ recoveryTarget != RECOVERY_TARGET_UNSET &&
+ record == NULL && !reachedRecoveryTarget)
+ {
+ bool retry_timeout;
+
+ while (record == NULL)
+ {
+ if (recovery_target_retry_timeout > 0)
+ {
+ retry_timeout = WaitForRecoveryTarget();
+
+ /* try to fetch the next WAL record */
+ record = ReadRecord(xlogreader, LOG, false);
+
+ /* yay! we've got a record */
+ if (record)
+ break;
+
+ /* we've waited enough, no luck, so let's exit now. */
+ if (retry_timeout)
+ break;
+ }
+ else
+ break;
+ }
+ }
} while (record != NULL);
/*
@@ -7957,6 +7987,49 @@ StartupXLOG(void)
RequestCheckpoint(CHECKPOINT_FORCE);
}
+static bool
+WaitForRecoveryTarget(void)
+{
+ static long total_wait_time_msec = 0;
+ long wait_time_msec = 1000;
+
+ /* calculate total wait time in first call */
+ if (total_wait_time_msec == 0)
+ total_wait_time_msec = recovery_target_retry_timeout * 1000;
+
+ /* quick exit without arming the latch if it's already past time */
+ if (total_wait_time_msec <= 0)
+ {
+ elog(LOG, "timeout occurred while waiting for recovery target, waited %d seconds",
+ recovery_target_retry_timeout);
+ return true;
+ }
+
+ ResetLatch(&XLogCtl->recoveryWakeupLatch);
+
+ /* might change the trigger file's location */
+ HandleStartupProcInterrupts();
+
+ elog(LOG, "waiting for %ld milliseconds before recovery target retry", wait_time_msec);
+
+ (void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ wait_time_msec,
+ WAIT_EVENT_RECOVERY_APPLY_DELAY);
+
+ total_wait_time_msec -= wait_time_msec;
+
+ /* exit if waited enough */
+ if (total_wait_time_msec <= 0)
+ {
+ elog(LOG, "timeout occurred while waiting for recovery target, waited %d seconds",
+ recovery_target_retry_timeout);
+ return true;
+ }
+
+ return false;
+}
+
/*
* Checks if recovery has reached a consistent state. When consistency is
* reached and we have a valid starting standby snapshot, tell postmaster
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ce58018635..f67f454ed2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3110,6 +3110,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"recovery_target_retry_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time to wait before retrying to reach recovery target "
+ "after a failed attempt."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &recovery_target_retry_timeout,
+ 30, 10, 86400,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1570860f95..96db8f7f48 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int recovery_min_apply_delay;
extern char *PrimaryConnInfo;
extern char *PrimarySlotName;
extern bool wal_receiver_create_temp_slot;
+extern int recovery_target_retry_timeout;
/* indirectly set via GUC system */
extern TransactionId recoveryTargetXid;
--
2.25.1
On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> The FATAL error "recovery ended before configured recovery target
> was reached", introduced by commit dc788668 [1] in PG 14, is causing
> the standby to go down after having spent a good amount of time in
> recovery. There can be cases where the arrival of the required WAL
> (for reaching the recovery target) from the archive location to the
> standby takes time, and the standby failing with the FATAL error in
> the meantime isn't good.
> Instead, how about we make the standby wait for a certain amount of
> time (controlled by a GUC) so that it can keep looking for the
> required WAL?
How is archiving configured, and would it be possible to introduce
logic into the restore_command to handle slow-to-arrive WAL?
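For illustration, a wrapper along these lines could absorb a
slow-to-fill archive (the script name, archive path, and timings are
all hypothetical; this is a sketch, not a tested implementation):

    #!/bin/sh
    # restore_wait.sh -- retry fetching a WAL file from a slow archive.
    # Configure with: restore_command = '/usr/local/bin/restore_wait.sh %f "%p"'
    # $1 = requested WAL file name (%f), $2 = destination path (%p)
    tries=0
    while [ "$tries" -lt 60 ]; do
        if cp "/mnt/archive/$1" "$2" 2>/dev/null; then
            exit 0              # file arrived; recovery continues
        fi
        tries=$((tries + 1))
        sleep 10                # give the archiver time to catch up
    done
    exit 1                      # still missing after ~10 minutes; give up

A real script would also want to fail fast for timeline .history
files, which recovery requests speculatively and which are usually
expected to be absent.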
Regards,
Jeff Davis
On Fri, Oct 22, 2021 at 5:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> > The FATAL error "recovery ended before configured recovery target
> > was reached", introduced by commit dc788668 [1] in PG 14, is
> > causing the standby to go down after having spent a good amount of
> > time in recovery. There can be cases where the arrival of the
> > required WAL (for reaching the recovery target) from the archive
> > location to the standby takes time, and the standby failing with
> > the FATAL error in the meantime isn't good.
> > Instead, how about we make the standby wait for a certain amount
> > of time (controlled by a GUC) so that it can keep looking for the
> > required WAL?
>
> How is archiving configured, and would it be possible to introduce
> logic into the restore_command to handle slow-to-arrive WAL?
Thanks Jeff!
If the suggestion is to have the wait-and-retry logic embedded in the
user-written restore_command, IMHO, it's not a good idea, as the
restore_command is external to core PG while the FATAL error
"recovery ended before configured recovery target was reached" is an
internal thing. Having the retry logic (controlled with a GUC) within
the core, for when the startup process reaches the end of recovery
before the target, is a better way, and it is something core PG can
offer. With this, the amount of work the standby has spent in recovery
isn't wasted if the GUC is enabled with the right value. The optimal
value someone can set is the average time it takes for the WAL to
travel from the primary to the archive location plus from the archive
location to the standby. By default the new GUC is disabled with a
value of 0, so that whoever wants the behavior can enable it.
Regards,
Bharath Rupireddy.
On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to have the wait-and-retry logic embedded in
> the user-written restore_command, IMHO, it's not a good idea, as the
> restore_command is external to core PG while the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing.
It seems likely that you'd want to tweak the exact behavior for the
given system. For instance, if the files are making some progress, and
you can estimate that in 2 more minutes everything will be fine, then
you may be more willing to wait those two minutes. But if no progress
has happened since recovery began 15 minutes ago, you may want to fail
immediately.
All of this nuance would be better captured in a specialized script
than a generic timeout in the server code.
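As a sketch of that kind of nuance (script name, paths, and thresholds
are all invented), a restore_command script could keep waiting only
while the archive is visibly growing:

    #!/bin/sh
    # restore_progress.sh -- retry only while the archive keeps growing.
    # $1 = requested WAL file name (%f), $2 = destination path (%p)
    for attempt in 1 2 3 4 5; do
        if [ -f "/mnt/archive/$1" ]; then
            exec cp "/mnt/archive/$1" "$2"   # file arrived; hand it over
        fi
        before=$(ls /mnt/archive | wc -l)
        sleep 60
        after=$(ls /mnt/archive | wc -l)
        # no new files in the last minute: assume no progress, fail fast
        [ "$after" -le "$before" ] && exit 1
    done
    exit 1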
What do you want to do after the timeout happens? If you want to issue
a WARNING instead of failing outright, perhaps that makes sense for
exploratory PITR cases. That could be a simple boolean GUC without
needing to introduce the timeout logic into the server.
I think it's an interesting point that it can be hard to choose a
reasonable recovery target if the system is completely down. We could
use some better tooling or metadata around the lsns, xids or timestamp
ranges available in a pg_wal directory or an archive. Even better would
be to see the available named restore points. This would make it easier
to calculate how long recovery might take for a given restore point, or
whether it's not going to work at all because there's not enough WAL.
Regards,
Jeff Davis
On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> > If the suggestion is to have the wait-and-retry logic embedded in
> > the user-written restore_command, IMHO, it's not a good idea, as
> > the restore_command is external to core PG while the FATAL error
> > "recovery ended before configured recovery target was reached" is
> > an internal thing.
>
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.
If you are suggesting giving the user more control over what should
happen to the standby after the timeout, then the two new GUCs
recovery_target_retry_timeout (int) and
recovery_target_continue_after_timeout (bool) will really help users
choose what they want. I'm not sure if it is okay to have two new
GUCs. Let's hear from other hackers what they think about this.
> I think it's an interesting point that it can be hard to choose a
> reasonable recovery target if the system is completely down. We could
> use some better tooling or metadata around the lsns, xids or timestamp
> ranges available in a pg_wal directory or an archive. Even better
> would be to see the available named restore points. This would make
> it easier to calculate how long recovery might take for a given
> restore point, or whether it's not going to work at all because
> there's not enough WAL.
I think pg_waldump can help here to do some exploratory analysis of
the available WAL in the directory where the WAL files are present.
Since it is an independent C program, it can run even when the server
is down, and it can also be run on the archive location.
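For example, something like this (the archive path and segment names
are made up) lists the transaction commit records, with their
timestamps, present in an archive; the last COMMIT timestamp is the
latest recovery_target_time the available WAL can reach:

    pg_waldump -p /mnt/archive -r Transaction \
        000000010000000000000010 00000001000000000000001F | tail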
Regards,
Bharath Rupireddy.
On Sat, 2021-10-23 at 09:31 +0530, Bharath Rupireddy wrote:
> If you are suggesting ...
Your complaint seems to be coming from commit dc788668, so the most
direct answer would be to make that configurable to the old behavior,
not to invent a new timeout behavior.
If I understand correctly, you are doing PITR from an archive, right?
So would restore_command be a reasonable place for the timeout?
And can you provide some approximate numbers to help me understand
where the timeout would be helpful? E.g. you have W GB of WAL to
replay, and restore would take X minutes, but some WAL is missing so
you fail after X-Y minutes, but if you had timeout Z everything would
be great.
> I think pg_waldump can help here to do some exploratory analysis of
> the available WAL in the directory where the WAL files are present.
> Since it is an independent C program, it can run even when the server
> is down, and it can also be run on the archive location.
Right, it's possible to do, but I think there's room for improvement so
we don't have to always scan the WAL. I'm getting a bit off-topic from
your proposal though. I'll bring it up in another thread when my
thoughts on this are more clear.
Regards,
Jeff Davis
At Wed, 20 Oct 2021 21:35:44 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> Hi,
> The FATAL error "recovery ended before configured recovery target
> was reached", introduced by commit dc788668 [1] in PG 14, is causing
> the standby to go down after having spent a good amount of time in
> recovery. There can be cases where the arrival of the required WAL
> (for reaching the recovery target) from the archive location to the
> standby takes time, and the standby failing with the FATAL error in
> the meantime isn't good.
> Instead, how about we make the standby wait for a certain amount of
> time (controlled by a GUC) so that it can keep looking for the
> required WAL? If it gets the required WAL during the wait time, then
> it succeeds in reaching the recovery target (no FATAL error, of
> course). If it doesn't, the timeout occurs and the standby fails with
> the FATAL error. The value of the new GUC can probably be set to the
> average time it takes for the WAL to travel from the primary to the
> archive location plus from the archive location to the standby; the
> default is 0, i.e. disabled.
> I'm attaching a WIP patch. I've tested it on my dev system and the
> recovery regression tests are passing with it. I will provide a
> better version later, probably with a test case.
> Thoughts?
It looks like that is starting a server in non-hot standby mode, only
fetching from the archive. The only difference is that it doesn't have
a timeout. Doesn't that configuration meet your requirements?
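For instance (paths and the target time here are only illustrative):

    # postgresql.conf
    restore_command = 'cp /mnt/archive/%f "%p"'
    recovery_target_time = '2021-10-20 12:00:00+05:30'
    recovery_target_action = 'promote'
    hot_standby = off

plus an empty standby.signal file in the data directory. In standby
mode the startup process keeps retrying the archive every
wal_retrieve_retry_interval instead of ending recovery, so there is no
FATAL exit while WAL is still trickling in.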
Or, if the timeout matters, I agree with Jeff: retrying in
restore_command looks fine.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.
Thanks Jeff. I posted a patch in a separate thread [1] for a new GUC
(WARN + promotion, or shutdown with the FATAL error) in case the
recovery target isn't reached.
[1]: /messages/by-id/CALj2ACWR4iaph7AWCr5-V9dXqpf2p5B=3fTyvLfL8VD_E+x0tA@mail.gmail.com.
Regards,
Bharath Rupireddy.