[PATCH] Add recovery_min_apply_delay_reconnect recovery option
This is a patch I am using in production with the following parameters
in recovery.conf:
recovery_min_apply_delay = '1d'
recovery_min_apply_delay_reconnect = '10 min'
In our environment we expect that standby servers with an apply delay
provide some protection against mistakes by the DBA (myself), and that
they contain a valid copy of the data that can be used in the event that
the master dies.
Does this feature seem applicable to a wider community?
== delay-reconnect-param ==
Add recovery_min_apply_delay_reconnect recovery option
'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.
Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
Is it possible to verify the archive on disk independently of
application? Adding a second delay parameter provides a workaround for
some use cases without complecting xlog.c.
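As a rough illustration of how the two settings are meant to combine, here is a standalone sketch. It is not the code in xlog.c: the helper names effective_delay_ms and remaining_wait_ms are made up for this example, and timestamps are plain millisecond counters rather than TimestampTz values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the delay that applies right now.
 * While streaming, the full apply delay is used; once streaming is
 * interrupted, the shorter reconnect delay takes over.
 */
static int64_t
effective_delay_ms(bool streaming, int64_t apply_delay_ms,
                   int64_t reconnect_delay_ms)
{
    /* The patch rejects a reconnect delay larger than the apply delay. */
    assert(reconnect_delay_ms >= 0 && reconnect_delay_ms <= apply_delay_ms);
    return streaming ? apply_delay_ms : reconnect_delay_ms;
}

/*
 * Hypothetical helper: how much longer a record must be held back,
 * given its commit time, the current time, and the chosen delay.
 */
static int64_t
remaining_wait_ms(int64_t commit_time_ms, int64_t now_ms, int64_t delay_ms)
{
    int64_t until = commit_time_ms + delay_ms;

    return until > now_ms ? until - now_ms : 0;
}
```

With '1d' (86400000 ms) and '10 min' (600000 ms), a record committed 15 minutes ago still has 85500000 ms to wait while streaming, but becomes eligible for replay immediately once streaming is interrupted.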
doc/src/sgml/recovery-config.sgml | 24 ++++++++++++++++++++++++
src/backend/access/transam/xlog.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++-------------
src/test/recovery/t/005_replay_delay.pl | 8 ++++++--
3 files changed, 76 insertions(+), 15 deletions(-)
Attachments:
delay-reconnect-param.patch (text/plain; charset=us-ascii)
commit b8807b43c6a44c0d85a6a86c13b48b47f56ea45f
Author: Eric Radman <ericshane@eradman.com>
Date: Mon Oct 16 10:07:55 2017 -0400
Add recovery_min_apply_delay_reconnect recovery option
'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.
Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
Is it possible to verify the archive on disk independently of
application? Adding a second delay parameter provides a workaround for
some use cases without complecting xlog.c.
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0a5d086248..4f8823ee50 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -502,6 +502,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="recovery-min-apply-delay-reconnect" xreflabel="recovery_min_apply_delay_reconnect">
+ <term><varname>recovery_min_apply_delay_reconnect</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>recovery_min_apply_delay_reconnect</> recovery parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If streaming replication is interrupted while
+ <varname>recovery_min_apply_delay</varname> is set, WAL records will be
+ replayed from the archive. After all records have been processed from
+ local disk, <productname>PostgreSQL</> will attempt to resume streaming
+ and connect to the master.
+ </para>
+ <para>
+ This parameter is used to compromise the fixed apply delay in order to
+ reestablish streaming. In this way a standby server can be run in fair
+ conditions with a long delay (hours or days) while specifying
+ the maximum delay that can be expected before the WAL archive is brought
+ back up to date with the master after a network failure.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..36a4779f70 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -267,6 +267,7 @@ static TimestampTz recoveryTargetTime;
static char *recoveryTargetName;
static XLogRecPtr recoveryTargetLSN;
static int recovery_min_apply_delay = 0;
+static int recovery_min_apply_delay_reconnect = 0;
static TimestampTz recoveryDelayUntilTime;
/* options taken from recovery.conf for XLOG streaming */
@@ -5227,6 +5228,7 @@ readRecoveryCommandFile(void)
*head = NULL,
*tail = NULL;
bool recoveryTargetActionSet = false;
+ const char *hintmsg;
fd = AllocateFile(RECOVERY_COMMAND_FILE, "r");
@@ -5452,8 +5454,6 @@ readRecoveryCommandFile(void)
}
else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
{
- const char *hintmsg;
-
if (!parse_int(item->value, &recovery_min_apply_delay, GUC_UNIT_MS,
&hintmsg))
ereport(ERROR,
@@ -5463,6 +5463,25 @@ readRecoveryCommandFile(void)
hintmsg ? errhint("%s", _(hintmsg)) : 0));
ereport(DEBUG2,
(errmsg_internal("recovery_min_apply_delay = '%s'", item->value)));
+ recovery_min_apply_delay_reconnect = recovery_min_apply_delay;
+ }
+ else if (strcmp(item->name, "recovery_min_apply_delay_reconnect") == 0)
+ {
+ if (!parse_int(item->value, &recovery_min_apply_delay_reconnect, GUC_UNIT_MS,
+ &hintmsg))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("parameter \"%s\" requires a temporal value",
+ "recovery_min_apply_delay_reconnect"),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ if (recovery_min_apply_delay_reconnect > recovery_min_apply_delay)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("\"%s\" must be <= \"%s\"",
+ "recovery_min_apply_delay_reconnect",
+ "recovery_min_apply_delay")));
+ ereport(DEBUG2,
+ (errmsg_internal("recovery_min_apply_delay_reconnect = '%s'", item->value)));
}
else
ereport(FATAL,
@@ -6080,20 +6099,25 @@ recoveryApplyDelay(XLogReaderState *record)
if (!getRecordTimestamp(record, &xtime))
return false;
- recoveryDelayUntilTime =
- TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
- /*
- * Exit without arming the latch if it's already past time to apply this
- * record
- */
- TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
- &secs, &microsecs);
- if (secs <= 0 && microsecs <= 0)
- return false;
while (true)
{
+ if (WalRcvStreaming())
+ recoveryDelayUntilTime =
+ TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+ else
+ recoveryDelayUntilTime =
+ TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay_reconnect);
+
+ TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
+ &secs, &microsecs);
+ /*
+ * Exit without arming the latch if it's already past time to apply this
+ * record
+ */
+ if (secs <= 0 && microsecs <= 0)
+ return false;
+
ResetLatch(&XLogCtl->recoveryWakeupLatch);
/* might change the trigger file's location */
@@ -6116,6 +6140,15 @@ recoveryApplyDelay(XLogReaderState *record)
elog(DEBUG2, "recovery apply delay %ld seconds, %d milliseconds",
secs, microsecs / 1000);
+ /*
+ * Loop every 10 seconds so that an alternate delay can be calculated if
+ * the WalReceiver is shut down
+ */
+ if (secs > 10) {
+ secs = 10;
+ microsecs = 0;
+ }
+
WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
secs * 1000L + microsecs / 1000,
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 8909c4548b..36b94817f0 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -20,13 +20,17 @@ my $backup_name = 'my_backup';
$node_master->backup($backup_name);
# Create streaming standby from backup
-my $node_standby = get_new_node('standby');
-my $delay = 3;
+# Set recovery_min_apply_delay_reconnect to verify that in normal conditions it
+# does not interfere with recovery_min_apply_delay
+my $node_standby = get_new_node('standby');
+my $delay = 3;
+my $delay_reconnect = 1;
$node_standby->init_from_backup($node_master, $backup_name,
has_streaming => 1);
$node_standby->append_conf(
'recovery.conf', qq(
recovery_min_apply_delay = '${delay}s'
+recovery_min_apply_delay_reconnect = '${delay_reconnect}s'
));
$node_standby->start;
On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman <ericshane@eradman.com> wrote:
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
Taking a step back here... recoveryApplyDelay() uses
XLogCtl->recoveryWakeupLatch which gets set if the WAL receiver has
received new WAL, or if the WAL receiver shuts down properly. So if
the WAL receiver goes down for whatever reason during the loop of
recoveryApplyDelay(), the startup process waits for a record to be
applied maybe for a long time, and as there is no WAL receiver we
actually don't receive any new WAL records. New WAL records would be
received only once WaitForWALToBecomeAvailable() is called, which
happens only once the apply delay has elapsed. If the postmaster dies,
then HandleStartupProcInterrupts() would take care of shutting down
the startup process immediately, which is cool.
I see what you are trying to achieve and that seems worth it. It is
indeed a waste to not have a WAL receiver online while waiting for a
delay to be applied. If there is a flaky network between the primary
and a standby, you may end up with a standby way behind its primary,
and that could penalize a primary clean shutdown as the primary waits
for the shutdown checkpoint record to be flushed on the standby.
I think that your way to deal with the problem is messy though. If you
think about it, no parameters are actually needed. What you should try
to achieve is to make recoveryApplyDelay() smarter: forcibly stop the
wait if you detect a failure, get out of the redo routine, and then
force the record to be read again. This way, the startup process would
try to start a new WAL receiver if it thinks that the source it should
read WAL from is a stream. That may turn out to be a more complicated
patch than you think though.
Your patch also actually breaks the use case of standbys doing
recovery using only archives and no streaming. In this case
WalRcvStreaming returns false, and recovery_min_apply_delay_reconnect
would be used unconditionally, so you would break a lot of
applications silently.
Is it possible to verify the archive on disk independently of
application? Adding a second delay parameter provides a workaround for
some use cases without complecting xlog.c.
That's possible, not with core though. The archives are in a location
not controlled by the backend, but by archive_command, which may not
be local to the instance where Postgres is running. You could always
hack your own functions to do this work, here is an example of
something I came up with:
https://github.com/michaelpq/pg_plugins/tree/master/wal_utils
This prototype (use and hack at your own risk) is able to look at the
local contents of an archive. This can be used easily with a
client-side tool to copy a series of segments, or just to perform
sanity checks on them.
For those reasons, -1 for the patch as proposed.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 17, 2017 at 12:34:17PM +0900, Michael Paquier wrote:
On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman <ericshane@eradman.com> wrote:
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
Taking a step back here... recoveryApplyDelay() uses
XLogCtl->recoveryWakeupLatch which gets set if the WAL receiver has
received new WAL, or if the WAL receiver shuts down properly.
I thought I had observed cases where the WalReceiver was shut down
without causing the wait on XLogCtl->recoveryWakeupLatch to return. If
I'm wrong about this then there's no reason to spin every n seconds.
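For reference, the "spin every n seconds" behaviour in the first patch amounts to clamping the timeout that is passed to WaitLatch(). A minimal standalone sketch of that arithmetic follows; the function name capped_timeout_ms and the POLL_INTERVAL_SECS macro are made up for this example, and only the 10-second value and the millisecond conversion come from the patch.

```c
#include <assert.h>

#define POLL_INTERVAL_SECS 10	/* cap used in the first patch */

/*
 * Hypothetical helper: convert the remaining delay (seconds plus
 * microseconds) into a wait timeout in milliseconds, never waiting
 * longer than POLL_INTERVAL_SECS so the loop can re-check whether the
 * WAL receiver is still streaming.
 */
static long
capped_timeout_ms(long secs, int microsecs)
{
    assert(secs >= 0 && microsecs >= 0 && microsecs < 1000000);

    if (secs > POLL_INTERVAL_SECS)
    {
        secs = POLL_INTERVAL_SECS;
        microsecs = 0;
    }
    return secs * 1000L + microsecs / 1000;
}
```

A one-day remaining delay is therefore waited out in 10000 ms slices, while a remaining delay of 3.25 seconds is passed through unchanged as 3250 ms. If the latch is reliably set when the WAL receiver shuts down, this cap is unnecessary.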
the WAL receiver gets down for whatever reason during the loop of
recoveryApplyDelay(), the startup process waits for a record to be
applied maybe for a long time, and as there is no WAL receiver we
actually don't receive any new WAL records.
...
indeed a waste to not have a WAL receiver online while waiting for a
delay to be applied.
Exactly!
If there is a flaky network between the primary and a standby, you
may end up with a standby way behind its primary, and that could
penalize a primary clean shutdown as the primary waits for the
shutdown checkpoint record to be flushed on the standby.
This is another artifact that the database administrator would not
anticipate.
I think that your way to deal with the problem is messy though. If you
think about it, no parameters are actually needed. What you should try
to achieve is to make recoveryApplyDelay() smarter: forcibly stop the
wait if you detect a failure, get out of the redo routine, and then
force the record to be read again. This way, the startup process would
try to start a new WAL receiver if it thinks that the source it should
read WAL from is a stream. That may turn out to be a more complicated
patch than you think though.
One of my earlier attempts was to break from the redo loop and try
reading the next record. This was too simple because it only starts the
WAL receiver if there is nothing more to be read from the archive.
Which record are you suggesting should be forcibly "read again"? The
record identified by XLogCtl->replayEndRecPtr or
XLogCtl->lastReplayedEndRecPtr? I'll look more carefully at such an
approach.
Your patch also actually breaks the use case of standbys doing
recovery using only archives and no streaming. In this case
WalRcvStreaming returns false, and recovery_min_apply_delay_reconnect
would be used unconditionally, so you would break a lot of
applications silently.
Excellent point--I had not thought of how this would interact with a
standby that used only archives.
All useful feedback, thank you for the thorough review!
--
Eric Radman | http://eradman.com
On Tue, Oct 17, 2017 at 10:40 PM, Eric Radman <ericshane@eradman.com> wrote:
On Tue, Oct 17, 2017 at 12:34:17PM +0900, Michael Paquier wrote:
I thought I had observed cases where the WalReceiver was shut down
without causing the wait on XLogCtl->recoveryWakeupLatch to return. If
I'm wrong about this then there's no reason to spin every n seconds.
I would expect a patch to not move the timeout calculation within the
loop in recoveryApplyDelay() as you did.
Which record are you suggesting should be forcibly "read again"? The
record identified by XLogCtl->replayEndRecPtr or
XLogCtl->lastReplayedEndRecPtr? I'll look more carefully at such an
approach.
I have not looked at how to do that in detail, but as the delay is
applied for commit WAL records, you would need to make the redo loop
look again at this same record once you have switched back to a
streaming state. Something to be careful about is that you should not
apply the same delay multiple times for the same record.
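One way to read that last caution, sketched here as an assumption rather than as a design anyone in this thread has settled on: if the deadline is always derived from the record's commit timestamp (as recoveryDelayUntilTime is today), looking at the same record again gives the same deadline, whereas restarting the clock at each visit would stretch the delay. The function names below are hypothetical and times are plain millisecond counters.

```c
#include <stdint.h>

/*
 * Deadline anchored to the commit timestamp: the result is the same no
 * matter how many times the record is revisited.
 */
static int64_t
deadline_from_commit_ms(int64_t commit_time_ms, int64_t delay_ms)
{
    return commit_time_ms + delay_ms;
}

/*
 * Counter-example: deadline anchored to the time of the visit. Each
 * re-read of the record pushes the deadline further out, which is the
 * "same delay applied multiple times" problem.
 */
static int64_t
deadline_from_visit_ms(int64_t visit_time_ms, int64_t delay_ms)
{
    return visit_time_ms + delay_ms;
}
```

With a 3000 ms delay and a commit at t=1000, the first function yields 4000 on every visit; the second yields 4000 on a visit at t=1000 but 5500 on a re-read at t=2500.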
--
Michael
On Tue, Oct 17, 2017 at 12:34:17PM +0900, Michael Paquier wrote:
On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman <ericshane@eradman.com> wrote:
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
I see what you are trying to achieve and that seems worth it. It is
indeed a waste to not have a WAL receiver online while waiting for a
delay to be applied.
...
If you think about it, no parameters are actually needed. What you
should try to achieve is to make recoveryApplyDelay() smarter,
This would be even better. Attached is the 2nd version of this patch
that I'm using until an alternate solution is developed.
Your patch also actually breaks the use case of standbys doing
recovery using only archives and no streaming
This version disarms recovery_min_apply_delay_reconnect if a primary is
not defined. It also relies on the wait on XLogCtl->recoveryWakeupLatch
returning if the WalReceiver is shut down; this does work reliably.
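The difference between the two versions comes down to the condition that selects the shorter delay. A standalone sketch of both predicates follows; the function names are made up for this example, and primary_conninfo stands in for the PrimaryConnInfo variable in xlog.c.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * v1 behaviour: any standby that is not currently streaming gets the
 * reconnect delay, including one that only ever restores from archive.
 */
static bool
use_reconnect_delay_v1(const char *primary_conninfo, bool streaming)
{
    (void) primary_conninfo;	/* not consulted in v1 */
    return !streaming;
}

/*
 * v2 behaviour: the reconnect delay applies only when a primary is
 * configured and streaming is currently down.
 */
static bool
use_reconnect_delay_v2(const char *primary_conninfo, bool streaming)
{
    return primary_conninfo != NULL && !streaming;
}
```

For an archive-only standby (no primary configured, never streaming), v1 selects the reconnect delay and v2 keeps the regular recovery_min_apply_delay.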
--
Eric Radman | http://eradman.com
Attachments:
delay-reconnect-param-v2.patch (text/plain; charset=us-ascii)
commit 36b5a022241c1ade9dcf5ffc46f926e46f4ee696
Author: Eric Radman <ericshane@eradman.com>
Date: Tue Oct 17 19:10:22 2017 -0400
Add recovery_min_apply_delay_reconnect recovery option
'recovery_min_apply_delay_reconnect' allows an administrator to specify
how a standby using 'recovery_min_apply_delay' responds when streaming
replication is interrupted.
Combining these two parameters provides a fixed delay under normal
operation while maintaining some assurance that the standby contains an
up-to-date copy of the WAL.
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
It would be better if a second option was not added, but a second delay
parameter provides a workaround for some use cases without complecting
xlog.c.
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 4e1aa74c1f..8e395edae0 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -502,6 +502,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="recovery-min-apply-delay-reconnect" xreflabel="recovery_min_apply_delay_reconnect">
+ <term><varname>recovery_min_apply_delay_reconnect</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>recovery_min_apply_delay_reconnect</> recovery parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ If streaming replication is interrupted while
+ <varname>recovery_min_apply_delay</varname> is set, WAL records will be
+ replayed from the archive. After all records have been processed from
+ local disk, <productname>PostgreSQL</> will attempt to resume streaming
+ and connect to the master.
+ </para>
+ <para>
+ This parameter is used to compromise the fixed apply delay in order to
+ reestablish streaming. In this way a standby server can be run in fair
+ conditions with a long delay (hours or days) while specifying
+ the maximum delay that can be expected before the WAL archive is brought
+ back up to date with the master after a network failure.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..6f4c7bf3e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -267,6 +267,7 @@ static TimestampTz recoveryTargetTime;
static char *recoveryTargetName;
static XLogRecPtr recoveryTargetLSN;
static int recovery_min_apply_delay = 0;
+static int recovery_min_apply_delay_reconnect = 0;
static TimestampTz recoveryDelayUntilTime;
/* options taken from recovery.conf for XLOG streaming */
@@ -5227,6 +5228,7 @@ readRecoveryCommandFile(void)
*head = NULL,
*tail = NULL;
bool recoveryTargetActionSet = false;
+ const char *hintmsg;
fd = AllocateFile(RECOVERY_COMMAND_FILE, "r");
@@ -5452,8 +5454,6 @@ readRecoveryCommandFile(void)
}
else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
{
- const char *hintmsg;
-
if (!parse_int(item->value, &recovery_min_apply_delay, GUC_UNIT_MS,
&hintmsg))
ereport(ERROR,
@@ -5463,6 +5463,25 @@ readRecoveryCommandFile(void)
hintmsg ? errhint("%s", _(hintmsg)) : 0));
ereport(DEBUG2,
(errmsg_internal("recovery_min_apply_delay = '%s'", item->value)));
+ recovery_min_apply_delay_reconnect = recovery_min_apply_delay;
+ }
+ else if (strcmp(item->name, "recovery_min_apply_delay_reconnect") == 0)
+ {
+ if (!parse_int(item->value, &recovery_min_apply_delay_reconnect, GUC_UNIT_MS,
+ &hintmsg))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("parameter \"%s\" requires a temporal value",
+ "recovery_min_apply_delay_reconnect"),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ if (recovery_min_apply_delay_reconnect > recovery_min_apply_delay)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("\"%s\" must be <= \"%s\"",
+ "recovery_min_apply_delay_reconnect",
+ "recovery_min_apply_delay")));
+ ereport(DEBUG2,
+ (errmsg_internal("recovery_min_apply_delay_reconnect = '%s'", item->value)));
}
else
ereport(FATAL,
@@ -6080,20 +6099,25 @@ recoveryApplyDelay(XLogReaderState *record)
if (!getRecordTimestamp(record, &xtime))
return false;
- recoveryDelayUntilTime =
- TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
- /*
- * Exit without arming the latch if it's already past time to apply this
- * record
- */
- TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
- &secs, &microsecs);
- if (secs <= 0 && microsecs <= 0)
- return false;
while (true)
{
+ if (PrimaryConnInfo != NULL && !WalRcvStreaming())
+ recoveryDelayUntilTime =
+ TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay_reconnect);
+ else
+ recoveryDelayUntilTime =
+ TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+ TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
+ &secs, &microsecs);
+ /*
+ * Exit without arming the latch if it's already past time to apply this
+ * record
+ */
+ if (secs <= 0 && microsecs <= 0)
+ return false;
+
ResetLatch(&XLogCtl->recoveryWakeupLatch);
/* might change the trigger file's location */
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 8909c4548b..36b94817f0 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -20,13 +20,17 @@ my $backup_name = 'my_backup';
$node_master->backup($backup_name);
# Create streaming standby from backup
-my $node_standby = get_new_node('standby');
-my $delay = 3;
+# Set recovery_min_apply_delay_reconnect to verify that in normal conditions it
+# does not interfere with recovery_min_apply_delay
+my $node_standby = get_new_node('standby');
+my $delay = 3;
+my $delay_reconnect = 1;
$node_standby->init_from_backup($node_master, $backup_name,
has_streaming => 1);
$node_standby->append_conf(
'recovery.conf', qq(
recovery_min_apply_delay = '${delay}s'
+recovery_min_apply_delay_reconnect = '${delay_reconnect}s'
));
$node_standby->start;
On Fri, Oct 20, 2017 at 3:46 AM, Eric Radman <ericshane@eradman.com> wrote:
On Tue, Oct 17, 2017 at 12:34:17PM +0900, Michael Paquier wrote:
On Tue, Oct 17, 2017 at 12:51 AM, Eric Radman <ericshane@eradman.com> wrote:
This administrative compromise is necessary because the WalReceiver is
not resumed after a network interruption until all records are read,
verified, and applied from the archive on disk.
I see what you are trying to achieve and that seems worth it. It is
indeed a waste to not have a WAL receiver online while waiting for a
delay to be applied.
...
If you think about it, no parameters are actually needed. What you
should try to achieve is to make recoveryApplyDelay() smarter,
This would be even better. Attached is the 2nd version of this patch
that I'm using until an alternate solution is developed.
I definitely agree that WAL receiver restarts could be handled better;
however, this needs a better-thought-out refactoring, which is not what
this patch provides, so I am marking it as returned with feedback.
People looking for a solution, and not using archiving (because your
patch breaks it), could always apply what you have as a workaround.
--
Michael