Patch: add recovery_timeout option to control the retry interval after restore_command returns a nonzero status code
Hello everyone.
* Project name: Add recovery_timeout option to control the retry interval after restore_command returns a nonzero status code
* Uniquely identifiable file name, so we can tell difference between your v1 and v24: 0001-add-recovery_timeout-to-controll-timeout-between-res.patch
* What the patch does in a short paragraph: This patch adds a recovery_timeout option, which controls the retry interval after restore_command returns a nonzero status code. Right now the default value is 5 seconds. This is useful if WAL logs are restored from external storage (like AWS S3): no matter what, the slave database will lag behind the master, and since every request to AWS S3 has to be paid for, N nodes retrying the next WAL log every 5 seconds cost more than, for example, every 30 seconds. Before this patch I did it this way: " if ! (/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch "%f" "%p"); then sleep 60; fi ". But in this case restart/stop of the database is slower.
* Whether the patch is for discussion or for application: No such thing.
* Which branch the patch is against: master branch
* Whether it compiles and tests successfully, so we know nothing obvious is broken: compiles and passes tests on my local machine.
* Whether it contains any platform-specific items and if so, has it been tested on other platforms: hopefully not.
* Confirm that the patch includes regression tests to check the new feature actually works as described: No, it doesn't have tests. I haven't found out how to test new config variables.
* Include documentation: added.
* Describe the effect your patch has on performance, if any: it shouldn't affect database performance.
This is my first patch. I am not sure about the name of the option. Maybe it should be called "recovery_nonzero_timeout".
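For illustration, the intended usage would look roughly like this in recovery.conf. This is only a sketch of this proposal: recovery_timeout is not an existing PostgreSQL parameter, and the wal-e command is just the example from above.

```
# recovery.conf sketch for the proposed option (names come from this
# patch proposal, not from released PostgreSQL)
restore_command = '/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch "%f" "%p"'

# Retry a failed restore_command every 30 seconds instead of the
# 5-second default, reducing the number of billable fetch attempts.
recovery_timeout = '30s'
```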
--
Alexey Vasiliev
Attachments:
0001-add-recovery_timeout-to-controll-timeout-between-res.patch (application/x-patch)
On 11/3/14 6:04 AM, Alexey Vasiliev wrote:
3. What the patch does in a short paragraph: This patch adds a
recovery_timeout option, which controls the retry interval after
restore_command returns a nonzero status code. Right now the default
value is 5 seconds. This is useful if WAL logs are restored from
external storage (like AWS S3): no matter what, the slave database
will lag behind the master, and since every request to AWS S3 has to
be paid for, N nodes retrying the next WAL log every 5 seconds cost
more than, for example, every 30 seconds.
That seems useful. I would include something about this use case in the
documentation.
This is my first patch. I am not sure about the name of the option.
Maybe it should be called "recovery_nonzero_timeout".
The option name had me confused. At first I thought this is the time
after which a running restore_command invocation gets killed. I think a
more precise description might be restore_command_retry_interval.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Mon, 03 Nov 2014 14:36:33 -0500, from Peter Eisentraut <peter_e@gmx.net>:
On 11/3/14 6:04 AM, Alexey Vasiliev wrote:
3. What the patch does in a short paragraph: This patch adds a
recovery_timeout option, which controls the retry interval after
restore_command returns a nonzero status code. Right now the default
value is 5 seconds. This is useful if WAL logs are restored from
external storage (like AWS S3): no matter what, the slave database
will lag behind the master, and since every request to AWS S3 has to
be paid for, N nodes retrying the next WAL log every 5 seconds cost
more than, for example, every 30 seconds.

That seems useful. I would include something about this use case in the
documentation.
Ok, I will add this in patch.
This is my first patch. I am not sure about the name of the option.
Maybe it should be called "recovery_nonzero_timeout".

The option name had me confused. At first I thought this is the time
after which a running restore_command invocation gets killed. I think a
more precise description might be restore_command_retry_interval.
"restore_command_retry_interval" - I like this name!
Should I change my patch and send it by email? And also, as I understand it, I should change the message ID for https://commitfest.postgresql.org/action/patch_view?id=1636, shouldn't I?
Thanks
--
Alexey Vasiliev
Should I change my patch and send it by email? And also, as I understand
it, I should change the message ID for
https://commitfest.postgresql.org/action/patch_view?id=1636, shouldn't I?
You should just send another version of your patch by email and add a new
"comment" to commit-fest using the "Patch" comment type.
Regards,
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello
Mon, 3 Nov 2014 19:18:51 -0200, from Fabrízio de Royes Mello <fabriziomello@gmail.com>:
Added new patch.
--
Alexey Vasiliev
Attachments:
0001-add-recovery_timeout-to-controll-timeout-between-res.patch (application/x-patch)
On Mon, Nov 3, 2014 at 7:25 PM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
Mon, 3 Nov 2014 19:18:51 -0200, from Fabrízio de Royes Mello <fabriziomello@gmail.com>:
Should I change my patch and send it by email? And also as I
understand I should change message ID for
https://commitfest.postgresql.org/action/patch_view?id=1636 , isn't it?
You should just send another version of your patch by email and add a
new "comment" to commit-fest using the "Patch" comment type.
Added new patch.
And you should add the patch to an open commitfest, not to one in progress.
The next open one is 2014-12 [1].
Regards,
[1]: https://commitfest.postgresql.org/action/commitfest_view?id=25
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello
Mon, 3 Nov 2014 22:55:02 -0200, from Fabrízio de Royes Mello <fabriziomello@gmail.com>:
Moved. Thanks.
--
Alexey Vasiliev
Hi,
On 2014-11-03 14:04:00 +0300, Alexey Vasiliev wrote:
* What the patch does in a short paragraph: This patch adds a recovery_timeout option, which controls the retry interval after restore_command returns a nonzero status code. Right now the default value is 5 seconds. This is useful if WAL logs are restored from external storage (like AWS S3): no matter what, the slave database will lag behind the master, and since every request to AWS S3 has to be paid for, N nodes retrying the next WAL log every 5 seconds cost more than, for example, every 30 seconds. Before this patch I did it this way: " if ! (/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch "%f" "%p"); then sleep 60; fi ". But in this case restart/stop of the database is slower.
Without saying that the feature is unnecessary, wouldn't this be better
solved by using streaming replication most of the time?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tue, 4 Nov 2014 14:41:56 +0100, from Andres Freund <andres@2ndquadrant.com>:
Without saying that the feature is unnecessary, wouldn't this be better
solved by using streaming replication most of the time?
But we don't need streaming replication. The master database doesn't need to know about the slaves (and there is no need to add even this little overhead to the master). The slaves read WAL logs from S3, and the same S3 WAL logs are used as backups.
--
Alexey Vasiliev
On Tue, Nov 4, 2014 at 6:25 AM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
Added new patch.
Seems useful to me to be able to tune this interval of time.
I would simply reword the documentation as follows:
If <varname>restore_command</> returns nonzero exit status code, retry
command after the interval of time specified by this parameter.
Default value is <literal>5s</>.
Also, I think that it would be a good idea to error out if this
parameter has a value of, let's say, less than 1s, instead of doing a
check for a positive value in the waiting latch. On top of that, the
default value of restore_command_retry_interval should be 5000 and not
0, to simplify the code.
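Michael's two suggestions (reject too-small values up front, and make the default non-zero) can be sketched as a small validation routine. This is a standalone illustration with invented names, not the patch's actual code; the real patch would do this inside recovery.conf parsing:

```c
#include <stdbool.h>

/* Standalone sketch: validate a retry interval given in milliseconds.
 * A negative input here stands for "not set" and yields the 5000 ms
 * (5s) default; values below 1000 ms are rejected outright, as Michael
 * suggests, instead of being special-cased at the waiting latch. */
static bool
check_retry_interval(int value_ms, int *result_ms)
{
    if (value_ms < 0)
    {
        *result_ms = 5000;      /* default: 5 seconds */
        return true;
    }
    if (value_ms < 1000)
        return false;           /* error out below 1s */
    *result_ms = value_ms;
    return true;
}
```

The point of the non-zero default is that the wait path can then use the value unconditionally, with no "is it set?" branch.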
--
Michael
Thanks for the suggestions.
Patch updated.
--
Alexey Vasiliev
Attachments:
0001-add-recovery_timeout-to-controll-timeout-between-res.patch (application/x-patch)
On Sat, Dec 27, 2014 at 3:42 AM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
Thanks for suggestions.
Patch updated.
Cool, thanks. I just had an extra look at it.
+ This is useful, if I using for restore of wal logs some
+ external storage (like AWS S3) and no matter what the slave database
+ will lag behind the master. The problem, what for each request to
+ AWS S3 need to pay, what is why for N nodes, which try to get next
+ wal log each 5 seconds will be bigger price, than for example each
+ 30 seconds.
I reworked this portion of the docs, it is rather incorrect as the
documentation should not use first-person subjects, and I don't
believe that referencing any commercial products is a good thing in
this context.
+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
This is still referring to a timeout. That's not good. And the name of
the parameter at the top of this comment block is missing.
+static int restore_command_retry_interval = 5000L;
I think it would be more appropriate to set that to 5000 and
multiply by 1L. I am also wondering about a better lower bound,
like 100ms, to avoid some abuse of this feature in the retries?
+ ereport(ERROR,
+         (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+          errmsg("\"%s\" must be bigger zero",
+                 "restore_command_retry_interval")));
I'd rather rewrite that to "must have a strictly positive value".
- * Wait for more WAL to arrive. Time out after 5 seconds,
+ * Wait for more WAL to arrive. Time out after
+ * restore_command_retry_interval (5 seconds by default),
  * like when polling the archive, to react to a trigger
  * file promptly.
  */
  WaitLatch(&XLogCtl->recoveryWakeupLatch,
            WL_LATCH_SET | WL_TIMEOUT,
-           5000L);
+           restore_command_retry_interval);
I should have noticed earlier, but in its current state your patch
actually does not work. What you are doing here is tuning the time the
process waits for WAL from streaming. In your case, what you want to
control is the retry time for restore_command in archive recovery,
no?
--
Michael
Hello.
Thanks, I understand; I was looking at the wrong part of the code. I hope I have made the right changes now.
To avoid modifying the current pg_usleep calculation, I changed the restore_command_retry_interval value to seconds (not milliseconds). In this case the minimum value is 1 second.
--
Alexey Vasiliev
Attachments:
0001-add-recovery_timeout-to-controll-timeout-between-res.patch (application/x-patch)
On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
To not modify current pg_usleep calculation, I changed
restore_command_retry_interval value to seconds (not milliseconds). In this
case, min value - 1 second.
Er, what's the problem with not changing 1000000L to 1000L? The unit of
your parameter is ms AFAIK.
Also, I am not really getting the meaning of this paragraph:
+ <para>
+ This is useful, if I using for restore of wal logs some
+ external storage and no matter what the slave database
+ will lag behind the master.
+ </para>
Could you be more explicit here? What do you mean?
(btw, please let's keep the thread readable and not reply at the top
of each post).
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
To not modify current pg_usleep calculation, I changed
restore_command_retry_interval value to seconds (not milliseconds). In this
case, min value - 1 second.

Er, what's the problem with not changing 1000000L to 1000L? The unit of
your parameter is ms AFAIK.
Of course, I meant the previous version of the patch, not the current
one. Wouldn't it be useful to use it with, for example, retry intervals
on the order of 100ms~300ms in some cases?
--
Michael
Tue, 30 Dec 2014 21:39:33 +0900, from Michael Paquier <michael.paquier@gmail.com>:
Thanks, the patch is changed.
As I understand it, now = (pg_time_t) time(NULL) returns the time in seconds, which is why I multiply the value by 1000 to compare it with restore_command_retry_interval in milliseconds.
I am not sure about small retry intervals; in my cases I need an interval bigger than 5 seconds (20-40 seconds). Right now I limit this value to be bigger than 100 milliseconds.
--
Alexey Vasiliev
Attachments:
0001-add-recovery_timeout-to-controll-timeout-between-res.patch (application/x-patch)
On Wed, Dec 31, 2014 at 12:22 AM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
Tue, 30 Dec 2014 21:39:33 +0900, from Michael Paquier <michael.paquier@gmail.com>:
As I understand it, now = (pg_time_t) time(NULL) returns the time in seconds, which is why I multiply the value by 1000 to compare it with restore_command_retry_interval in milliseconds.
This way of doing it is incorrect: you would get a wait time of 1s even
for retries lower than 1s. In order to get the feature working
correctly, and to get sufficient comparison granularity, you need to
use TimestampTz for the tracking of the current and last failure time
and to use the TimestampTz APIs for difference calculations.
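The granularity point can be seen in a standalone sketch (invented names, not PostgreSQL code): with a whole-seconds clock like time(NULL), two events 300 ms apart can appear 0 ms apart, so no sub-second interval can be honored, whereas microsecond timestamps in the style of TimestampTz preserve the real spacing.

```c
#include <stdint.h>

/* Elapsed time computed from a whole-seconds clock, as with time(NULL).
 * Any sub-second spacing between failure and retry check is lost. */
static int
elapsed_ms_seconds_clock(long t0_sec, long t1_sec)
{
    return (int) ((t1_sec - t0_sec) * 1000);
}

/* Elapsed time computed from microsecond timestamps (TimestampTz-style):
 * millisecond precision survives, so intervals below 1s stay meaningful. */
static int
elapsed_ms_usec_clock(int64_t t0_usec, int64_t t1_usec)
{
    return (int) ((t1_usec - t0_usec) / 1000);
}
```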
I am not sure about small retry intervals; in my cases I need an interval bigger than 5 seconds (20-40 seconds). Right now I limit this value to be bigger than 100 milliseconds.
Other people may disagree here, but I don't see any reason to put
restrictions on this interval of time.
Attached is a patch fixing the feature to use TimestampTz, updating as
well the docs and recovery.conf.sample which was incorrect, on top of
other things I noticed here and there.
Alexey, does this new patch look fine for you?
--
Michael
Attachments:
20150105_restore_retry_interval.patchtext/x-diff; charset=US-ASCII; name=20150105_restore_retry_interval.patchDownload+64-9
Mon, 5 Jan 2015 17:16:43 +0900, from Michael Paquier <michael.paquier@gmail.com>:
Hello. Thanks for the help. Yes, the new patch looks fine!
--
Alexey Vasiliev
On Mon, Jan 5, 2015 at 10:39 PM, Alexey Vasiliev <leopard_ne@inbox.ru> wrote:
Hello. Thanks for the help. Yes, the new patch looks fine!
OK cool. Let's get an opinion from a committer then.
--
Michael
Hi,
On 2015-01-05 17:16:43 +0900, Michael Paquier wrote:
+ <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
+  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    If <varname>restore_command</> returns nonzero exit status code, retry
+    command after the interval of time specified by this parameter,
+    measured in milliseconds if no unit is specified. Default value is
+    <literal>5s</>.
+   </para>
+   <para>
+    This command is particularly useful for warm standby configurations
+    to leverage the amount of retries done to restore a WAL segment file
+    depending on the activity of a master node.
+   </para>
Don't really like this paragraph. What's "leveraging the amount of
retries done" supposed to mean?
+# restore_command_retry_interval
+#
+# specifies an optional retry interval for restore_command command, if
+# previous command returned nonzero exit status code. This can be useful
+# to increase or decrease the number of calls of restore_command.
It's not really the number but frequency, right?
+ else if (strcmp(item->name, "restore_command_retry_interval") == 0)
+ {
+     const char *hintmsg;
+
+     if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
+                    &hintmsg))
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("parameter \"%s\" requires a temporal value",
+                         "restore_command_retry_interval"),
+                  hintmsg ? errhint("%s", _(hintmsg)) : 0));
"temporal value" sounds odd to my ears. I'd rather keep in line with
what guc.c outputs in such a case.
+ now = GetCurrentTimestamp();
+ if (!TimestampDifferenceExceeds(last_fail_time, now,
+                                 restore_command_retry_interval))
  {
-     pg_usleep(1000000L * (5 - (now - last_fail_time)));
-     now = (pg_time_t) time(NULL);
+     long secs;
+     int  microsecs;
+
+     TimestampDifference(last_fail_time, now, &secs, &microsecs);
+     pg_usleep(restore_command_retry_interval * 1000L
+               - (1000000L * secs + 1L * microsecs));
+     now = GetCurrentTimestamp();
  }
  last_fail_time = now;
  currentSource = XLOG_FROM_ARCHIVE;
Am I missing something or does this handle/affect streaming failures
just the same? That might not be a bad idea, because the current
interval is far too long for e.g. a sync replication setup. But it
certainly makes the name a complete misnomer.
Not this patch's fault, but I'm getting a bit tired of seeing the above
open-coded. How about adding a function that does the sleeping based on a
timestamptz and a ms interval?
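Such a helper might look roughly like this. This is a standalone sketch with invented names, not a proposed implementation: the real version would presumably take a TimestampTz, use the TimestampTz difference APIs, and call pg_usleep itself.

```c
#include <stdint.h>

typedef int64_t sketch_timestamptz;     /* microseconds, like TimestampTz */

/* Sketch of the helper being suggested: given the timestamp of the last
 * failure and a retry interval in milliseconds, return how many
 * microseconds are still left to wait, or 0 if the interval has already
 * elapsed.  A caller would pass the result to pg_usleep(), replacing the
 * open-coded difference arithmetic in both retry paths. */
static int64_t
retry_usecs_remaining(sketch_timestamptz last_fail_time,
                      sketch_timestamptz now,
                      int interval_ms)
{
    int64_t target = (int64_t) interval_ms * 1000L;
    int64_t elapsed = now - last_fail_time;

    return (elapsed >= target) ? 0 : target - elapsed;
}
```

Centralizing this would also make it harder to reintroduce the seconds-vs-milliseconds confusion discussed earlier in the thread.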
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services