Replication & recovery_min_apply_delay
Hi hackers,
One of our customers was faced with the following problem:
he had set up physical primary-slave replication but for some reason
specified a very large (~12 hours)
recovery_min_apply_delay. I do not know the precise reasons for such a
large gap between master and replica.
Everything works normally until the replica is restarted. Then it starts
to apply WAL, reaches the point where the record timestamp is less than
12 hours old,
and ... suspends recovery. No WAL receiver is launched, so nobody is
fetching changes from the master.
This can cause the master's WAL space to overflow (if there is a
replication slot) and loss of data in case of a master crash.
It looks like the right behavior is to be able to launch the WAL receiver
before the replica reaches the end of WAL.
For example, we can launch it before going to sleep in recoveryApplyDelay.
We need to specify a start LSN for the WAL sender. I didn't find a better
solution than iterating over the WAL until I reach the last valid record.
I attach a small patch which implements this approach.
I wonder if it can be considered an acceptable solution to the problem,
or is there some better approach?
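In rough outline, the proposed approach looks like this (a simplified
sketch, not the attached patch: it assumes the v12-era xlogreader API
with read_local_xlog_page as the page-read callback, uses a hypothetical
replayPtr variable, and elides error handling):

/* Sketch: find the end of locally available WAL by reading records
 * forward until the first invalid one. */
static XLogRecPtr
GetLastLSN(XLogRecPtr from)
{
    XLogReaderState *reader;
    char       *errormsg;
    XLogRecPtr  last = from;

    reader = XLogReaderAllocate(wal_segment_size,
                                &read_local_xlog_page, NULL);

    /* the first call seeks to 'from'; later calls continue forward */
    while (XLogReadRecord(reader, from, &errormsg) != NULL)
    {
        last = reader->EndRecPtr;
        from = InvalidXLogRecPtr;
    }

    XLogReaderFree(reader);
    return last;
}

/* Then, before sleeping in recoveryApplyDelay(), make sure streaming
 * is up so the delay does not also stall WAL fetching (replayPtr
 * stands for the LSN of the record being delayed): */
if (!WalRcvStreaming())
    RequestXLogStreaming(ThisTimeLineID, GetLastLSN(replayPtr),
                         PrimaryConnInfo, PrimarySlotName);

Note that RequestXLogStreaming already rounds the requested position
down to a segment boundary, so any LSN inside the last segment works.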
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
wal_apply_delay.patch (text/x-patch, +60/-0)
Hi
On 2019-Jan-30, Konstantin Knizhnik wrote:
One of our customers was faced with the following problem: he had
set up physical primary-slave replication but for some reason
specified a very large (~12 hours) recovery_min_apply_delay.
We also came across this exact same problem some time ago. It's pretty
nasty. I wrote a quick TAP reproducer, attached (needed a quick patch
for PostgresNode itself too.)
I tried several failed strategies:
1. Setting lastSourceFailed just before sleeping for the apply delay, with
the idea that for the next fetch we would try streaming. But this
doesn't work because WaitForWALToBecomeAvailable is not executed.
2. Splitting WaitForWALToBecomeAvailable in two pieces, so that we can call
the first half in the restore loop. But this causes a 1s wait
between segments (error recovery) and we never actually catch up.
What I thought back then was the *real* solution, but didn't get around
to implementing, is the idea you describe: starting a walreceiver at an
earlier point.
I wonder if it can be considered as acceptable solution of the problem or
there can be some better approach?
I didn't find one.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jan-30, Konstantin Knizhnik wrote:
I wonder if it can be considered an acceptable solution to the problem,
or is there some better approach?
I didn't find one.
It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?
--
Thomas Munro
https://enterprisedb.com
On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote:
On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jan-30, Konstantin Knizhnik wrote:
I wonder if it can be considered an acceptable solution to the problem,
or is there some better approach?
I didn't find one.
It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?
Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with the scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.
--
Michael
On 08.07.2019 11:05, Michael Paquier wrote:
On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote:
On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jan-30, Konstantin Knizhnik wrote:
I wonder if it can be considered an acceptable solution to the problem,
or is there some better approach?
I didn't find one.
It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?
Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with the scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.
--
Michael
I am sorry for the delay in answering.
It looks like I did not notice your reply :(
Can you please explain why it is not correct to iterate through the WAL
using a WAL reader to get the last LSN?
From my point of view it may not be the most efficient way, but it should
return the correct value, shouldn't it?
Can you suggest some better way to calculate the last LSN?
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2019-Jul-31, Konstantin Knizhnik wrote:
On 08.07.2019 11:05, Michael Paquier wrote:
Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with the scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.
I am sorry for the delay in answering.
It looks like I did not notice your reply :(
Can you please explain why it is not correct to iterate through the WAL
using a WAL reader to get the last LSN?
I agree that it's conceptually ugly, but as I mentioned in my previous
reply, I tried several other strategies before giving up and ended up
concluding that this way was a good way to solve the problem.
I don't endorse the exact patch submitted, though. I think it should
have a lot more commentary on what the code is doing and why.
As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, but certainly
it'd be a good idea to ensure that this continues to work.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:
As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.
Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the
case where a WAL receiver is not started. So I think that we should
track that as the last replayed LSN, or the last received LSN if a WAL
receiver is up and running, whichever is newer. Splitting the WAL receiver
restart logic into a separate routine is a good idea in itself; the patch
attempting to make primary_conninfo reloadable could make use of that.
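In code, the suggested marker might look roughly like this (a sketch
only; the function name is hypothetical and the getters are the
v12-era ones):

/* Sketch: the newest LSN the startup process can know about - the
 * last record applied by recovery, or the last LSN written by the
 * walreceiver if one is running, whichever is newer. */
static XLogRecPtr
GetNewestKnownLSN(void)
{
    XLogRecPtr  replayed = GetXLogReplayRecPtr(NULL);
    XLogRecPtr  received = WalRcvStreaming() ?
        GetWalRcvWriteRecPtr(NULL, NULL) : InvalidXLogRecPtr;

    return Max(replayed, received);
}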
--
Michael
On 2019-Aug-02, Michael Paquier wrote:
On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:
As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, but certainly
it'd be a good idea to ensure that this continues to work.
Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the
case where a WAL receiver is not started. So I think that we should
track that as the last replayed LSN, or the last received LSN if a WAL
receiver is up and running, whichever is newer. Splitting the WAL receiver
restart logic into a separate routine is a good idea in itself; the patch
attempting to make primary_conninfo reloadable could make use of that.
Konstantin, any interest in trying this?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04.09.2019 1:22, Alvaro Herrera wrote:
On 2019-Aug-02, Michael Paquier wrote:
On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:
As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, but certainly
it'd be a good idea to ensure that this continues to work.
Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the
case where a WAL receiver is not started. So I think that we should
track that as the last replayed LSN, or the last received LSN if a WAL
receiver is up and running, whichever is newer. Splitting the WAL receiver
restart logic into a separate routine is a good idea in itself; the patch
attempting to make primary_conninfo reloadable could make use of that.
Konstantin, any interest in trying this?
Sorry, I do not understand this proposal.
we also need to account for the case where a WAL receiver is not
started.
Maybe I missed something, but what this patch is trying to achieve is
to launch the WAL receiver before already-received transactions are
applied, so the WAL receiver is definitely not started at this moment.
receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at
the moment when we need to know the LSN of the last record.
Certainly it should be possible to somehow persist receivedUpto, so that
we do not need to scan the WAL to determine the last LSN at the next
start. But persisting the last LSN introduces a lot of questions and
problems.
For example, when does it need to be flushed to disk? If that is done
after each received transaction, performance can suffer significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it would significantly complicate the patch and make it more
sensitive to various errors.
I wonder what is wrong with determining the LSN of the last record by
just scanning the WAL?
Certainly it is not the most efficient way, but I do not expect that
somebody will have hundreds or thousands of megabytes of WAL.
Michael, do you see some other problem with the GetLastLSN() function
except the time of its execution?
IMHO one of the main advantages of this patch is that it is very simple.
We need to scan the WAL to locate the last LSN only if
recovery_min_apply_delay is set,
so this patch should not affect the performance of all other cases.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Sep 4, 2019 at 4:37 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at
the moment when we need to know the LSN of the last record.
Certainly it should be possible to somehow persist receivedUpto, so that
we do not need to scan the WAL to determine the last LSN at the next
start. But persisting the last LSN introduces a lot of questions and
problems.
For example, when does it need to be flushed to disk? If that is done
after each received transaction, performance can suffer significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it would significantly complicate the patch and make it more
sensitive to various errors.
I think we don't necessarily need the exact value of receivedUpto, but it
could be some place to start scanning the WAL from. We currently call
UpdateControlFile() in a lot of places; in particular, we call it at each
checkpoint. Even if we started scanning the WAL from a value of
receivedUpto that is one checkpoint old, we could still save a lot of
resources.
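A sketch of that idea (lastReceivedUpto is a hypothetical field that
would have to be added to pg_control; everything else is existing
backend machinery):

/* Sketch: choose a conservative starting point for the WAL scan
 * instead of reading from the very beginning. */
XLogRecPtr  scanstart;

LWLockAcquire(ControlFileLock, LW_SHARED);
scanstart = ControlFile->lastReceivedUpto;      /* hypothetical field */
if (XLogRecPtrIsInvalid(scanstart))
    scanstart = ControlFile->checkPointCopy.redo;   /* fallback */
LWLockRelease(ControlFileLock);

/* then run the record-by-record scan from scanstart, as before */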
I wonder what is wrong with determining the LSN of the last record by
just scanning the WAL?
Certainly it is not the most efficient way, but I do not expect that
somebody will have hundreds or thousands of megabytes of WAL.
Michael, do you see some other problem with the GetLastLSN() function
except the time of its execution?
As I understand it, this patch fixes a problem with a very large recovery
apply delay. In that case, the amount of accumulated WAL corresponding to
the delay could also be huge. Scanning all of that WAL could be costly,
and it would be nice to avoid it.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Sep 10, 2019 at 12:46:49AM +0300, Alexander Korotkov wrote:
On Wed, Sep 4, 2019 at 4:37 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at
the moment when we need to know the LSN of the last record.
Certainly it should be possible to somehow persist receivedUpto, so that
we do not need to scan the WAL to determine the last LSN at the next
start. But persisting the last LSN introduces a lot of questions and
problems.
For example, when does it need to be flushed to disk? If that is done
after each received transaction, performance can suffer significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it would significantly complicate the patch and make it more
sensitive to various errors.
I think we don't necessarily need the exact value of receivedUpto, but it
could be some place to start scanning the WAL from. We currently call
UpdateControlFile() in a lot of places; in particular, we call it at each
checkpoint. Even if we started scanning the WAL from a value of
receivedUpto that is one checkpoint old, we could still save a lot of
resources.
A minimum to set would be the minimum consistency LSN, but there are a
lot of gotchas to take into account when it comes to crash recovery.
As I understand it, this patch fixes a problem with a very large recovery
apply delay. In that case, the amount of accumulated WAL corresponding to
the delay could also be huge. Scanning all of that WAL could be costly,
and it would be nice to avoid it.
Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.
There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if any delay is defined, even if restore_command is
enabled. Nor can we assume that everybody will want to start a
WAL receiver in this configuration if there is archiving behind it with
a lot of segments, which allows for a larger catchup window.
--
Michael
On Tue, Sep 10, 2019 at 03:23:25PM +0900, Michael Paquier wrote:
Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.
There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if any delay is defined, even if restore_command is
enabled. Nor can we assume that everybody will want to start a
WAL receiver in this configuration if there is archiving behind it with
a lot of segments, which allows for a larger catchup window.
Two months later, we still have a patch registered in the CF which is
incorrect in a couple of respects, and which breaks documented
scenarios. Shouldn't we actually have a GUC to
control the startup of the WAL receiver instead?
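Such a GUC would presumably be an enum. A sketch of what the guc.c
entry might look like (the values, default, and enum constants here are
assumptions, not taken from any posted patch; the name anticipates the
wal_receiver_start_condition GUC added in the next patch version):

/* hypothetical GUC variable and options table */
static int wal_receiver_start_condition;

static const struct config_enum_entry wal_rcv_start_options[] = {
    {"exhaust", WAL_RCV_START_AT_EXHAUST, false},   /* historic behavior */
    {"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
    {NULL, 0, false}
};

/* entry for ConfigureNamesEnum[] in guc.c */
{
    {"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
        gettext_noop("When to start the WAL receiver on a standby."),
        NULL
    },
    &wal_receiver_start_condition,
    WAL_RCV_START_AT_EXHAUST, wal_rcv_start_options,
    NULL, NULL, NULL
},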
--
Michael
On 15.11.2019 5:52, Michael Paquier wrote:
On Tue, Sep 10, 2019 at 03:23:25PM +0900, Michael Paquier wrote:
Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.
There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if any delay is defined, even if restore_command is
enabled. Nor can we assume that everybody will want to start a
WAL receiver in this configuration if there is archiving behind it with
a lot of segments, which allows for a larger catchup window.
Two months later, we still have a patch registered in the CF which is
incorrect in a couple of respects, and which breaks documented
scenarios. Shouldn't we actually have a GUC to
control the startup of the WAL receiver instead?
--
Michael
Attached please find a rebased version of the patch with a
"wal_receiver_start_condition" GUC added (preserving the original
behavior by default).
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
wal_apply_delay-2.patch (text/x-patch, +107/-1)
On Fri, Nov 15, 2019 at 06:48:01PM +0300, Konstantin Knizhnik wrote:
Attached please find a rebased version of the patch with a
"wal_receiver_start_condition" GUC added (preserving the original
behavior by default).
Konstantin, please be careful with the patch entry in the CF app.
This was marked as waiting on author, but that does not reflect reality,
as you have sent a new patch, so I have moved the patch to the next CF
instead, with "Needs review" as its status.
--
Michael
Replay lag can build up naturally, even when recovery_min_apply_delay
is not set, because WAL generation on the master is concurrent while WAL
replay on the standby is performed by a single process.
Hao and I have incorporated the new GUC from Konstantin's patch
and used it to start the WAL receiver in the main replay loop, regardless
of whether recovery_min_apply_delay is set.
Instead of going through each existing WAL record to determine the
streaming start point, the WAL receiver is changed to persist the WAL
segment number of a new WAL segment just before that segment is created.
WAL streaming always begins at a WAL segment boundary, so the persisted
segment number can easily be used to compute the start point, which is
the beginning of that segment.
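Schematically, computing the start point from the persisted segment
number could look like this (a sketch; ReadPersistedStartSegNo() is a
hypothetical helper, the rest are v12-era backend interfaces):

/* Sketch: a segment number maps directly to the LSN of its first byte. */
XLogSegNo   segno = ReadPersistedStartSegNo();  /* hypothetical helper */
XLogRecPtr  startpoint;

XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, startpoint);
RequestXLogStreaming(ThisTimeLineID, startpoint,
                     PrimaryConnInfo, PrimarySlotName);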
We also have a TAP test to demonstrate the problem in two situations:
(1) the WAL receiver process dies because the replication connection gets
disconnected, and (2) the standby goes through a restart. The test uses
recovery_min_apply_delay to delay replay and expects new commits
made after the WAL receiver exits not to block.
Asim
Attachments:
v2-0001-Test-that-replay-of-WAL-logs-on-standby-does-not-.patch (application/octet-stream, +188/-1)
v2-0002-Start-WAL-receiver-before-startup-process-replays.patch (application/octet-stream, +157/-6)
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
The logic to start the WAL receiver early should not be coupled with the recovery_min_apply_delay GUC. The WAL receiver's delayed start affects replication in general, even when the GUC is not set.
A better fix would be to start the WAL receiver in the main replay loop, as soon as a consistent state has been reached.
As noted during previous reviews, scanning all of the WAL just to determine the streaming start point seems slow. A faster solution seems desirable.
The new status of this patch is: Waiting on Author
On Mon, Jan 27, 2020 at 5:11 PM Asim Rama Praveen <apraveen@pivotal.io>
wrote:
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
The logic to start the WAL receiver early should not be coupled with the
recovery_min_apply_delay GUC. The WAL receiver's delayed start affects
replication in general, even when the GUC is not set.
A better fix would be to start the WAL receiver in the main replay loop,
as soon as a consistent state has been reached.
As noted during previous reviews, scanning all of the WAL just to
determine the streaming start point seems slow. A faster solution seems
desirable.
The new status of this patch is: Waiting on Author
That review is for Konstantin's patch "wal_apply_delay-2.patch". The
latest patch on this thread addresses the above review comments, so I've
changed the status in the commitfest app back to "needs review".
Asim
A key challenge here is how to determine the starting point for the WAL
receiver when the startup process starts it while still replaying WAL
that has already been received. Hao and I wrote a much faster and less
intrusive solution to determine the starting point: scan the first page
of each WAL segment file, starting from the one that is currently being
read for replay. If the first page is found to be valid, move to the next
segment file and check its first page. Continue this, one segment file at
a time, until either the segment file doesn't exist or the page is not
valid. The starting point is then the previous segment's start address.
There is no need to read to the end of the WAL one record at a time, as
the original proposal upthread did. The most recently flushed LSN does
not need to be persisted on disk, either.
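In outline, the probe looks something like this (a sketch only:
ReadSegmentFirstPage() and replayseg are hypothetical, and the validity
check shown is just the basic page-header sanity test):

/* Sketch: walk forward one segment at a time, checking only the first
 * page of each candidate segment. */
XLogSegNo   seg = replayseg;    /* segment currently being replayed */
char        page[XLOG_BLCKSZ];
XLogRecPtr  startpoint;

for (;;)
{
    XLogPageHeader  hdr = (XLogPageHeader) page;
    XLogRecPtr      pageaddr;

    XLogSegNoOffsetToRecPtr(seg + 1, 0, wal_segment_size, pageaddr);
    if (!ReadSegmentFirstPage(seg + 1, page))   /* hypothetical helper */
        break;                      /* next segment does not exist */
    if (hdr->xlp_magic != XLOG_PAGE_MAGIC ||
        hdr->xlp_pageaddr != pageaddr)
        break;                      /* first page is not valid */
    seg++;
}

/* 'seg' is now the last segment whose first page looked valid;
 * streaming starts from its start address. */
XLogSegNoOffsetToRecPtr(seg, 0, wal_segment_size, startpoint);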
Please see the attached patch, which also contains a TAP test to induce
replay lag and validate the fix.
Asim
Attachments:
v3-0001-Start-WAL-receiver-before-startup-process-replays.patch (application/octet-stream, +345/-11)
This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.
cheers ./daniel
On Mon, 3 Aug 2020 at 15:02, Daniel Gustafsson <daniel@yesql.se> wrote:
This thread has stalled and the patch no longer applies, so I'm marking
this Returned with Feedback. Please feel free to open a new entry if
this patch is revived.
cheers ./daniel
Hi all,
I took the v3 patch [1] from this thread and rebased it on the current
head (commit 5fedf7417b69295294b154a21).
Please find the attached patch for review.
[1] v3 patch:
</messages/by-id/CANXE4TeinQdw+M2Or0kTR24eRgWCOg479N8=gRvj9Ouki-tZFg@mail.gmail.com>
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com