Replication & recovery_min_apply_delay

Started by Konstantin Knizhnik, almost 7 years ago (27 messages)
#1Konstantin Knizhnik
k.knizhnik@postgrespro.ru
1 attachment(s)

Hi hackers,

One of our customers was faced with the following problem: he has
set up physical primary-slave replication but, for some reason,
specified a very large (~12 hours) recovery_min_apply_delay. I do not
know the precise reason for such a large gap between master and replica.
But everything works normally until the replica is restarted. Then it
starts to apply WAL, reaches the point where the record timestamp is less
than 12 hours old, and ... suspends recovery. No WAL receiver is launched,
so nobody is fetching changes from the master.
This can cause the master's WAL space to overflow (if there is a
replication slot) and loss of data in case of a master crash.

It looks like the right behavior is to be able to launch the WAL receiver
before the replica reaches the end of WAL.
For example, we can launch it before going to sleep in recoveryApplyDelay.
We need to specify a start LSN for the WAL sender. I didn't find a better
solution than iterating over WAL until I reach the last valid record.

I attach a small patch which implements this approach.
I wonder if it can be considered an acceptable solution to the problem,
or if there is some better approach?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wal_apply_delay.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d80..ef6433f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -802,6 +802,7 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
  */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
+static bool stopOnError = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -3971,6 +3972,49 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 }
 
 /*
+ * Find the latest LSN available in local WAL, scanning forward from the given LSN.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+	XLogPageReadPrivate private;
+	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, &XLogPageRead, &private);
+
+	stopOnError = true;
+	while (XLogReadRecord(xlogreader, lsn, &errormsg) != NULL)
+	{
+		lsn = InvalidXLogRecPtr;
+	}
+	stopOnError = false;
+	lsn = xlogreader->EndRecPtr;
+	XLogReaderFree(xlogreader);
+
+	return lsn;
+}
+
+/*
+ * Launch WalReceiver starting from last LSN if not started yet.
+ */
+static void
+StartWalRcv(XLogRecPtr currLsn)
+{
+	if (!WalRcvStreaming() && PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+	{
+		XLogRecPtr lastLSN = GetLastLSN(currLsn);
+		if (lastLSN != InvalidXLogRecPtr)
+		{
+			curFileTLI = ThisTimeLineID;
+			RequestXLogStreaming(ThisTimeLineID, lastLSN, PrimaryConnInfo,
+								 PrimarySlotName);
+		}
+	}
+}
+
+/*
  * Remove WAL files that are not part of the given timeline's history.
  *
  * This is called during recovery, whenever we switch to follow a new
@@ -6004,6 +6048,12 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (secs <= 0 && microsecs <= 0)
 		return false;
 
+	/*
+	 * Start WAL receiver if not started yet, to avoid WAL overflow at the primary node
+	 * or a large gap between primary and replica when an apply delay is specified.
+	 */
+	StartWalRcv(record->EndRecPtr);
+
 	while (true)
 	{
 		ResetLatch(&XLogCtl->recoveryWakeupLatch);
@@ -11821,6 +11871,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						return false;
 
+					/*
+					 * If WAL receiver was already started because of the apply delay,
+					 * then restart it.
+					 */
+					if (WalRcvStreaming())
+						ShutdownWalRcv();
+
+                    /*
 					 * If primary_conninfo is set, launch walreceiver to try
 					 * to stream the missing WAL.
 					 *
@@ -11990,6 +12047,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				if (stopOnError)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Konstantin Knizhnik (#1)
2 attachment(s)
Re: Replication & recovery_min_apply_delay

Hi

On 2019-Jan-30, Konstantin Knizhnik wrote:

One of our customers was faced with the following problem: he has
set up physical primary-slave replication but, for some reason,
specified a very large (~12 hours) recovery_min_apply_delay.

We also came across this exact same problem some time ago. It's pretty
nasty. I wrote a quick TAP reproducer, attached (needed a quick patch
for PostgresNode itself too.)

I tried several failed strategies:
1. setting lastSourceFailed just before sleeping for the apply delay, with
the idea that for the next fetch we would try streaming. But this
doesn't work because WaitForWALToBecomeAvailable is not executed.

2. splitting WaitForWALToBecomeAvailable in two pieces, so that we can call
the first half in the restore loop. But this causes a 1s wait
between segments (error recovery) and we never actually catch up.

What I thought back then was the *real* solution, but didn't get around
to implementing, is the idea you describe: starting a walreceiver at an
earlier point.

I wonder if it can be considered an acceptable solution to the problem, or
if there is some better approach?

I didn't find one.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Support-pg_basebackup-S-in-PostgresNode-backup.patch (text/x-diff; charset=us-ascii)
From 5d16db9a8308692f66b2836432fe84fbbec3e81f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 17 Aug 2018 14:20:47 -0300
Subject: [PATCH] Support pg_basebackup -S in PostgresNode->backup()

---
 src/test/perl/PostgresNode.pm | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index d9aeb277d9..2442251683 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -488,6 +488,9 @@ Create a hot backup with B<pg_basebackup> in subdirectory B<backup_name> of
 B<< $node->backup_dir >>, including the WAL. WAL files
 fetched at the end of the backup, not streamed.
 
+The keyword parameter replication_slot => 'myslot' can be used for the B<-S>
+argument to B<pg_basebackup>.
+
 You'll have to configure a suitable B<max_wal_senders> on the
 target server since it isn't done by default.
 
@@ -495,14 +498,17 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $port        = $self->port;
 	my $name        = $self->name;
 
+	my @cmd = ("pg_basebackup", "-D", $backup_path, "-p", $port, "--no-sync");
+	push @cmd, '-S', $params{replication_slot}
+	  if defined $params{replication_slot};
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
-	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
-		'--no-sync');
+	TestLib::system_or_bail(@cmd);
 	print "# Backup finished\n";
 }
 
-- 
2.11.0

014_nostream.pl (text/x-perl; charset=us-ascii)
#3Thomas Munro
thomas.munro@gmail.com
In reply to: Alvaro Herrera (#2)
Re: Replication & recovery_min_apply_delay

On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Jan-30, Konstantin Knizhnik wrote:

I wonder if it can be considered an acceptable solution to the problem, or
if there is some better approach?

I didn't find one.

It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?

--
Thomas Munro
https://enterprisedb.com

#4Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#3)
Re: Replication & recovery_min_apply_delay

On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote:

On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Jan-30, Konstantin Knizhnik wrote:

I wonder if it can be considered an acceptable solution to the problem, or
if there is some better approach?

I didn't find one.

It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?

Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with its scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.
--
Michael

#5Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Michael Paquier (#4)
Re: Replication & recovery_min_apply_delay

On 08.07.2019 11:05, Michael Paquier wrote:

On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote:

On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Jan-30, Konstantin Knizhnik wrote:

I wonder if it can be considered an acceptable solution to the problem, or
if there is some better approach?

I didn't find one.

It sounds like you are in agreement that there is a problem and this
is the best solution. I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?

Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with its scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.
--
Michael

I am sorry for the delay in answering.
It looks like I did not notice your reply. :(
Can you please explain why it is not correct to iterate through WAL
using a WAL reader to get the last LSN?
From my point of view it may not be the most efficient way, but it should
return the correct value, shouldn't it?
Can you suggest some better way to calculate the last LSN?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Konstantin Knizhnik (#5)
Re: Replication & recovery_min_apply_delay

On 2019-Jul-31, Konstantin Knizhnik wrote:

On 08.07.2019 11:05, Michael Paquier wrote:

Please note that I have not looked at that stuff in detail, but I
find the proposed patch kind of ugly, with its scan of the last segment
using a WAL reader to find out what the last LSN is and react on
that. This does not feel right.

I am sorry for the delay in answering.
It looks like I did not notice your reply. :(
Can you please explain why it is not correct to iterate through WAL using
a WAL reader to get the last LSN?

I agree that it's conceptually ugly, but as I mentioned in my previous
reply, I tried several other strategies before giving up and ended up
concluding that this way was a good way to solve the problem.

I don't endorse the exact patch submitted, though. I think it should
have a lot more commentary on what the code is doing and why.

As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#6)
Re: Replication & recovery_min_apply_delay

On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:

As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.

Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the case
where a WAL receiver is not started. So I think that we should track
the last replayed LSN, or the last received LSN if a WAL receiver is up
and running, whichever is newer. Splitting the WAL receiver restart logic
into a separate routine is a good idea in itself; the patch attempting
to make primary_conninfo reloadable could make use of that.
--
Michael
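
Concretely, the bookkeeping suggested above might look roughly like the
helper below. This is an illustrative sketch only, not code from any patch
in this thread: GetXLogReplayRecPtr() and GetWalRcvWriteRecPtr() are
existing backend calls of that era, but the helper name and its wiring are
hypothetical, and the sketch glosses over the reset problem just mentioned.

/*
 * Hypothetical sketch: pick a streaming start point without rescanning WAL,
 * using whichever is newer of the last-replayed LSN and the position the
 * WAL receiver last reported.  The shared received position is reset when
 * streaming is (re)requested, which is exactly the fragility noted above.
 */
static XLogRecPtr
GuessStreamingStartPoint(void)
{
	TimeLineID	tli;
	XLogRecPtr	replayed = GetXLogReplayRecPtr(&tli);
	XLogRecPtr	received = GetWalRcvWriteRecPtr(NULL, &tli);

	/* If no WAL receiver has run since startup, received stays at 0/0. */
	return Max(replayed, received);
}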

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#7)
Re: Replication & recovery_min_apply_delay

On 2019-Aug-02, Michael Paquier wrote:

On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:

As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.

Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the case
where a WAL receiver is not started. So I think that we should track
the last replayed LSN, or the last received LSN if a WAL receiver is up
and running, whichever is newer. Splitting the WAL receiver restart logic
into a separate routine is a good idea in itself; the patch attempting
to make primary_conninfo reloadable could make use of that.

Konstantin, any interest in trying this?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Alvaro Herrera (#8)
Re: Replication & recovery_min_apply_delay

On 04.09.2019 1:22, Alvaro Herrera wrote:

On 2019-Aug-02, Michael Paquier wrote:

On Wed, Jul 31, 2019 at 04:43:26PM -0400, Alvaro Herrera wrote:

As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member. I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.

Hmmm. Instead of that, wouldn't it be cleaner to maintain, in the
context of the startup process, a marker similar to receivedUpto for
the last LSN? The issue with this one is that it gets reset easily, so
we would lose track of it, and we also need to account for the case
where a WAL receiver is not started. So I think that we should track
the last replayed LSN, or the last received LSN if a WAL receiver is up
and running, whichever is newer. Splitting the WAL receiver restart logic
into a separate routine is a good idea in itself; the patch attempting
to make primary_conninfo reloadable could make use of that.

Konstantin, any interest in trying this?

Sorry, I do not understand this proposal.

and we also need to account for the case where a WAL receiver is not started.

Maybe I missed something, but what this patch is trying to achieve is
to launch the WAL receiver before already-received transactions are applied.
So the WAL receiver is definitely not started at this moment.

receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at the
moment when we need to know the LSN of the last record.

Certainly it should be possible to somehow persist receivedUpto, so we
do not need to scan WAL to determine the last LSN at the next start.
But persisting the last LSN introduces a lot of questions and problems.
For example, when does it need to be flushed to disk? If it is done
after each received transaction, then performance can suffer
significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it will significantly complicate the patch and make it more
sensitive to various errors.

I wonder what is wrong with determining the LSN of the last record by just
scanning WAL?
Certainly it is not the most efficient way. But I do not expect that
somebody will have hundreds or thousands of megabytes of WAL.
Michael, do you see some other problems with the GetLastLSN() function
apart from its execution time?

IMHO one of the main advantages of this patch is that it is very simple.
We need to scan WAL to locate the last LSN only if recovery_min_apply_delay
is set.
So this patch should not affect the performance of all other cases.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Konstantin Knizhnik (#9)
Re: Replication & recovery_min_apply_delay

On Wed, Sep 4, 2019 at 4:37 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at the
moment when we need to know the LSN of the last record.

Certainly it should be possible to somehow persist receivedUpto, so we
do not need to scan WAL to determine the last LSN at the next start.
But persisting the last LSN introduces a lot of questions and problems.
For example, when does it need to be flushed to disk? If it is done
after each received transaction, then performance can suffer
significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it will significantly complicate the patch and make it more
sensitive to various errors.

I think we don't necessarily need the exact value of receivedUpto, but it
could be a place to start scanning WAL from. We currently call
UpdateControlFile() in a lot of places; in particular, we call it at each
checkpoint. Even if we started scanning WAL from a value of receivedUpto
that is one checkpoint old, we could still save a lot of resources.

I wonder what is wrong with determining LSN of last record by just
scanning WAL?
Certainly it is not the most efficient way. But I do not expect that
somebody will have hundreds or thousands megabytes of WAL.
Michael, do you see some other problems with GetLastLSN() functions
except time of its execution?

As I understand it, this patch fixes a problem with a very large recovery
apply delay. In this case, the amount of accumulated WAL corresponding to
that delay could also be huge. Scanning all of this WAL could be costly,
and it would be nice to avoid that.
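
To put rough, hypothetical numbers on it: a primary generating 10 MB/s of
WAL combined with a 12-hour recovery_min_apply_delay leaves on the order of
10 MB/s * 43,200 s ≈ 430 GB of WAL on the standby, all of which GetLastLSN()
would have to decode record by record before streaming could be requested.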

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#10)
Re: Replication & recovery_min_apply_delay

On Tue, Sep 10, 2019 at 12:46:49AM +0300, Alexander Korotkov wrote:

On Wed, Sep 4, 2019 at 4:37 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

receivedUpto is just a static variable in xlog.c, maintained by the WAL
receiver. But as I mentioned above, the WAL receiver is not started at the
moment when we need to know the LSN of the last record.

Certainly it should be possible to somehow persist receivedUpto, so we
do not need to scan WAL to determine the last LSN at the next start.
But persisting the last LSN introduces a lot of questions and problems.
For example, when does it need to be flushed to disk? If it is done
after each received transaction, then performance can suffer
significantly.
If it is done more or less asynchronously, then there is a risk that we
request streaming from the wrong position.
In any case it will significantly complicate the patch and make it more
sensitive to various errors.

I think we don't necessarily need the exact value of receivedUpto, but it
could be a place to start scanning WAL from. We currently call
UpdateControlFile() in a lot of places; in particular, we call it at each
checkpoint. Even if we started scanning WAL from a value of receivedUpto
that is one checkpoint old, we could still save a lot of resources.

A minimum to set would be the minimum consistency LSN, but there are a
lot of gotchas to take into account when it comes to crash recovery.

As I understand it, this patch fixes a problem with a very large recovery
apply delay. In this case, the amount of accumulated WAL corresponding to
that delay could also be huge. Scanning all of this WAL could be costly,
and it would be nice to avoid that.

Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.

There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is a long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if there is any delay defined even if restore_command is
enabled. We cannot assume either that everybody will want to start a
WAL receiver in this configuration if there is archiving behind with a
lot of segments which allow for a larger catchup window..
--
Michael

#12Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#11)
Re: Replication & recovery_min_apply_delay

On Tue, Sep 10, 2019 at 03:23:25PM +0900, Michael Paquier wrote:

Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.

There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is a long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if there is any delay defined even if restore_command is
enabled. We cannot assume either that everybody will want to start a
WAL receiver in this configuration if there is archiving behind with a
lot of segments which allow for a larger catchup window..

Two months later, we still have a patch registered in the CF which is
incorrect in a couple of aspects and which breaks scenarios that are
documented. Shouldn't we actually have a GUC to control the startup of
the WAL receiver instead?
--
Michael

#13Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Michael Paquier (#12)
1 attachment(s)
Re: Replication & recovery_min_apply_delay

On 15.11.2019 5:52, Michael Paquier wrote:

On Tue, Sep 10, 2019 at 03:23:25PM +0900, Michael Paquier wrote:

Yes, I suspect that it could be very costly in some configurations if
there is a large gap between the last replayed LSN and the last LSN
the WAL receiver has flushed.

There is an extra thing, which has not been mentioned yet on this
thread, that we need to be very careful about:
<para>
When the standby is started and <varname>primary_conninfo</varname> is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and
a corresponding walsender process in the primary.
</para>
This is a long-standing behavior, and based on the first patch
proposed we would start a WAL receiver once consistency has been
reached if there is any delay defined even if restore_command is
enabled. We cannot assume either that everybody will want to start a
WAL receiver in this configuration if there is archiving behind with a
lot of segments which allow for a larger catchup window..

Two months later, we still have a patch registered in the CF which is
incorrect in a couple of aspects and which breaks scenarios that are
documented. Shouldn't we actually have a GUC to control the startup of
the WAL receiver instead?
--
Michael

Attached please find a rebased version of the patch with a
"wal_receiver_start_condition" GUC added (preserving the original
behavior by default).
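
For illustration, on the standby the new setting would sit alongside the
existing replication parameters roughly like this (hypothetical values;
only wal_receiver_start_condition comes from this patch, the rest are
standard parameters):

# postgresql.conf on the standby (illustrative values)
primary_conninfo = 'host=primary port=5432 user=replicator'
primary_slot_name = 'phys_slot'
recovery_min_apply_delay = '12h'
wal_receiver_start_condition = 'consistency'	# proposed GUC, default is 'catchup'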

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wal_apply_delay-2.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f837703..bf4e9d4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4291,6 +4291,24 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_condition">
+      <term><varname>wal_receiver_start_condition</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_condition</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        By default (the <literal>"catchup"</literal> condition) the standby will connect to the primary after replaying all
+        WAL files available in the archive.
+        But with a large <varname>recovery_min_apply_delay</varname> this may lead to WAL space overflow on the master,
+        because no WAL receiver is launched and so nobody is fetching changes from the master.
+        Changing the start condition to <literal>"consistency"</literal> forces the WAL receiver to start once
+        the consistency point is reached.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3b766e6..02e3fc2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -802,6 +802,7 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
  */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
+static bool stopOnError = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -3992,6 +3993,49 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 }
 
 /*
+ * Find the latest LSN available in local WAL, scanning forward from the given LSN.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+	XLogPageReadPrivate private;
+	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL, &XLogPageRead, &private);
+
+	stopOnError = true;
+	while (XLogReadRecord(xlogreader, lsn, &errormsg) != NULL)
+	{
+		lsn = InvalidXLogRecPtr;
+	}
+	stopOnError = false;
+	lsn = xlogreader->EndRecPtr;
+	XLogReaderFree(xlogreader);
+
+	return lsn;
+}
+
+/*
+ * Launch WalReceiver starting from last LSN if not started yet.
+ */
+static void
+StartWalRcv(XLogRecPtr currLsn)
+{
+	if (!WalRcvStreaming() && PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+	{
+		XLogRecPtr lastLSN = GetLastLSN(currLsn);
+		if (lastLSN != InvalidXLogRecPtr)
+		{
+			curFileTLI = ThisTimeLineID;
+			RequestXLogStreaming(ThisTimeLineID, lastLSN, PrimaryConnInfo,
+								 PrimarySlotName);
+		}
+	}
+}
+
+/*
  * Remove WAL files that are not part of the given timeline's history.
  *
  * This is called during recovery, whenever we switch to follow a new
@@ -6013,6 +6057,13 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (secs <= 0 && microsecs <= 0)
 		return false;
 
+	/*
+	 * Start WAL receiver if not started yet, to avoid WAL overflow at the primary node
+	 * or a large gap between primary and replica when an apply delay is specified.
+	 */
+	if (wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+		StartWalRcv(record->EndRecPtr);
+
 	while (true)
 	{
 		ResetLatch(&XLogCtl->recoveryWakeupLatch);
@@ -11821,6 +11872,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						return false;
 
 					/*
+					 * If WAL receiver was already started because of the apply delay,
+					 * then restart it.
+					 */
+					if (WalRcvStreaming())
+						ShutdownWalRcv();
+
+                    /*
 					 * If primary_conninfo is set, launch walreceiver to try
 					 * to stream the missing WAL.
 					 *
@@ -11996,6 +12054,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				if (stopOnError)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f54ae76..87e0bda 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -75,6 +75,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int         wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b3769b..ce2a735 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -459,6 +459,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
+
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4577,6 +4585,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index be02a76..c88fd4a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -330,6 +330,7 @@
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
+#wal_receiver_start_condition = 'catchup' # wal receiver start condition
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e12a934..d8bec2a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -22,11 +22,17 @@
 #include "pgtime.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
-
+extern int  wal_receiver_start_condition;
 /*
  * MAXCONNINFO: maximum size of a connection string.
  *
#14Michael Paquier
michael@paquier.xyz
In reply to: Konstantin Knizhnik (#13)
Re: Replication & recovery_min_apply_delay

On Fri, Nov 15, 2019 at 06:48:01PM +0300, Konstantin Knizhnik wrote:

Attached please find a rebased version of the patch with a
"wal_receiver_start_condition" GUC added (preserving the original
behavior by default).

Konstantin, please be careful with the patch entry in the CF app.
This was marked as waiting on author, but that does not reflect the
reality as you have sent a new patch, so I have moved the patch to
next CF instead, with "Needs review" as status.
--
Michael

#15Asim R P
apraveen@pivotal.io
In reply to: Michael Paquier (#14)
2 attachment(s)
Re: Replication & recovery_min_apply_delay

Replay lag can build up naturally, even when recovery_min_apply_delay
is not set, because WAL generation on master is concurrent and WAL
replay on standby is performed by a single process.

Hao and I have incorporated the new GUC from Konstantin's patch
and used it to start the WAL receiver in the main replay loop, regardless
of whether recovery_min_apply_delay is set.

Instead of going through each existing WAL record to determine the
streaming start point, the WAL receiver is changed to persist the WAL
segment number of a new WAL segment just before it is created. WAL
streaming always begins at a WAL segment boundary, so the persisted
segment number can easily be used to compute the start point, which is
the beginning of that segment.

We also have a TAP test to demonstrate the problem in two situations -
(1) WAL receiver process dies due to replication connection getting
disconnected and (2) standby goes through restart. The test uses
recovery_min_apply_delay to delay the replay and expects new commits
made after WAL receiver exit to not block.

Asim

Attachments:

v2-0001-Test-that-replay-of-WAL-logs-on-standby-does-not-.patch (application/octet-stream)
From 0ba35d82f672745809a964ff74936cf1a80f040c Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Tue, 21 Jan 2020 18:36:30 +0530
Subject: [PATCH v2 1/2] Test that replay of WAL logs on standby does not
 affect syncrep

The test sets up synchronous replication and induces replay lag.  The
replication connection is broken and new commits are made on master.
The test expects the commits to not block, in spite of the replay lag.
It fails if the commits take longer than a timeout.  The value of this
timeout is much less than the total replay lag.  If the commits do not
block, it is confirmed that the WAL streaming is re-established without
waiting for the startup process to finish replaying WAL already
available in pg_wal directory.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/test/recovery/t/018_replay_lag_syncrep.pl | 188 ++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 0000000000..9cd79fdc89
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,188 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Switch to a new WAL file after standby is created.  This gives the
+# standby a chance to save the new WAL file's beginning as replication
+# start point.
+$node_master->safe_psql('postgres',	'create table dummy(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of 2 seconds.  The timeout should not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 2,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# Like in previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with 2 second timeout.  The timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 2,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.14.3 (Apple Git-98)

v2-0002-Start-WAL-receiver-before-startup-process-replays.patch (application/octet-stream)
From f1962630e2c4f7344b58ebd1583227068069ffed Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Tue, 21 Jan 2020 18:36:58 +0530
Subject: [PATCH v2 2/2] Start WAL receiver before startup process replays
 existing WAL

If the WAL receiver is started only after the startup process finishes
replaying WAL already available in pg_wal, synchronous replication is
impacted adversely.  Consider a temporary network outage causing the
streaming replication connection to break.  This leads to the exit of the
WAL receiver process.  If the startup process has fallen behind, it may
take a long time to finish replaying WAL and then start walreceiver again
to re-establish streaming replication.  Commits on master will have to
wait all this while for the standby to flush WAL up to the commit LSN.

This patch starts the WAL receiver from the main redo loop, if it is found
not running, after a consistent state has been reached, so as to alleviate
the problem described above.  We borrowed a new GUC from Konstantin's patch
to control this behavior: wal_receiver_start_condition.

WAL receiver now remembers the start point to request streaming from in
pg_control.  Before creating a new WAL segment file, it records the new
WAL segment number in pg_control.  If the WAL receiver process exits and
must restart, streaming starts from the beginning of the saved segment
file.  We are leveraging the fact that WAL streaming always begins at
a WAL segment boundary.

Alternatives we thought of (but did not implement) for persisting the
starting point: (1) postgresql.auto.conf file, similar to how
primary_conninfo is remembered.  This option requires creating a new GUC
that represents the starting point.  Start point is never set by a user,
so using a GUC to represent it does not seem appropriate.  (2) introduce
a new flat file.  This incurs the overhead to maintain an additional
flat file.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c             | 31 ++++++++++++
 src/backend/replication/walreceiver.c         | 68 +++++++++++++++++++++++++++
 src/backend/replication/walreceiverfuncs.c    | 20 ++++++--
 src/backend/utils/misc/guc.c                  | 17 +++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/bin/pg_controldata/pg_controldata.c       |  4 ++
 src/include/catalog/pg_control.h              |  7 +++
 src/include/replication/walreceiver.h         |  7 +++
 src/test/recovery/t/018_replay_lag_syncrep.pl |  7 +++
 9 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..0d88fbdd25 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7082,6 +7082,37 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Start WAL receiver without waiting for startup process to
+				 * finish replay, so that streaming replication is established
+				 * at the earliest.  When the replication is configured to be
+				 * synchronous this would unblock commits waiting for WAL to
+				 * be written and/or flushed by synchronous standby.
+				 */
+				if (StandbyModeRequested &&
+					reachedConsistency &&
+					wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+					!WalRcvStreaming())
+				{
+					XLogRecPtr startpoint;
+					XLogSegNo startseg;
+					TimeLineID startpointTLI;
+					LWLockAcquire(ControlFileLock, LW_SHARED);
+					startseg = ControlFile->lastFlushedSeg;
+					startpointTLI = ControlFile->lastFlushedSegTLI;
+					LWLockRelease(ControlFileLock);
+					if (startpointTLI > 0)
+					{
+						elog(LOG, "found last flushed segment %lu on time line %d, starting WAL receiver",
+							 startseg, startpointTLI);
+						XLogSegNoOffsetToRecPtr(startseg, 0, wal_segment_size, startpoint);
+						RequestXLogStreaming(startpointTLI,
+											 startpoint,
+											 PrimaryConnInfo,
+											 PrimarySlotName);
+					}
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a5e85d32f3..0168ded801 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
 #include "funcapi.h"
@@ -77,11 +78,14 @@ bool		wal_receiver_create_temp_slot;
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int         wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
+static ControlFileData *ControlFile = NULL;
+
 #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
 
 /*
@@ -163,6 +167,45 @@ ProcessWalRcvInterrupts(void)
 }
 
 
+/*
+ * Persist startpoint to pg_control file.  This is used to start replication
+ * without waiting for startup process to let us know where to start streaming
+ * from.
+ */
+static void
+SaveStartPoint(XLogRecPtr startpoint, TimeLineID startpointTLI)
+{
+	XLogSegNo oldseg, startseg;
+	TimeLineID oldTLI;
+
+	XLByteToSeg(startpoint, startseg, wal_segment_size);
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+#ifdef USE_ASSERT_CHECKING
+	/*
+	 * On a given timeline, the WAL segment to start streaming from should
+	 * never move backwards.
+	 */
+	if (ControlFile->lastFlushedSegTLI == startpointTLI)
+		Assert(ControlFile->lastFlushedSeg <= startseg);
+#endif
+
+	oldseg = ControlFile->lastFlushedSeg;
+	oldTLI = ControlFile->lastFlushedSegTLI;
+	if (oldseg < startseg || oldTLI != startpointTLI)
+	{
+		ControlFile->lastFlushedSeg = startseg;
+		ControlFile->lastFlushedSegTLI = startpointTLI;
+		UpdateControlFile();
+		elog(DEBUG3,
+			 "lastFlushedSeg (seg, TLI) old: (%lu, %u), new: (%lu, %u)",
+			 oldseg, oldTLI, startseg, startpointTLI);
+	}
+
+	LWLockRelease(ControlFileLock);
+}
+
 /* Main entry point for walreceiver process */
 void
 WalReceiverMain(void)
@@ -304,6 +347,10 @@ WalReceiverMain(void)
 	if (sender_host)
 		pfree(sender_host);
 
+	bool found;
+	ControlFile = ShmemInitStruct("Control File", sizeof(ControlFileData), &found);
+	Assert(found);
+
 	first_stream = true;
 	for (;;)
 	{
@@ -1055,6 +1102,27 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
+			/*
+			 * When a WAL segment file is completely filled,
+			 * LogstreamResult.Flush points to the beginning of the new WAL
+			 * segment file that will be created shortly.  Before sending a
+			 * reply with a LSN from the new WAL segment for the first time,
+			 * remember the LSN in pg_control.  The LSN is used as the
+			 * startpoint to start streaming again if the WAL receiver process
+			 * exits and starts again.
+			 *
+			 * It is important to update the LSN's segment number in
+			 * pg_control before including it in a reply back to the WAL
+			 * sender.  Once WAL sender receives the flush LSN from standby
+			 * reply, any older WAL segments that do not contain the flush LSN
+			 * may be cleaned up.  If the WAL receiver dies after sending a
+			 * reply but before updating pg_control, it is possible that the
+			 * starting segment saved in pg_control is no longer available on
+			 * master when it attempts to resume streaming.
+			 */
+			if (XLogSegmentOffset(LogstreamResult.Flush, wal_segment_size) == 0)
+				SaveStartPoint(LogstreamResult.Flush, ThisTimeLineID);
+
 			XLogWalRcvSendReply(false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..955b8fcf83 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -239,10 +239,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -253,12 +249,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	else
 		walrcv->slotname[0] = '\0';
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in the pg_wal directory.  If the standby is starting up
+	 * after a clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e44f71e991..7633b949fd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -464,6 +464,12 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4613,6 +4619,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		check_ssl_max_protocol_version, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e1048c0047..76c01d640a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -332,6 +332,7 @@
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
+#wal_receiver_start_condition = 'catchup' # wal receiver start condition
 
 # - Subscribers -
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 19e21ab491..f98f36ffe5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -234,6 +234,10 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified:             %s\n"),
 		   pgctime_str);
+	printf(_("Latest flushed WAL segment number:    %lu\n"),
+		   ControlFile->lastFlushedSeg);
+	printf(_("Latest flushed TimeLineID:            %u\n"),
+		   ControlFile->lastFlushedSegTLI);
 	printf(_("Latest checkpoint location:           %X/%X\n"),
 		   (uint32) (ControlFile->checkPoint >> 32),
 		   (uint32) ControlFile->checkPoint);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..27260bbea5 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -143,6 +143,11 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * lastFlushedSeg is the WAL segment number of the most recently flushed
+	 * WAL file by walreceiver.  It is updated by walreceiver when a received
+	 * WAL record falls on a new WAL segment file.  This is used as the start
+	 * point to resume WAL streaming if it is stopped.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -165,6 +170,8 @@ typedef struct ControlFileData
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogSegNo	lastFlushedSeg;
+	TimeLineID	lastFlushedSegTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..36daa5fa48 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -22,11 +22,18 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern bool wal_receiver_create_temp_slot;
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int  wal_receiver_start_condition;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
index 9cd79fdc89..121a0a9171 100644
--- a/src/test/recovery/t/018_replay_lag_syncrep.pl
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -74,6 +74,8 @@ $node_standby->init_from_backup($node_master, $backup_name,
 								has_streaming => 1);
 $node_standby->append_conf('postgresql.conf',
 						   q[primary_slot_name = 'phys_slot']);
+$node_standby->append_conf('postgresql.conf',
+						   q[wal_receiver_start_condition = 'consistency']);
 # Enable debug logging in standby
 $node_standby->append_conf('postgresql.conf',
 						   q[log_min_messages = debug5]);
@@ -125,6 +127,11 @@ my $walsender_pid = $node_master->safe_psql(
 # Kill walsender, so that the replication connection breaks.
 kill 'SIGTERM', $walsender_pid;
 
+# Wait for 10 seconds (recovery_min_apply_delay) to give startup
+# process a chance to restart WAL receiver.
+note('waiting for startup process to restart WAL receiver');
+sleep 10;
+
 # The replication connection should be re-established much earlier than
 # what it takes to finish replay.  Try to commit a transaction with a
 # timeout of 2 seconds.  The timeout should not be hit.
-- 
2.14.3 (Apple Git-98)

#16Asim Rama Praveen
apraveen@pivotal.io
In reply to: Asim R P (#15)
Re: Replication & recovery_min_apply_delay

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

The logic to start WAL receiver early should not be coupled with recovery_min_apply_delay GUC. WAL receiver's delayed start affects replication in general, even when the GUC is not set.

A better fix would be to start WAL receiver in the main replay loop, as soon as consistent state has been reached.

As noted during previous reviews, scanning all WAL just to determine streaming start point seems slow. A faster solution seems desirable.

The new status of this patch is: Waiting on Author

#17Asim R P
apraveen@pivotal.io
In reply to: Asim Rama Praveen (#16)
Re: Replication & recovery_min_apply_delay

On Mon, Jan 27, 2020 at 5:11 PM Asim Rama Praveen <apraveen@pivotal.io> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

The logic to start WAL receiver early should not be coupled with
recovery_min_apply_delay GUC. WAL receiver's delayed start affects
replication in general, even when the GUC is not set.

A better fix would be to start WAL receiver in the main replay loop, as
soon as consistent state has been reached.

As noted during previous reviews, scanning all WAL just to determine
streaming start point seems slow. A faster solution seems desirable.

The new status of this patch is: Waiting on Author

That review is for Konstantin's patch "wal_apply_delay-2.patch". The
latest patch on this thread addresses the above review comments, so I've
changed the status in the commitfest app back to "needs review".

Asim

#18Asim R P
apraveen@pivotal.io
In reply to: Asim R P (#17)
1 attachment(s)
Re: Replication & recovery_min_apply_delay

A key challenge here is determining the starting point for the WAL
receiver when the startup process launches it while still replaying WAL
that has already been received. Hao and I wrote a much faster and less
intrusive way to determine that starting point: scan the first page of
each WAL segment file, starting from the one currently being read for
replay. If the first page is valid, move on to the next segment file and
check its first page. Continue, one segment file at a time, until either
the segment file doesn't exist or its first page is not valid. The
starting point is then the start address of the previous (last valid)
segment.

There is no need to read to the end of WAL one record at a time, as the
original proposal upthread did, and the most recently flushed LSN does
not need to be persisted on disk.
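
In rough outline, the scan looks like this (condensed from GetLastLSN in
the attached patch; FirstPageHeaderIsValid is only a hypothetical
shorthand for the long-header checks that the patch performs inline):

static XLogRecPtr
GetLastLSN(XLogRecPtr lsn)
{
	XLogSegNo	lastValidSegNo;
	char		readBuf[XLOG_BLCKSZ];

	/* lsn falls inside a segment we know is valid; probe the next one */
	XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
	XLogSegNoOffsetToRecPtr(lastValidSegNo + 1, 0, wal_segment_size, lsn);

	/* walk forward one segment at a time while the first page validates */
	while (XLogReadFirstPage(lsn, readBuf) == XLOG_BLCKSZ &&
		   FirstPageHeaderIsValid((XLogLongPageHeader) readBuf, lsn))
	{
		XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
		XLogSegNoOffsetToRecPtr(lastValidSegNo + 1, 0, wal_segment_size, lsn);
	}

	/* stream from the start of the last segment whose first page was valid */
	XLogSegNoOffsetToRecPtr(lastValidSegNo, 0, wal_segment_size, lsn);
	return lsn;
}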

Please see attached patch which also contains a TAP test to induce replay
lag and validate the fix.

Asim

Attachments:

v3-0001-Start-WAL-receiver-before-startup-process-replays.patchapplication/octet-stream; name=v3-0001-Start-WAL-receiver-before-startup-process-replays.patchDownload
From f4c3520d23caf61554f1346dca54ef6ef3085138 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Fri, 17 Jan 2020 18:14:41 +0530
Subject: [PATCH v3] Start WAL receiver before startup process replays existing
 WAL

If WAL receiver is started only after the startup process finishes replaying
WAL already available in pg_wal, synchronous replication is impacted
adversely.  Consider a temporary network outage causing the streaming
replication connection to break.  This leads to exit of the WAL receiver
process.  If the startup process has fallen behind, it may take a long
time to finish replaying WAL and then start walreceiver again to
re-establish streaming replication.  Commits on master will have to wait
all this while for the standby to flush WAL up to the commit LSN.

This experience can be alleviated if replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to do so by starting WAL receiver as soon as consistent state
is reached.

The start point to request streaming from is set to the beginning of the
most recently flushed WAL segment.  To determine this, the startup
process scans the first page of segments, starting from the segment
currently being read, one file at a time.

A new GUC, wal_receiver_start_condition, controls the new behavior.
When set to 'consistency', the new behavior takes effect.  The default
value is 'replay', which keeps the current behavior.

A TAP test is added to demonstrate the problem and validate the fix.

Discussion:
https://www.postgresql.org/message-id/CANXE4TewY1WNgu5J5ek38RD%2B2m9F2K%3DfgbWubjv9yG0BeyFxRQ%40mail.gmail.com
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c             | 118 +++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/replication/walreceiverfuncs.c    |  20 ++-
 src/backend/utils/misc/guc.c                  |  17 +++
 src/include/replication/walreceiver.h         |   7 +
 src/test/recovery/t/018_replay_lag_syncrep.pl | 192 ++++++++++++++++++++++++++
 6 files changed, 345 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fa446ffa4..c2f0782a24 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5577,6 +5577,98 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+static int
+XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)
+{
+	int fd;
+	XLogSegNo segno;
+	char xlogfname[MAXFNAMELEN];
+
+	XLByteToSeg(targetPagePtr, segno, wal_segment_size);
+	elog(DEBUG3, "reading first page of segment %lu", segno);
+	fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+	if (fd == -1)
+		return -1;
+
+	/* Seek to the beginning, we want to check if the first page is valid */
+	if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+	{
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		close(fd);
+		elog(ERROR, "could not seek XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	{
+		close(fd);
+		elog(ERROR, "could not read from XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	close(fd);
+	return XLOG_BLCKSZ;
+}
+
+/*
+ * Find the LSN that points to the beginning of the segment file most recently
+ * flushed by WAL receiver.  It will be used as start point by new instance of
+ * WAL receiver.
+ *
+ * The XLogReaderState abstraction is not suited for this purpose.  The
+ * interface it offers is XLogReadRecord, which is not suited to read a
+ * specific page from WAL.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogSegNo lastValidSegNo;
+	char readBuf[XLOG_BLCKSZ];
+
+	XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+	/*
+	 * We know that lsn falls in a valid segment.  Start searching from the
+	 * next segment.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+
+	elog(LOG, "scanning WAL for last valid segment, starting from %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+
+	while (XLogReadFirstPage(lsn, readBuf) == XLOG_BLCKSZ)
+	{
+		/*
+		 * Validate page header, it must be a long header because we are
+		 * inspecting the first page in a segment file.  The big if condition
+		 * is modelled according to XLogReaderValidatePageHeader.
+		 */
+		XLogLongPageHeader longhdr = (XLogLongPageHeader) readBuf;
+		if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+			(longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+			((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+			(longhdr->xlp_sysid != ControlFile->system_identifier) ||
+			(longhdr->xlp_seg_size != wal_segment_size) ||
+			(longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+			(longhdr->std.xlp_pageaddr != lsn) ||
+			(longhdr->std.xlp_tli != ThisTimeLineID))
+		{
+			break;
+		}
+		XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+		XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+	}
+
+	/*
+	 * The last valid segment number is previous to the one that was just
+	 * found to be invalid.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo, 0, wal_segment_size, lsn);
+
+	elog(LOG, "last valid segment number = %lu", lastValidSegNo);
+
+	return lsn;
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7105,6 +7197,27 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Start WAL receiver without waiting for startup process to
+				 * finish replay, so that streaming replication is established
+				 * at the earliest.  When the replication is configured to be
+				 * synchronous this would unblock commits waiting for WAL to
+				 * be written and/or flushed by synchronous standby.
+				 */
+				if (StandbyModeRequested &&
+					reachedConsistency &&
+					wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+					!WalRcvStreaming())
+				{
+					XLogRecPtr startpoint = GetLastLSN(record->xl_prev);
+					elog(LOG, "starting WAL receiver, startpoint %X/%X",
+						 (uint32) (startpoint >> 32), (uint32) startpoint);
+					RequestXLogStreaming(ThisTimeLineID,
+										 startpoint,
+										 PrimaryConnInfo,
+										 PrimarySlotName);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -12069,11 +12182,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
 
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..9f8e0060d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -78,6 +78,7 @@ bool		wal_receiver_create_temp_slot;
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int 		wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..955b8fcf83 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -239,10 +239,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -253,12 +249,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	else
 		walrcv->slotname[0] = '\0';
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in the pg_wal directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..1270f58efa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -232,6 +232,12 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
  * NOTE! Option values may not contain double quotes!
  */
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry bytea_output_options[] = {
 	{"escape", BYTEA_OUTPUT_ESCAPE, false},
 	{"hex", BYTEA_OUTPUT_HEX, false},
@@ -4688,6 +4694,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..36daa5fa48 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -22,11 +22,18 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern bool wal_receiver_create_temp_slot;
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int  wal_receiver_start_condition;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 0000000000..e82d8a0a64
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,192 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+# Enable early WAL receiver startup
+$node_standby->append_conf('postgresql.conf',
+						   q[wal_receiver_start_condition = 'consistency']);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of recovery_min_apply_delay + 2 seconds.  The timeout should
+# not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+my $replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after WAL receiver restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication resumes in spite of replay lag');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# Like in previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with recovery_min_apply_delay + 2 second timeout.  The
+# timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+$replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after standby restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication starts in spite of replay lag');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.14.3 (Apple Git-98)

#19Daniel Gustafsson
daniel@yesql.se
In reply to: Asim R P (#18)
Re: Replication & recovery_min_apply_delay

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

#20Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Daniel Gustafsson (#19)
1 attachment(s)
Re: Replication & recovery_min_apply_delay

On Mon, 3 Aug 2020 at 15:02, Daniel Gustafsson <daniel@yesql.se> wrote:

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

Hi all,
I took the v3 patch [1] from this thread and rebased it on commit
head (5fedf7417b69295294b154a21).

Please find the attached patch for review.

[1] v3 patch: /messages/by-id/CANXE4TeinQdw+M2Or0kTR24eRgWCOg479N8=gRvj9Ouki-tZFg@mail.gmail.com
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Start-WAL-receiver-before-startup-process-replays.patchapplication/octet-stream; name=v4-0001-Start-WAL-receiver-before-startup-process-replays.patchDownload
From 324a925b133700faa3c8ff162e5337677067c6ec Mon Sep 17 00:00:00 2001
From: Mahendra Singh Thalor <mahi6run@gmail.com>
Date: Tue, 26 Oct 2021 12:22:55 -0700
Subject: [PATCH v4] Start WAL receiver before startup process replays
 existing WAL

If WAL receiver is started only after the startup process finishes replaying
WAL already available in pg_wal, synchronous replication is impacted
adversely.  Consider a temporary network outage causing the streaming
replication connection to break.  This leads to exit of the WAL receiver
process.  If the startup process has fallen behind, it may take a long
time to finish replaying WAL and then start walreceiver again to
re-establish streaming replication.  Commits on master will have to wait
all this while for the standby to flush WAL up to the commit LSN.

This experience can be alleviated if replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to do so by starting WAL receiver as soon as consistent state
is reached.

The start point to request streaming from is set to the beginning of the
most recently flushed WAL segment.  To determine this, the startup
process scans the first page of segments, starting from the segment
currently being read, one file at a time.

A new GUC, wal_receiver_start_condition, controls the new behavior.
When set to 'consistency', the new behavior takes effect.  The default
value is 'replay', which keeps the current behavior.

A TAP test is added to demonstrate the problem and validate the fix.

Note: rebased on commit 5fedf7417b69295294b154a219edd8a26eaa6ab6
---
 src/backend/access/transam/xlog.c             | 120 ++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/replication/walreceiverfuncs.c    |  20 +-
 src/backend/utils/misc/guc.c                  |  17 ++
 src/include/replication/walreceiver.h         |   7 +
 src/test/recovery/t/018_replay_lag_syncrep.pl | 192 ++++++++++++++++++
 6 files changed, 346 insertions(+), 11 deletions(-)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f547efd294..77b1da91fd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5817,6 +5817,98 @@ CleanupAfterArchiveRecovery(TimeLineID EndOfLogTLI, XLogRecPtr EndOfLog)
 	}
 }
 
+static int
+XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)
+{
+	int fd;
+	XLogSegNo segno;
+	char xlogfname[MAXFNAMELEN];
+
+	XLByteToSeg(targetPagePtr, segno, wal_segment_size);
+	elog(DEBUG3, "reading first page of segment %lu", segno);
+	fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+	if (fd == -1)
+		return -1;
+
+	/* Seek to the beginning, we want to check if the first page is valid */
+	if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+	{
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		close(fd);
+		elog(ERROR, "could not seek XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	{
+		close(fd);
+		elog(ERROR, "could not read from XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	close(fd);
+	return XLOG_BLCKSZ;
+}
+
+/*
+ * Find the LSN that points to the beginning of the segment file most recently
+ * flushed by WAL receiver.  It will be used as start point by new instance of
+ * WAL receiver.
+ *
+ * The XLogReaderState abstraction is not suited for this purpose.  The
+ * interface it offers is XLogReadRecord, which is not suited to read a
+ * specific page from WAL.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogSegNo lastValidSegNo;
+	char readBuf[XLOG_BLCKSZ];
+
+	XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+	/*
+	 * We know that lsn falls in a valid segment.  Start searching from the
+	 * next segment.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+
+	elog(LOG, "scanning WAL for last valid segment, starting from %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+
+	while (XLogReadFirstPage(lsn, readBuf) == XLOG_BLCKSZ)
+	{
+		/*
+		 * Validate page header, it must be a long header because we are
+		 * inspecting the first page in a segment file.  The big if condition
+		 * is modelled according to XLogReaderValidatePageHeader.
+		 */
+		XLogLongPageHeader longhdr = (XLogLongPageHeader) readBuf;
+		if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+			(longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+			((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+			(longhdr->xlp_sysid != ControlFile->system_identifier) ||
+			(longhdr->xlp_seg_size != wal_segment_size) ||
+			(longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+			(longhdr->std.xlp_pageaddr != lsn) ||
+			(longhdr->std.xlp_tli != ThisTimeLineID))
+		{
+			break;
+		}
+		XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+		XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+	}
+
+	/*
+	 * The last valid segment number is previous to the one that was just
+	 * found to be invalid.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo, 0, wal_segment_size, lsn);
+
+	elog(LOG, "last valid segment number = %lu", lastValidSegNo);
+
+	return lsn;
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7534,6 +7626,28 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Start WAL receiver without waiting for startup process to
+				 * finish replay, so that streaming replication is established
+				 * at the earliest.  When the replication is configured to be
+				 * synchronous this would unblock commits waiting for WAL to
+				 * be written and/or flushed by synchronous standby.
+				 */
+				if (StandbyModeRequested &&
+					reachedConsistency &&
+					wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+					!WalRcvStreaming())
+				{
+					XLogRecPtr startpoint = GetLastLSN(record->xl_prev);
+					elog(LOG, "starting WAL receiver, startpoint %X/%X",
+						 (uint32) (startpoint >> 32), (uint32) startpoint);
+					RequestXLogStreaming(ThisTimeLineID,
+										 startpoint,
+										 PrimaryConnInfo,
+										 PrimarySlotName,
+										 wal_receiver_create_temp_slot);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -12756,12 +12870,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b90e5ca98e..8cc9fdba6f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int 		wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 6f0acbfdef..1df022e8c2 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -261,10 +261,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -287,12 +283,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 		walrcv->is_temp_slot = create_temp_slot;
 	}
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in the pg_wal directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfd..b2de8039c7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -247,6 +247,12 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
  * NOTE! Option values may not contain double quotes!
  */
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry bytea_output_options[] = {
 	{"escape", BYTEA_OUTPUT_ESCAPE, false},
 	{"hex", BYTEA_OUTPUT_HEX, false},
@@ -5007,6 +5013,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777..362b681ae5 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,10 +24,17 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int  wal_receiver_start_condition;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 0000000000..e82d8a0a64
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,192 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+# Enable early WAL receiver startup
+$node_standby->append_conf('postgresql.conf',
+						   q[wal_receiver_start_condition = 'consistency']);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of recovery_min_apply_delay + 2 seconds.  The timeout should
+# not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+my $replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after WAL receiver restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication resumes in spite of replay lag');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# Like in previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with recovery_min_apply_delay + 2 second timeout.  The
+# timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+$replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after standby restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication starts in spite of replay lag');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.17.1

#21Daniel Gustafsson
daniel@yesql.se
In reply to: Mahendra Singh Thalor (#20)
Re: Replication & recovery_min_apply_delay

On 26 Oct 2021, at 21:31, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:

I took v3[1] patch from this thread and re-based on commit head(5fedf7417b69295294b154a21).

Please find the attached patch for review.

Cool! There is a Commitfest starting soon; as the previous entry was closed,
you need to open a new one for this patch at:

https://commitfest.postgresql.org/35/

--
Daniel Gustafsson https://vmware.com/

#22Dilip Kumar
dilipbalaut@gmail.com
In reply to: Mahendra Singh Thalor (#20)
Re: Replication & recovery_min_apply_delay

On Wed, Oct 27, 2021 at 1:01 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

On Mon, 3 Aug 2020 at 15:02, Daniel Gustafsson <daniel@yesql.se> wrote:

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

Hi all,
I took v3[1] patch from this thread and re-based on commit head(5fedf7417b69295294b154a21).

Please find the attached patch for review.

I have done an initial review of the patch and have a few comments; I
will do a more detailed review later this week.

1.
The commit message says that the two options are 'consistency' and
'replay', whereas inside the patch the options are 'consistency' and
'catchup'. Please change the commit message so that it matches what the
code is actually doing.

2.
+static int
+XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)

Add comments about what this function is doing.

3.
Please run pgindent on your code and fix all coding guideline violations.

4.
The XLogReadFirstPage() function returns a size, but in practice it
either returns -1 or XLOG_BLCKSZ. Logically this function should just
return a boolean: true if it is able to read the first page, false
otherwise (see the sketch below).
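
For 4), a possible shape (a sketch only, with the error paths condensed
from the v4 patch and an illustrative message text):

static bool
XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)
{
	int			fd;
	XLogSegNo	segno;

	XLByteToSeg(targetPagePtr, segno, wal_segment_size);
	fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
	if (fd == -1)
		return false;			/* segment file does not exist */

	if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
	{
		close(fd);
		elog(ERROR, "could not read first page of segment %lu: %m", segno);
	}

	close(fd);
	return true;				/* first page read in full */
}

The caller in GetLastLSN would then simply loop on
while (XLogReadFirstPage(lsn, readBuf)).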

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#23Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Mahendra Singh Thalor (#20)
Re: Replication & recovery_min_apply_delay

On Wed, Oct 27, 2021 at 1:02 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

On Mon, 3 Aug 2020 at 15:02, Daniel Gustafsson <daniel@yesql.se> wrote:

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

Hi all,
I took v3[1] patch from this thread and re-based on commit head(5fedf7417b69295294b154a21).

Please find the attached patch for review.

Thanks for reviving this patch. Here are some comments:

1) I don't think we need lseek() to the beginning of the file right
after XLogFileReadAnyTLI as the open() sys call will ensure that the
fd is set to the beginning of the file.
+ fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+ if (fd == -1)
+ return -1;
+
+ /* Seek to the beginning, we want to check if the first page is valid */
+ if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+ {
+ XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);

2) How about saying "starting WAL receiver while redo in progress,
startpoint %X/%X" or something like this? The following message looks
like we are starting the WAL receiver for the first time, so this would
differentiate it from the redo case.
+ elog(LOG, "starting WAL receiver, startpoint %X/%X",

3) Although WAL_RCV_START_AT_CATCHUP has been defined and used as the
default for the wal_receiver_start_condition GUC, are we starting the
WAL receiver when it is set? We are doing it only for
WAL_RCV_START_AT_CONSISTENCY. If we were to start the WAL receiver
even for WAL_RCV_START_AT_CATCHUP, let's have the LOG message (2)
clearly say this.

4) I think the default value for wal_receiver_start_condition GUC can
be WAL_RCV_START_AT_CONSISTENCY as it starts streaming once it reaches
a consistent state.

5) Should it be StandbyMode instead of StandbyModeRequested? I'm not
sure if it does make a difference.
+ if (StandbyModeRequested &&

6) Documentation missing for the new GUC.

7) Should we just collect the output of the read() and use it in the
if condition? It will be easier for debugging to know how many bytes
the read() returns.
+ if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)

8) I think we should be using the new-style ereport(LOG, ...) rather than elog(LOG, ...).

9) Can't we place this within an inline function for better
readability and reusability if needed? (A sketch follows after this list.)
+ if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+ (longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+ ((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+ (longhdr->xlp_sysid != ControlFile->system_identifier) ||
+ (longhdr->xlp_seg_size != wal_segment_size) ||
+ (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+ (longhdr->std.xlp_pageaddr != lsn) ||
+ (longhdr->std.xlp_tli != ThisTimeLineID))
+ {

10) I don't think we need all the logs to be elog(LOG, which might
blow up the server logs. Try to have a one single message with LOG
that sort of gives information like whether the wal receiver is
started or not, the startpoint, the last valid segment number and so
on. The detailed message can be at DEBUG1 level.

11) I think you should use LSN_FORMAT_ARGS instead of
+ (uint32) (lsn >> 32), (uint32) lsn);
+ (uint32) (startpoint >> 32), (uint32) startpoint);
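
For 9) and 11), a possible shape (a sketch only; IsValidFirstPageHeader
is just a placeholder name wrapping exactly the checks the v4 patch does
inline):

static inline bool
IsValidFirstPageHeader(XLogLongPageHeader longhdr, XLogRecPtr pageaddr)
{
	/* the first page of a segment must carry a valid long header */
	return (longhdr->std.xlp_info & XLP_LONG_HEADER) != 0 &&
		longhdr->std.xlp_magic == XLOG_PAGE_MAGIC &&
		(longhdr->std.xlp_info & ~XLP_ALL_FLAGS) == 0 &&
		longhdr->xlp_sysid == ControlFile->system_identifier &&
		longhdr->xlp_seg_size == wal_segment_size &&
		longhdr->xlp_xlog_blcksz == XLOG_BLCKSZ &&
		longhdr->std.xlp_pageaddr == pageaddr &&
		longhdr->std.xlp_tli == ThisTimeLineID;
}

/* and for 11), in StartupXLOG: */
elog(LOG, "starting WAL receiver, startpoint %X/%X",
	 LSN_FORMAT_ARGS(startpoint));

GetLastLSN's loop body would then reduce to a single
if (!IsValidFirstPageHeader((XLogLongPageHeader) readBuf, lsn)) break;.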

Regards,
Bharath Rupireddy.

#24Mahendra Singh Thalor
mahi6run@gmail.com
In reply to: Bharath Rupireddy (#23)
Re: Replication & recovery_min_apply_delay

Thanks Dilip and Bharath for the review.

I am working on review comments and will post an updated patch.

On Wed, 10 Nov 2021 at 15:31, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Wed, Oct 27, 2021 at 1:02 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

On Mon, 3 Aug 2020 at 15:02, Daniel Gustafsson <daniel@yesql.se> wrote:

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback. Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

Hi all,
I took v3[1] patch from this thread and re-based on commit head(5fedf7417b69295294b154a21).

Please find the attached patch for review.

Thanks for reviving this patch. Here are some comments:

1) I don't think we need lseek() to the beginning of the file right
after XLogFileReadAnyTLI as the open() sys call will ensure that the
fd is set to the beginning of the file.
+ fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+ if (fd == -1)
+ return -1;
+
+ /* Seek to the beginning, we want to check if the first page is valid */
+ if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+ {
+ XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);

2) How about saying "starting WAL receiver while redo in progress,
startpoint %X/%X"," something like this? Because the following message
looks like we are starting the WAL receiver for the first time, just
to differentiate this with the redo case.
+ elog(LOG, "starting WAL receiver, startpoint %X/%X",

3) Although, WAL_RCV_START_AT_CATCHUP has been defined and used as
default for wal_receiver_start_condition GUC, are we starting wal
receiver when this is set? We are doing it only when
WAL_RCV_START_AT_CONSISTENCY. If we were to start the WAL receiver
even for WAL_RCV_START_AT_CATCHUP, let's have the LOG message (2)
clearly say this.

4) I think the default value for wal_receiver_start_condition GUC can
be WAL_RCV_START_AT_CONSISTENCY as it starts streaming once it reaches
a consistent state.

5) Should it be StandbyMode instead of StandbyModeRequested? I'm not
sure if it does make a difference.
+ if (StandbyModeRequested &&

6) Documentation missing for the new GUC.

7) Should we just collect the output of the read() and use it in the
if condition? It will be easier for debugging to know how many bytes
the read() returns.
+ if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)

8) I think we should be using new style ereport(LOG, than elog(LOG,

9) Can't we place this within an inline function for better
readability and reusability if needed?
+ if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+ (longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+ ((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+ (longhdr->xlp_sysid != ControlFile->system_identifier) ||
+ (longhdr->xlp_seg_size != wal_segment_size) ||
+ (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+ (longhdr->std.xlp_pageaddr != lsn) ||
+ (longhdr->std.xlp_tli != ThisTimeLineID))
+ {

10) I don't think we need all the logs to be elog(LOG, which might
blow up the server logs. Try to have a one single message with LOG
that sort of gives information like whether the wal receiver is
started or not, the startpoint, the last valid segment number and so
on. The detailed message can be at DEBUG1 level.

11) I think you should use LSN_FORMAT_ARGS instead of
+ (uint32) (lsn >> 32), (uint32) lsn);
+ (uint32) (startpoint >> 32), (uint32) startpoint);

Regards,
Bharath Rupireddy.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Mahendra Singh Thalor (#20)
Re: Replication & recovery_min_apply_delay

On Wed, Oct 27, 2021 at 1:01 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

I was going through the patch and trying to understand what we are
trying to achieve here. Generally we start the WAL receiver only when we
want to fetch WAL, but if the user has set the start condition to
WAL_RCV_START_AT_CONSISTENCY then we start fetching WAL with the
walreceiver in advance. IMHO the benefit of this idea is that the GUC
lets us control whether the extra WAL is collected at the primary or at
the standby node, right?

One problem is that if currentSource is XLOG_FROM_ARCHIVE then we might
fetch WAL both via the walreceiver and from the archive, right? Because
we are not changing currentSource. Is this intentional, or do we want to
change currentSource as well?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#25)
Re: Replication & recovery_min_apply_delay

On Sun, Nov 14, 2021 at 11:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Oct 27, 2021 at 1:01 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

I was going through and patch and trying to understand the idea that
what we are trying to achieve here. So basically, generally we start
the WAL receiver when we want to fetch the WAL, but if the user has
set the parameter WAL_RCV_START_AT_CONSISTENCY then we will start
fetching the WAL using walreciver in advance. IMHO the benefit of
this idea is that with the GUC we can control whether the extra WAL
will be collected at the primary or at the standby node right?

One problem is that if the currentsource is XLOG_FROM_ARCHIVE then we
might fetch the WAL using both walreceiver as well as from archive
right? because we are not changing the currentsource. Is this
intentional or do we want to change the currentsource as well?

Is there any relation between this patch and another one at [1]?

[1]: /messages/by-id/9AA68455-368B-484A-8A53-3C3000187A24@yesql.se

Regards,
Bharath Rupireddy.

#27Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#26)
Re: Replication & recovery_min_apply_delay

On Mon, Nov 22, 2021 at 4:25 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sun, Nov 14, 2021 at 11:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Oct 27, 2021 at 1:01 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:

I was going through and patch and trying to understand the idea that
what we are trying to achieve here. So basically, generally we start
the WAL receiver when we want to fetch the WAL, but if the user has
set the parameter WAL_RCV_START_AT_CONSISTENCY then we will start
fetching the WAL using walreciver in advance. IMHO the benefit of
this idea is that with the GUC we can control whether the extra WAL
will be collected at the primary or at the standby node right?

One problem is that if the currentsource is XLOG_FROM_ARCHIVE then we
might fetch the WAL using both walreceiver as well as from archive
right? because we are not changing the currentsource. Is this
intentional or do we want to change the currentsource as well?

Is there any relation between this patch and another one at [1]?

[1] /messages/by-id/9AA68455-368B-484A-8A53-3C3000187A24@yesql.se

Seems like both of them are trying to solve the same issue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com