Switching XLog source from archive to streaming when primary available
Hi Hackers,
When the standby can't connect to the primary, it switches the XLog
source from streaming to archive and stays in that state for as long as it
can get WAL from the archive location. On a server with high WAL activity,
getting WAL from the archive is typically slower than streaming it from
the primary, so the standby may never get out of that state. This not only
increases the lag on the standby but also adversely impacts the primary,
as WAL accumulates and vacuum cannot remove the dead tuples. As a
mitigation, DBAs can remove/advance the slot or remove the
restore_command on the standby, but that is manual work I am trying to
avoid. I would like to propose the following; please let me know your
thoughts.
- Automatically attempt to switch the source from archive to streaming
when primary_conninfo is set, after replaying 'N' WAL segments, governed
by the GUC retry_primary_conn_after_wal_segments
- When retry_primary_conn_after_wal_segments is set to -1, the
feature is disabled
- When the retry attempt fails, switch back to the archive
Thanks,
Satya
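To make the segment-count idea above concrete, here is a minimal standalone C sketch. Only the GUC name retry_primary_conn_after_wal_segments comes from the proposal; the counter, the function name and the call site are invented purely for illustration and exist nowhere in PostgreSQL.

#include <stdbool.h>

static int retry_primary_conn_after_wal_segments = -1; /* proposed GUC; -1 disables */
static int segments_replayed_from_archive = 0;         /* hypothetical counter */

/*
 * Hypothetically called after each WAL segment restored from the archive has
 * been replayed while primary_conninfo is set.  Returns true when the standby
 * should retry streaming from the primary; on a failed retry the caller would
 * switch back to the archive, as proposed.
 */
static bool
ShouldRetryStreaming(void)
{
    if (retry_primary_conn_after_wal_segments < 0)
        return false;               /* feature disabled */

    if (++segments_replayed_from_archive >= retry_primary_conn_after_wal_segments)
    {
        segments_replayed_from_archive = 0; /* start counting afresh */
        return true;                /* try streaming; fall back to archive if it fails */
    }

    return false;
}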
On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.
Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive
I think there is another thread [1] that is logically trying to solve
a similar issue: basically, in the main recovery apply loop, if the
walreceiver does not exist then it launches the walreceiver.
However, that patch does not change the current XLog source, and
I think that is not a good idea because the standby would then restore
from the archive as well as stream from the primary, so I have given that
review comment on that thread as well. One big difference is that the
patch launches the walreceiver even if the WAL is locally
available and we don't really need more WAL, but that is controlled by
a GUC.
[1]: /messages/by-id/CAKYtNApe05WmeRo92gTePEmhOM4myMpCK_+ceSJtC7-AWLw1qw@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.
Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive
I've gone through the state machine in WaitForWALToBecomeAvailable and
I understand it this way: a failure to receive WAL records from the
primary causes the current source to switch to archive, and the standby
continues to get WAL records from the archive location; unless some
failure occurs there, the current source never switches back to
stream. Given that getting WAL from the archive location causes
delays in production environments, we miss the opportunity to
reconnect to the primary after the previous failed attempt.
So basically, we attempt to switch from archive to streaming
(even though fetching from the archive can succeed) after a certain amount
of time or number of WAL segments. I prefer a timing-based switch to
streaming from archive over one after a number of WAL segments fetched
from the archive. Right now, wal_retrieve_retry_interval is used to wait
before switching to archive after a failed attempt at streaming; IMO,
a similar GUC (that gets set once the source switches from streaming
to archive, and on timeout switches back to streaming) can be used
to switch from archive to streaming after the specified amount of
time.
Thoughts?
Regards,
Bharath Rupireddy.
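As a rough illustration of the timer-based idea above (not the actual patch that follows), here is a minimal standalone C sketch; the names are invented, and plain millisecond arithmetic stands in for the backend's TimestampTz, GetCurrentTimestamp() and TimestampDifferenceExceeds() machinery used in the real code.

#include <stdbool.h>
#include <stdint.h>

static int64_t switched_to_archive_at_ms = 0;      /* set when the source becomes archive */
static int archive_to_stream_interval_ms = 5000;   /* proposed GUC; 0 disables */

/* Record the moment the WAL source switches from streaming to archive. */
static void
NoteSwitchedToArchive(int64_t now_ms)
{
    switched_to_archive_at_ms = now_ms;
}

/*
 * Called while reading from the archive.  Returns true once the configured
 * interval has elapsed, meaning the standby should attempt streaming again;
 * if that attempt fails, it falls back to the archive and the timer restarts.
 */
static bool
TimeToRetryStreaming(int64_t now_ms)
{
    if (archive_to_stream_interval_ms <= 0)
        return false;                           /* feature disabled */

    if (now_ms - switched_to_archive_at_ms >= archive_to_stream_interval_ms)
    {
        switched_to_archive_at_ms = now_ms;     /* restart the timer */
        return true;
    }

    return false;
}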
On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
Hi Hackers,
When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me know your thoughts.
Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
when retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
When the retry attempt fails, then switch back to the archive
I've gone through the state machine in WaitForWALToBecomeAvailable and
I understand it this way: failed to receive WAL records from the
primary causes the current source to switch to archive and the standby
continues to get WAL records from archive location unless some failure
occurs there the current source is never going to switch back to
stream. Given the fact that getting WAL from archive location causes
delay in production environments, we miss to take the advantage of the
reconnection to primary after previous failed attempt.
So basically, we try to attempt to switch to streaming from archive
(even though fetching from archive can succeed) after a certain amount
of time or WAL segments. I prefer timing-based switch to streaming
from archive instead of after a number of WAL segments fetched from
archive. Right now, wal_retrieve_retry_interval is being used to wait
before switching to archive after failed attempt from streaming, IMO,
a similar GUC (that gets set once the source switched from streaming
to archive and on timeout it switches to streaming again) can be used
to switch from archive to streaming after the specified amount of
time.
Thoughts?
Here's a v1 patch that I've come up with. For now I'm using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode, as opposed to switching only after a failure to get WAL
in archive mode. If the approach looks okay, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.
Regards,
Bharath Rupireddy.
Attachments:
v1-0001-Switch-to-stream-mode-from-archive-occasionally.patch (application/octet-stream)
From 756d28ab2284e6eb8c057dae45280a38212cec48 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 24 May 2022 16:05:10 +0000
Subject: [PATCH v1] Switch to stream mode from archive occasionally
This patch enables standby to switch to stream mode i.e. get
WAL from primary even before it fails to receive from archive
location. Currently, if receive from archive location fails, it
switches back to stream mode and by then WAL receiver could have
come up. Since fetching WAL from archive isn't always cheap,
switch to stream mode occasionally.
Right now, this switching is based on wal_retrieve_retry_interval
i.e. if the standby has received the WAL from archive for
wal_retrieve_retry_interval after the last failed attempt from
the stream, it switches back to stream mode to see if it can
receive from primary, if yes it's a good bet otherwise falls back
to archive.
---
src/backend/access/transam/xlogrecovery.c | 53 +++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6eba626420..3746e24a18 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -241,6 +241,7 @@ static XLogSource readSource = XLOG_FROM_ANY;
* walreceiver restart. This is only valid in XLOG_FROM_STREAM state.
*/
static XLogSource currentSource = XLOG_FROM_ANY;
+static XLogSource lastSource = XLOG_FROM_ANY;
static bool lastSourceFailed = false;
static bool pendingWalRcvRestart = false;
@@ -3065,6 +3066,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
* so that we will check the archive next.
*/
lastSourceFailed = false;
+ lastSource = currentSource;
currentSource = XLOG_FROM_ANY;
continue;
@@ -3375,11 +3377,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*-------
*/
if (!InArchiveRecovery)
+ {
+ lastSource = currentSource;
currentSource = XLOG_FROM_PG_WAL;
+ }
else if (currentSource == XLOG_FROM_ANY ||
(!StandbyMode && currentSource == XLOG_FROM_STREAM))
{
lastSourceFailed = false;
+ lastSource = currentSource;
currentSource = XLOG_FROM_ARCHIVE;
}
@@ -3432,7 +3438,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Move to XLOG_FROM_STREAM state, and set to start a
* walreceiver if necessary.
*/
+ lastSource = currentSource;
currentSource = XLOG_FROM_STREAM;
+ /*
+ * XXX: we might have to see if the WAL receiver is already
+ * running before even we just go ahead and start it
+ * relying only on startWalReceiver flag. The WAL receiver
+ * could've come up by then, if yes, there can be multiple
+ * WAL receivers???
+ */
startWalReceiver = true;
break;
@@ -3475,6 +3489,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
if (rescanLatestTimeLine(replayTLI, replayLSN))
{
+ lastSource = currentSource;
currentSource = XLOG_FROM_ARCHIVE;
break;
}
@@ -3512,6 +3527,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
HandleStartupProcInterrupts();
}
last_fail_time = now;
+ lastSource = currentSource;
currentSource = XLOG_FROM_ARCHIVE;
break;
@@ -3527,7 +3543,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* from the archive first.
*/
if (InArchiveRecovery)
+ {
+ lastSource = currentSource;
currentSource = XLOG_FROM_ARCHIVE;
+ }
}
if (currentSource != oldSource)
@@ -3562,6 +3581,40 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * Try to stream WAL from primary after a specified amount of
+ * time fetching from archive. This is because fetching WAL
+ * from archive isn't always cheaper and the primary could have
+ * come up meanwhile.
+ */
+ if (StandbyMode && lastSource == XLOG_FROM_STREAM)
+ {
+ TimestampTz now;
+
+ now = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(last_fail_time, now,
+ wal_retrieve_retry_interval))
+ {
+ elog(DEBUG2,
+ "switching WAL source from %s to %s after \"wal_retrieve_retry_interval\" %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[lastSource],
+ wal_retrieve_retry_interval);
+
+ currentSource = lastSource;
+ last_fail_time = 0;
+
+ /*
+ * Treat this as a failure to read from current source,
+ * even though it is actually not, so that the state
+ * machine moves to read it from XLOG_FROM_STREAM.
+ */
+ lastSourceFailed = true;
+ break;
+ }
+ }
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
--
2.25.1
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
Hello
I tested this patch in a setup where the standby is in the middle of replicating and REDOing primary's WAL files during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure and force the standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval" seconds before it attempts to stream again. Without this patch, once the streaming is interrupted, it keeps reading from archive until the standby reaches the same consistent state as the primary, and then it will switch back to streaming again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job, where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the recovery job is not a large one, PG is already switching back to streaming once it hits the consistent state.
thank you
Cary Huang
HighGo Software Canada
On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
Hello
I tested this patch in a setup where the standby is in the middle of replicating and REDOing primary's WAL files during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure and force standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval" seconds before it attempts to steam again. Without this patch, once the streaming is interrupted, it keeps reading from archive until standby reaches the same consistent state of primary and then it will switch back to streaming again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the recovery job is not a large one, PG is already switching back to streaming once it hits consistent state.
Thanks a lot Cary for testing the patch.
Here's a v1 patch that I've come up with. I'm right now using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode as opposed to switching only after the failure to get WAL
from archive mode. If okay with the approach, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.
It will be great if I can hear some thoughts on the above points (as
posted upthread).
Regards,
Bharath Rupireddy.
On Fri, Jul 8, 2022 at 9:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
Hello
I tested this patch in a setup where the standby is in the middle of replicating and REDOing primary's WAL files during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure and force standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval" seconds before it attempts to steam again. Without this patch, once the streaming is interrupted, it keeps reading from archive until standby reaches the same consistent state of primary and then it will switch back to streaming again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the recovery job is not a large one, PG is already switching back to streaming once it hits consistent state.
Thanks a lot Cary for testing the patch.
Here's a v1 patch that I've come up with. I'm right now using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode as opposed to switching only after the failure to get WAL
from archive mode. If okay with the approach, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.
It will be great if I can hear some thoughts on the above points (as
posted upthread).
Here's the v2 patch with a separate GUC; a new GUC was necessary as the
existing GUC wal_retrieve_retry_interval already serves multiple
purposes. When the feature is enabled, it lets the standby switch to
stream mode, i.e. fetch WAL from the primary, even before fetching from
the archive fails. The switch from archive to stream mode happens in 2
scenarios: 1) when the standby is in initial recovery, 2) when there was a
failure in receiving from the primary (walreceiver got killed or crashed
or timed out, or connectivity to the primary was broken - for whatever
reason).
I also added test cases to the v2 patch.
Please review the patch.
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Attachments:
v2-0001-Switch-WAL-source-to-stream-from-archive.patch (application/octet-stream)
From 2147194165ebc66232b6e7ff9e3a8b3d07d50a72 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 11 Aug 2022 15:35:20 +0000
Subject: [PATCH v2] Switch WAL source to stream from archive
This patch enables standby to switch to stream mode i.e. get
WAL from primary even before it fails to receive from archive
location. Currently, the standby switches to stream mode, only
when receive from archive location fails.
The standby makes an attempt to read WAL from primary after
wal_retrieve_retry_interval milliseconds reading from archive.
Reading WAL from archive may not always be efficient and cheaper
because network latencies, disk IO cost might differ on the archive
as compared to the primary and often the archive may sit far from
the standbys - all adding to the recovery performance on the
standbys.
Hence reading WAL from primary as opposed to archive enables
standbys to catch up with the primary sooner thus reducing
replication lag and avoiding WAL files accumulation on the primary.
This can benefit in any of the following situations:
1) standby in initial recovery after start/restart.
2) standby stopped streaming from primary because of connectivity
issues with the primary (either due to network issues or crash in
the primary or something else) or walreceiver got killed or
crashed for whatever reasons.
---
doc/src/sgml/config.sgml | 31 +++++
src/backend/access/transam/xlogrecovery.c | 96 ++++++++++++-
src/backend/utils/misc/guc.c | 11 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 265 insertions(+), 4 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2522f4c8c5..d9a6a2ec78 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4837,6 +4837,37 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-source-switch-interval" xreflabel="wal_source_switch_interval">
+ <term><varname>wal_source_switch_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies how long the standby server should wait before switching WAL
+ source from WAL archive to primary (streaming replication). This can
+ happen either during the standby initial recovery or after a previous
+ failed attempt to stream WAL from the primary.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is 5 seconds. A setting of <literal>0</literal>
+ disables the feature.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standbys
+ impacting the recovery performance on the standbys. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standbys to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a59a0e826b..f22af22ba7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -88,6 +88,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int wal_source_switch_interval = 5000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -3397,6 +3398,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool first_time = true;
+ static TimestampTz last_switch_time = 0;
+ bool intentionalSourceSwitch = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3418,6 +3422,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after every wal_source_switch_interval
+ * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3452,8 +3461,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Don't allow any retry loops to occur during nonblocking
* readahead. Let the caller process everything that has been
* decoded already first.
+ *
+ * Continue retrying for requested WAL when there was an
+ * intentional source switch from archive to stream.
*/
- if (nonblocking)
+ if (nonblocking && !intentionalSourceSwitch)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3583,15 +3595,30 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ if (intentionalSourceSwitch)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ wal_source_switch_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ intentionalSourceSwitch = false;
switch (currentSource)
{
@@ -3614,6 +3641,67 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * Make an attempt to read WAL from primary after
+ * wal_retrieve_retry_interval milliseconds reading from
+ * archive.
+ *
+ * Reading WAL from archive may not always be efficient and
+ * cheaper because network latencies, disk IO cost might differ
+ * on the archive as compared to the primary and often the
+ * archive may sit far from the standbys - all adding to the
+ * recovery performance on the standbys.
+ *
+ * Hence reading WAL from primary as opposed to archive enables
+ * standbys to catch up with the primary sooner thus reducing
+ * replication lag and avoiding WAL files accumulation on the
+ * primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of
+ * connectivity issues with the primary (either due to network
+ * issues or crash in the primary or something else) or
+ * walreceiver got killed or crashed for whatever reasons.
+ */
+ if (StandbyMode && currentSource == XLOG_FROM_ARCHIVE)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ /* Assume last_switch_time as curr_time for the first time */
+ if (first_time)
+ last_switch_time = curr_time;
+
+ if (!first_time &&
+ TimestampDifferenceExceeds(last_switch_time, curr_time,
+ wal_source_switch_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ wal_source_switch_interval);
+
+ last_switch_time = curr_time;
+
+ /*
+ * Treat this as a failure to read from archive, even
+ * though it is actually not, so that the state machine
+ * will move on to stream the WAL from primary.
+ */
+ lastSourceFailed = true;
+ intentionalSourceSwitch = true;
+
+ break;
+ }
+
+ /* We're not here for the first time any more */
+ if (first_time)
+ first_time = false;
+ }
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5db5df6285..ce50e6596d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3324,6 +3324,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &wal_source_switch_interval,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502c..ec70a76a11 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#wal_source_switch_interval = 5s # time to wait before switching WAL
+ # source from archive to primary
+ # 0 disables the feature, > 0 indicates the
+ # interval in milliseconds.
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..a4f8e9c804 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int wal_source_switch_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..b5579e745d
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout=1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout=1h
+wal_source_switch_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# The standby now connects to primary during initial recovery after
+# fetching WAL from archive for about wal_source_switch_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
+ <indexterm>
+ <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
+ </indexterm>
I don't want to bikeshed on the name too much, but I do think we need
something more descriptive. I'm thinking of something like
streaming_replication_attempt_interval or
streaming_replication_retry_interval.
+ Specifies how long the standby server should wait before switching WAL
+ source from WAL archive to primary (streaming replication). This can
+ happen either during the standby initial recovery or after a previous
+ failed attempt to stream WAL from the primary.
I'm not sure what the second sentence means. In general, I think the
explanation in your commit message is much clearer:
The standby makes an attempt to read WAL from primary after
wal_retrieve_retry_interval milliseconds reading from archive.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is 5 seconds. A setting of <literal>0</literal>
+ disables the feature.
5 seconds seems low. I would expect the default to be 1-5 minutes. I
think it's important to strike a balance between interrupting archive
recovery to attempt streaming replication and letting archive recovery make
progress.
+ * Try reading WAL from primary after every wal_source_switch_interval
+ * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.
+ if (!first_time &&
+ TimestampDifferenceExceeds(last_switch_time, curr_time,
+ wal_source_switch_interval))
Shouldn't this also check that wal_source_switch_interval is not set to 0?
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ wal_source_switch_interval);
+
+ last_switch_time = curr_time;
Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.
+ {
+ {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &wal_source_switch_interval,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least one attempt to restore
the file from archive before switching to streaming replication.
[0]: https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
+ <indexterm> + <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary> + </indexterm>
I don't want to bikeshed on the name too much, but I do think we need
something more descriptive. I'm thinking of something like
streaming_replication_attempt_interval or
streaming_replication_retry_interval.
I could come up with wal_source_switch_interval after a lot of
bikeshedding myself :). However, streaming_replication_retry_interval
looks much better, I've used it in the latest patch. Thanks.
+ Specifies how long the standby server should wait before switching WAL + source from WAL archive to primary (streaming replication). This can + happen either during the standby initial recovery or after a previous + failed attempt to stream WAL from the primary.
I'm not sure what the second sentence means. In general, I think the
explanation in your commit message is much clearer:
I polished the comments, docs and commit message a bit, please check now.
5 seconds seems low. I would expect the default to be 1-5 minutes. I
think it's important to strike a balance between interrupting archive
recovery to attempt streaming replication and letting archive recovery make
progress.
+1 for a default value of 5 minutes, to avoid frequently interrupting
archive mode when the primary is really down for a long time. I've
also added a cautionary note in the docs about lower values.
+ * Try reading WAL from primary after every wal_source_switch_interval + * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If + * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise + * it falls back to XLOG_FROM_ARCHIVE state.
It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.
[0]: https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION
It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
standby mode when restore_command is specified, the initial value of
currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
exhausted of WAL or the standby fails to fetch from the archive, then
it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
unless the standby gets promoted. With the patch, we enable the
standby to try fetching from the primary, instead of waiting for WAL
in the archive to get exhausted or for an error to occur in the
standby while receiving from the archive.
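A compressed, standalone C sketch of the loop just described, with and without the proposed behaviour; the enum and function names are invented for illustration only, and the real state machine lives in WaitForWALToBecomeAvailable().

#include <stdbool.h>

typedef enum
{
    SRC_ARCHIVE,
    SRC_STREAM
} WalSourceSketch;

/*
 * Without the patch, the source only changes when the current one fails (or
 * the archive runs out of WAL).  With the patch, while in SRC_ARCHIVE the
 * standby may also switch voluntarily once the retry interval has elapsed.
 */
static WalSourceSketch
NextSource(WalSourceSketch cur, bool current_source_failed, bool retry_interval_elapsed)
{
    if (cur == SRC_ARCHIVE && (current_source_failed || retry_interval_elapsed))
        return SRC_STREAM;

    if (cur == SRC_STREAM && current_source_failed)
        return SRC_ARCHIVE;

    return cur;
}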
+ if (!first_time && + TimestampDifferenceExceeds(last_switch_time, curr_time, + wal_source_switch_interval))
Shouldn't this also check that wal_source_switch_interval is not set to 0?
Corrected.
+ elog(DEBUG2, + "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds", + xlogSourceNames[XLOG_FROM_STREAM], + xlogSourceNames[currentSource], + wal_source_switch_interval); + + last_switch_time = curr_time;
Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.
last_switch_time indicates the time when the standby last attempted to
switch to primary. For instance, a standby:
1) for the first time, sets last_switch_time = current_time when in archive mode
2) if current_time < last_switch_time + interval, continues to be in
archive mode
3) if current_time >= last_switch_time + interval, attempts to switch
to primary and sets last_switch_time = current_time
3.1) if it successfully switches to the primary, it continues streaming
from there; if it later fails to fetch from the primary for any reason, it
enters archive mode and loops from step (2)
3.2) if it fails to switch to the primary, it stays in archive mode and
loops from step (2)
Hope this clarifies the behaviour.
+ { + {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY, + gettext_noop("Sets the time to wait before switching WAL source from archive to primary"), + gettext_noop("0 turns this feature off."), + GUC_UNIT_MS + }, + &wal_source_switch_interval, + 5000, 0, INT_MAX, + NULL, NULL, NULL + },I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least once attempt to restore
the file from archive before switching to streaming replication.
No, we need a way to disable the feature, so I'm not changing the
lower bound. And let's not make this GUC dependent on any other GUC; I
would like to keep it simple for better usability. However, I've
increased the default value to 5min and added a note in the docs about
lower values.
I'm attaching the v3 patch with the review comments addressed, please
review it further.
[1]:
if (!InArchiveRecovery)
currentSource = XLOG_FROM_PG_WAL;
else if (currentSource == XLOG_FROM_ANY ||
(!StandbyMode && currentSource == XLOG_FROM_STREAM))
{
lastSourceFailed = false;
currentSource = XLOG_FROM_ARCHIVE;
}
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (application/octet-stream)
From c9c7601707b97ebcf029cfdc2db69593b7f5f4a0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 8 Sep 2022 10:07:34 +0000
Subject: [PATCH v3] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-By: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-By: Cary Huang, Nathan Bossart
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 36 +++++
src/backend/access/transam/xlogrecovery.c | 95 ++++++++++++-
src/backend/utils/misc/guc.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 270 insertions(+), 4 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c7..278ecd54c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,42 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2af5ae3d..142fd19e7f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -88,6 +88,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -3411,6 +3412,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool first_time = true;
+ static TimestampTz last_switch_time = 0;
+ bool intentionalSourceSwitch = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3432,6 +3436,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after every
+ * streaming_replication_retry_interval milliseconds, when state machine is
+ * in XLOG_FROM_ARCHIVE state. If successful, the state machine moves to
+ * XLOG_FROM_STREAM state, otherwise it falls back to XLOG_FROM_ARCHIVE
+ * state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3466,8 +3476,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Don't allow any retry loops to occur during nonblocking
* readahead. Let the caller process everything that has been
* decoded already first.
+ *
+ * Continue retrying for requested WAL when there was an
+ * intentional source switch from archive to stream.
*/
- if (nonblocking)
+ if (nonblocking && !intentionalSourceSwitch)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3597,15 +3610,30 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ if (intentionalSourceSwitch)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ intentionalSourceSwitch = false;
switch (currentSource)
{
@@ -3628,6 +3656,65 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * Make an attempt to read WAL from primary after reading from
+ * archive for streaming_replication_retry_interval
+ * milliseconds. Reading WAL from the archive may not always be
+ * efficient and cheaper because network latencies, disk IO
+ * cost might differ on the archive as compared to the primary
+ * and often the archive may sit far from the standby - all
+ * adding to recovery performance on the standby. Hence reading
+ * WAL from the primary as opposed to the archive enables the
+ * standby to catch up with the primary sooner thus reducing
+ * replication lag and avoiding WAL files accumulation on the
+ * primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of
+ * connectivity issues with the primary (either due to network
+ * issues or crash in the primary or something else) or
+ * walreceiver got killed or crashed for whatever reasons.
+ */
+ if (streaming_replication_retry_interval > 0 &&
+ StandbyMode &&
+ currentSource == XLOG_FROM_ARCHIVE)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ /* Assume last_switch_time as curr_time for the first time */
+ if (first_time)
+ last_switch_time = curr_time;
+
+ if (!first_time &&
+ TimestampDifferenceExceeds(last_switch_time, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ last_switch_time = curr_time;
+
+ /*
+ * Treat this as a failure to read from archive, even
+ * though it is actually not, so that the state machine
+ * will move on to stream the WAL from primary.
+ */
+ lastSourceFailed = true;
+ intentionalSourceSwitch = true;
+ break;
+ }
+
+ /* We're not here for the first time any more */
+ if (first_time)
+ first_time = false;
+ }
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 55bf998511..b5f0575fdc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3321,6 +3321,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication"),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502c..6e083c72da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..2d5c815246 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..03c92af753
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery. As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication. Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)? I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.
It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
standby mode when restore_command is specified, the initial value of
currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
exhausted of WAL or the standby fails to fetch from the archive, then
it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
unless the standby gets promoted. With the patch, we enable the
standby to try fetching from the primary, instead of waiting for WAL
in the archive to get exhausted or for an error to occur in the
standby while receiving from the archive.
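For context, here is a minimal C-style sketch of that cycling; restore_from_archive() and receive_from_primary() are illustrative placeholders, not real functions:
/* simplified; the real WaitForWALToBecomeAvailable() also reads from pg_wal */
for (;;)
{
    if (currentSource == XLOG_FROM_ARCHIVE)
    {
        if (restore_from_archive())         /* runs restore_command */
            continue;                       /* got a segment, keep replaying */
        currentSource = XLOG_FROM_STREAM;   /* archive exhausted or failed */
    }
    else                                    /* XLOG_FROM_STREAM */
    {
        if (receive_from_primary())         /* walreceiver streams WAL */
            continue;
        currentSource = XLOG_FROM_ARCHIVE;  /* streaming failed, fall back */
    }
}
With the patch, the first branch additionally checks whether the configured interval has elapsed and, if so, moves to XLOG_FROM_STREAM without waiting for the archive to be exhausted or to fail.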
Okay. I see that you are checking for XLOG_FROM_ARCHIVE.
Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE? IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch. This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again. Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources. Perhaps I am
misunderstanding how this is meant to work.
last_switch_time indicates the time when the standby last attempted to
switch to primary. For instance, a standby (a rough C sketch follows the list):
1) for the first time, sets last_switch_time = current_time when in archive mode
2) if current_time < last_switch_time + interval, continues to be in
archive mode
3) if current_time >= last_switch_time + interval, attempts to switch
to primary and sets last_switch_time = current_time
3.1) if successfully switches to primary, continues in there and for
any reason fails to fetch from primary, then enters archive mode and
loops from step (2)
3.2) if fails to switch to primary, then enters archive mode and loops
from step (2)
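A minimal C sketch of steps (1) through (3) above, using the existing timestamp helpers; the function name and static variable here are illustrative, not necessarily what the patch uses:
static TimestampTz last_switch_time = 0;    /* 0 = not yet set */

static bool
ShouldRetryStreaming(void)
{
    TimestampTz now = GetCurrentTimestamp();

    if (last_switch_time == 0)
    {
        last_switch_time = now;             /* step 1: first time in archive mode */
        return false;
    }

    if (!TimestampDifferenceExceeds(last_switch_time, now,
                                    streaming_replication_retry_interval))
        return false;                       /* step 2: stay in archive mode */

    last_switch_time = now;                 /* step 3: attempt switch to primary */
    return true;
}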
Let's say I have this new parameter set to 5 minutes, and I have a standby
that's been at step 3.1 for 5 days before failing and going back to step 2.
Won't the standby immediately jump back to step 3.1? I think we should
place the limit on how long the server stays in XLOG_FROM_ARCHIVE, not how
long it's been since we last tried XLOG_FROM_STREAM.
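To make the contrast concrete, the two conditions would look roughly like this, where retry_interval stands for the new GUC and entered_archive_at is a hypothetical timestamp reset each time the state machine enters XLOG_FROM_ARCHIVE:
/* check under review: time since the standby last attempted streaming */
if (TimestampDifferenceExceeds(last_switch_time, now, retry_interval))
    /* switch to XLOG_FROM_STREAM */ ;

/* suggestion: time actually spent fetching from the archive */
if (TimestampDifferenceExceeds(entered_archive_at, now, retry_interval))
    /* switch to XLOG_FROM_STREAM */ ;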
I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources. I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval. Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too. Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least one attempt to restore
the file from archive before switching to streaming replication.
No, we need a way to disable the feature, so I'm not changing the
lower bound. And let's not make this GUC dependent on any other GUC, I
would like to keep it simple for better usability. However, I've
increased the default value to 5min and added a note in the docs about
the lower values.
I'm attaching the v3 patch with the review comments addressed, please
review it further.
My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Being late for the party.
It seems to me that the function is getting too long. I think we
might want to move the core part of the patch into another function.
I think it might be better if intentionalSourceSwitch doesn't need
lastSourceFailed set. It would look like this:
if (lastSourceFailed || switchSource)
{
if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
+ if (first_time)
+ last_switch_time = curr_time;
..
+ if (!first_time &&
+ TimestampDifferenceExceeds(last_switch_time, curr_time,
..
+ /* We're not here for the first time any more */
+ if (first_time)
+ first_time = false;
I don't think the flag first_time is needed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
I'm attaching the v3 patch with the review comments addressed, please
review it further.
My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.
+1.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
I'm attaching the v3 patch with the review comments addressed, please
review it further.
My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.
+1.
Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.
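A sketch of what that counter-based variant could look like, with a hypothetical GUC name and helper:
static int  segments_restored_from_archive = 0;

if (streaming_retry_after_wal_segments > 0 &&
    segments_restored_from_archive >= streaming_retry_after_wal_segments)
{
    segments_restored_from_archive = 0;
    currentSource = XLOG_FROM_STREAM;       /* attempt streaming now */
}
else if (restore_from_archive())
    segments_restored_from_archive++;       /* count files fetched from the archive */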
If okay, I can code on that. Thoughts?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:
On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.
+1.
Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.
IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay. So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:
On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress. As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication. There might be other ways to handle this.
+1.
Hm. In that case, I think we can get rid of timeout based switching
mechanism and have this behaviour - the standby can attempt to switch
to streaming mode from archive, say, after fetching 1, 2 or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.
IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay.
How about the amount of WAL bytes fetched from the archive after which
a standby attempts to connect to primary or enter streaming mode? Of
late, we've changed some GUCs to represent bytes instead of WAL
files/segments, see [1].
So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.
If we have configurable WAL bytes instead of timeout for standby WAL
source switch from archive to primary, we don't have the above problem
right?
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3fe108c025e4a080315562d4c15ecbe3f00405e
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 09, 2022 at 11:07:00PM +0530, Bharath Rupireddy wrote:
On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
IMO the timeout approach would be more intuitive for users. When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure. WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay.
How about the amount of WAL bytes fetched from the archive after which
a standby attempts to connect to primary or enter streaming mode? Of
late, we've changed some GUCs to represent bytes instead of WAL
files/segments, see [1].
Well, for wal_keep_size, using bytes makes sense. Given you know how much
disk space you have, you can set this parameter accordingly to avoid
retaining too much of it for standby servers. For your proposed parameter,
it's not so simple. The same setting could have wildly different timing
behavior depending on the server. I still think that a timeout is the most
intuitive.
So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.
If we have configurable WAL bytes instead of timeout for standby WAL
source switch from archive to primary, we don't have the above problem
right?
If you are going to stop replaying in the middle of a WAL archive, then
maybe. But I don't think I'd recommend that.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Sep 10, 2022 at 3:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Well, for wal_keep_size, using bytes makes sense. Given you know how much
disk space you have, you can set this parameter accordingly to avoid
retaining too much of it for standby servers. For your proposed parameter,
it's not so simple. The same setting could have wildly different timing
behavior depending on the server. I still think that a timeout is the most
intuitive.
Hm. In the v3 patch, I've used the timeout approach, but tracking the
duration the server spends in XLOG_FROM_ARCHIVE as opposed to tracking
the last failed time in streaming from primary.
So I think it
would be difficult for the end user to decide on a value. However, even
the timeout approach has this sort of problem. If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute. That would likely
need to be documented.
Added a note in the docs.
On Fri, Sep 9, 2022 at 10:46 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Being late for the party.
Thanks for reviewing this.
It seems to me that the function is getting too long. I think we
might want to move the core part of the patch into another function.
Yeah, WaitForWALToBecomeAvailable() (without this patch) has around 460
LOC, out of which fetching WAL from the chosen source is about 240 LOC;
IMO, that code is a candidate for a new function. I think that part can
be discussed separately.
Having said that, I moved the new code to a new function.
I think it might be better if intentionalSourceSwitch doesn't need
lastSourceFailed set. It would look like this:
if (lastSourceFailed || switchSource)
{
if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
I think the above looks good, done that way in the latest patch.
I don't think the flag first_time is needed.
Addressed this in the v4 patch.
Please review the attached v4 patch addressing above review comments.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (application/octet-stream)
From 02766cd6bfa533070b8488c6172f62f93ecbb855 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 12 Sep 2022 03:28:23 +0000
Subject: [PATCH v4] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +++++++
src/backend/access/transam/xlogrecovery.c | 124 +++++++++++++++--
src/backend/utils/misc/guc.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 301 insertions(+), 11 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c7..694da93b8c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9a80084a68..4b25cdae0d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -65,6 +65,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -88,6 +94,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -295,6 +302,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -437,6 +449,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3413,6 +3427,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ bool sourceSwitched = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3434,6 +3449,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3457,19 +3477,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || sourceSwitched)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3599,15 +3620,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (sourceSwitched)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ sourceSwitched = false;
switch (currentSource)
{
@@ -3630,6 +3670,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ sourceSwitched = ShouldSwitchWALSourceToPrimary();
+
+ if (sourceSwitched)
+ break;
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -3872,6 +3917,63 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool sourceSwitched;
+
+ if (!StreamingReplRetryEnabled())
+ return false;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ sourceSwitched = true;
+ }
+ else
+ sourceSwitched = false;
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ sourceSwitched = false;
+ }
+
+ return sourceSwitched;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 55bf998511..587cee5bb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3321,6 +3321,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502c..6e083c72da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..2d5c815246 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..d33ca9635c
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Mon, Sep 12, 2022 at 9:03 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Please review the attached v4 patch addressing above review comments.
Oops, there's a compiler warning [1] with the v4 patch, fixed it.
Please review the attached v5 patch.
[1]: https://cirrus-ci.com/task/5730076611313664?logs=gcc_warning#L450
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v5-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (application/x-patch)
From 857415757ec21bd8b0195c17694618a1cbf22a57 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 12 Sep 2022 04:44:35 +0000
Subject: [PATCH v5] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +++++++
src/backend/access/transam/xlogrecovery.c | 124 +++++++++++++++--
src/backend/utils/misc/guc.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 301 insertions(+), 11 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c7..694da93b8c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9a80084a68..35f7985e65 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -65,6 +65,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -88,6 +94,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -295,6 +302,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -437,6 +449,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3413,6 +3427,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3434,6 +3449,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3457,19 +3477,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3599,15 +3620,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ switchSource = false;
switch (currentSource)
{
@@ -3630,6 +3670,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ switchSource = ShouldSwitchWALSourceToPrimary();
+
+ if (switchSource)
+ break;
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -3872,6 +3917,63 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!StreamingReplRetryEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ shouldSwitchSource = true;
+ }
+ else
+ shouldSwitchSource = false;
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ shouldSwitchSource = false;
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 55bf998511..587cee5bb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3321,6 +3321,18 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502c..6e083c72da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..2d5c815246 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..d33ca9635c
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Mon, Sep 12, 2022 at 11:56 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Please review the attached v5 patch.
I'm attaching the v6 patch that's rebased onto the latest HEAD.
Please consider this for review.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch (application/octet-stream)
From b1181f681718a7ca90453a0ad2bc80500983e699 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 15 Sep 2022 04:55:23 +0000
Subject: [PATCH v6] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +++++++
src/backend/access/transam/xlogrecovery.c | 124 +++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 301 insertions(+), 11 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 700914684d..892442a053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 30661bdad6..e737575bf5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -298,6 +305,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3416,6 +3430,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3437,6 +3452,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3460,19 +3480,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3602,15 +3623,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ switchSource = false;
switch (currentSource)
{
@@ -3633,6 +3673,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ switchSource = ShouldSwitchWALSourceToPrimary();
+
+ if (switchSource)
+ break;
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -3875,6 +3920,63 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!StreamingReplRetryEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ shouldSwitchSource = true;
+ }
+ else
+ shouldSwitchSource = false;
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ shouldSwitchSource = false;
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 550e95056c..cbb9cfca51 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3075,6 +3075,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2ae76e5cfb..84e52f3688 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..2d5c815246 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..d33ca9635c
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.
Thanks for the new version!
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
It seems to me a bit too complex..
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
Anyway we are going to open a file just after this so
GetCurrentTimestamp() cannot cause a perceptible degradation.
Couldn't we do that unconditionally, to get rid of the macro?
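Just to illustrate, an untested sketch on top of the v6 patch (the
variable would then record any source switch, so it might deserve a
different name):
-	/* Save the timestamp at which we're switching to archive. */
-	if (StreamingReplRetryEnabled())
-		switched_to_archive_at = GetCurrentTimestamp();
+	/*
+	 * Remember when the source changed; the file open that follows
+	 * dwarfs the cost of one GetCurrentTimestamp() call.
+	 */
+	switched_to_archive_at = GetCurrentTimestamp();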
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 15, 2022 at 1:52 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.
Thanks for the new version!
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
It seems to me a bit too complex..
I don't think so, it just tells whether a standby is allowed to switch
source to stream from archive.
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
Anyway we are going to open a file just after this so
GetCurrentTimestamp() cannot cause a perceptible degradation.
Couldn't we do that unconditionally, to get rid of the macro?
Do we really need to do it unconditionally? I don't think so. And, we
can't get rid of the macro, as we need to check for the current
source, GUC and standby mode. When this feature is disabled, it
mustn't execute any extra code IMO.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
At Fri, 16 Sep 2022 09:15:58 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Thu, Sep 15, 2022 at 1:52 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.
Thanks for the new version!
+#define StreamingReplRetryEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
It seems to me a bit too complex..
In other words, it seems to me that the macro name doesn't manifest
the condition correctly.
I don't think so, it just tells whether a standby is allowed to switch
source to stream from archive.
+ /* Save the timestamp at which we're switching to archive. */
+ if (StreamingReplRetryEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
Anyway we are going to open a file just after this so
GetCurrentTimestamp() cannot cause a perceptible degradation.
Couldn't we do that unconditionally, to get rid of the macro?
Do we really need to do it unconditionally? I don't think so. And, we
can't get rid of the macro, as we need to check for the current
source, GUC and standby mode. When this feature is disabled, it
mustn't execute any extra code IMO.
I don't think we particularly want to do that unconditionally.
I wanted just to get rid of the macro from the usage site. Even if
the same condition is used elsewhere, I see it better to write out the
condition directly there..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
In other words, it seems to me that the macro name doesn't manifest
the condition correctly.
I don't think we particularly want to do that unconditionally.
I wanted just to get rid of the macro from the usage site. Even if
the same condition is used elsewhere, I see it better to write out the
condition directly there..
I wanted to avoid a bit of duplicate code there. How about naming that
macro IsXLOGSourceSwitchToStreamEnabled() or
SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
or any other better name?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 16, 2022 at 4:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
In other words, it seems to me that the macro name doesn't manifest
the condition correctly.
I don't think we particularly want to do that unconditionally.
I wanted just to get rid of the macro from the usage site. Even if
the same condition is used elsewhere, I see it better to write out the
condition directly there..
I wanted to avoid a bit of duplicate code there. How about naming that
macro IsXLOGSourceSwitchToStreamEnabled() or
SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
or any other better name?
SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
attaching the v7 patch with that change. Please review it further.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v7-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch
From b94327d9af60a44f252251be97ee1efabc964f97 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 19 Sep 2022 14:04:49 +0000
Subject: [PATCH v7] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +++++++
src/backend/access/transam/xlogrecovery.c | 124 +++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++++++
6 files changed, 301 insertions(+), 11 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 700914684d..892442a053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4840,6 +4840,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of time after which the standby attempts to switch the WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is unreachable for a long
+ time. To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode only when fetching from
+ the WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b41e682664..afc769cf16 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -298,6 +305,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3416,6 +3430,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
@@ -3437,6 +3452,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3460,19 +3480,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3601,15 +3622,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ {
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ }
+ else
+ {
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ switchSource = false;
switch (currentSource)
{
@@ -3632,6 +3672,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ switchSource = ShouldSwitchWALSourceToPrimary();
+
+ if (switchSource)
+ break;
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -3874,6 +3919,63 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ shouldSwitchSource = true;
+ }
+ else
+ shouldSwitchSource = false;
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ shouldSwitchSource = false;
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 550e95056c..cbb9cfca51 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3075,6 +3075,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2ae76e5cfb..84e52f3688 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0aa85d90e8..2d5c815246 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..d33ca9635c
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Mon, Sep 19, 2022 at 07:49:21PM +0530, Bharath Rupireddy wrote:
SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
attaching the v7 patch with that change. Please review it further.
As I mentioned upthread [0], I'm still a little concerned that this patch
will cause the state machine to go straight from archive recovery to
streaming replication, skipping recovery from pg_wal. I wonder if this
could be resolved by moving the standby to the pg_wal phase instead.
Concretely, this line
+ if (switchSource)
+ break;
would instead change currentSource from XLOG_FROM_ARCHIVE to
XLOG_FROM_PG_WAL before the call to XLogFileReadAnyTLI(). I suspect the
behavior would be basically the same, but it would maintain the existing
ordering.
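For concreteness, an untested sketch of that alternative on top of the
v7 patch (the exact placement may need adjusting):
+	if (switchSource)
+		currentSource = XLOG_FROM_PG_WAL;
+
 	readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 								  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 								  currentSource);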
However, I do see the following note elsewhere in xlogrecovery.c:
* The segment can be fetched via restore_command, or via walreceiver having
* streamed the record, or it can already be present in pg_wal. Checking
* pg_wal is mainly for crash recovery, but it will be polled in standby mode
* too, in case someone copies a new segment directly to pg_wal. That is not
* documented or recommended, though.
Given this information, the present behavior might not be too important,
but I don't see a point in changing it without good reason.
[0]: /messages/by-id/20220906215704.GA2084086@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sun, Oct 9, 2022 at 3:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
As I mentioned upthread [0], I'm still a little concerned that this patch
will cause the state machine to go straight from archive recovery to
streaming replication, skipping recovery from pg_wal.
Yes, it goes straight to streaming replication skipping recovery from
pg_wal with the patch.
I wonder if this
could be resolved by moving the standby to the pg_wal phase instead.
Concretely, this line
+ if (switchSource)
+ break;
would instead change currentSource from XLOG_FROM_ARCHIVE to
XLOG_FROM_PG_WAL before the call to XLogFileReadAnyTLI(). I suspect the
behavior would be basically the same, but it would maintain the existing
ordering.
We can give it a chance to restore from pg_wal before switching to
streaming so as not to change any behaviour of the state machine. But
definitely not by setting currentSource to XLOG_FROM_PG_WAL; we basically
never explicitly set currentSource to XLOG_FROM_PG_WAL, other than when
we are not in archive recovery, i.e. when InArchiveRecovery is false.
Also, see the comment [1].
Instead, the simplest would be to just pass XLOG_FROM_PG_WAL to
XLogFileReadAnyTLI() when we're about to switch the source to stream
mode. This doesn't change the existing behaviour.
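In the attached v8 patch this ends up as the following at the read site,
so pg_wal still gets a chance before we try streaming:
	if (switchSource)
		readFrom = XLOG_FROM_PG_WAL;
	else if (currentSource == XLOG_FROM_ARCHIVE)
		readFrom = XLOG_FROM_ANY;
	else
		readFrom = currentSource;

	readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);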
However, I do see the following note elsewhere in xlogrecovery.c:
* The segment can be fetched via restore_command, or via walreceiver having
* streamed the record, or it can already be present in pg_wal. Checking
* pg_wal is mainly for crash recovery, but it will be polled in standby mode
* too, in case someone copies a new segment directly to pg_wal. That is not
* documented or recommended, though.
Given this information, the present behavior might not be too important,
but I don't see a point in changing it without good reason.
Yeah, with the attached patch we don't skip pg_wal before switching to
streaming mode.
I've also added a note in the 'Standby Server Operation' section about
the new feature.
Please review the v8 patch further.
Unrelated to this patch, the claim that the standby's polling of pg_wal
is not documented or recommended is not true; it is actually documented [2].
Whether or not we change the docs to be something like [3] is a
separate discussion.
[1]:
/*
* We just successfully read a file in pg_wal. We prefer files in
* the archive over ones in pg_wal, so try the next file again
* from the archive first.
*/
[2]: https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION
The standby server will also attempt to restore any WAL found in the
standby cluster's pg_wal directory. That typically happens after a
server restart, when the standby replays again WAL that was streamed
from the primary before the restart, but you can also manually copy
files to pg_wal at any time to have them replayed.
[3]:
The standby server will also attempt to restore any WAL found in the
standby cluster's pg_wal directory. That typically happens after a
server restart, when the standby replays again WAL that was streamed
from the primary before the restart, but you can also manually copy
files to pg_wal at any time to have them replayed. However, copying of
WAL files manually is not recommended.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch
From 171e11088cca63e99726b49b1b9b408eed81f299 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sun, 9 Oct 2022 08:11:10 +0000
Subject: [PATCH v8] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 ++++++
doc/src/sgml/high-availability.sgml | 7 +
src/backend/access/transam/xlogrecovery.c | 132 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 +++++++++++++++++
7 files changed, 312 insertions(+), 15 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 66312b53b8..85baac9bbb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4847,6 +4847,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of time after which the standby attempts to switch the WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). If the standby fails to switch to stream mode, it falls back
+ to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is unreachable for a long
+ time. To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode only when fetching from
+ the WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b2b3129397..e38ce258e7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -636,6 +636,13 @@ protocol to make nodes agree on a serializable transactional order.
<filename>pg_wal</filename> at any time to have them replayed.
</para>
+ <para>
+ The standby server can attempt to switch to streaming replication after
+ reading WAL from archive, see
+ <xref linkend="guc-streaming-replication-retry-interval"/> for more
+ details.
+ </para>
+
<para>
At startup, the standby begins by restoring all WAL available in the
archive location, calling <varname>restore_command</varname>. Once it
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..d0939f9b0b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -298,6 +305,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3416,8 +3430,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3437,6 +3453,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. If
+ * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+ * it falls back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3460,19 +3481,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3601,15 +3623,30 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ elog(DEBUG2,
+ "switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval);
+ else
+ elog(DEBUG2, "switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success");
+ }
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
+ switchSource = false;
switch (currentSource)
{
@@ -3632,13 +3669,21 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ switchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read from pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (switchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else if (currentSource == XLOG_FROM_ARCHIVE)
+ readFrom = XLOG_FROM_ANY;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3874,6 +3919,63 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ elog(DEBUG2,
+ "trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval);
+
+ shouldSwitchSource = true;
+ }
+ else
+ shouldSwitchSource = false;
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ shouldSwitchSource = false;
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..df09125611 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3065,6 +3065,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..97bc00d5e6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0e3e246bd2..8c5be66946 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..d33ca9635c
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to .* after fetching WAL from .* for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Sun, Oct 09, 2022 at 02:39:47PM +0530, Bharath Rupireddy wrote:
We can give it a chance to restore from pg_wal before switching to
streaming so as not to change any behaviour of the state machine. But
definitely not by setting currentSource to XLOG_FROM_WAL; we basically
never explicitly set currentSource to XLOG_FROM_WAL, other than when
not in archive recovery, i.e., when InArchiveRecovery is false. Also, see
the comment [1].
Instead, the simplest would be to just pass XLOG_FROM_WAL to
XLogFileReadAnyTLI() when we're about to switch the source to stream
mode. This doesn't change the existing behaviour.
It might be more consistent with existing behavior, but one thing I hadn't
considered is that it might make your proposed feature ineffective when
users are copying files straight into pg_wal. IIUC as long as the files
are present in pg_wal, the source-switch logic won't kick in.
Unrelated to this patch: the claim that the standby's polling of pg_wal is not
documented or recommended is not true; it is actually documented [2].
Whether or not we change the docs to be something like [3] is a
separate discussion.
I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch. The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Instead, the simplest would be to just pass XLOG_FROM_WAL to
XLogFileReadAnyTLI() when we're about to switch the source to stream
mode. This doesn't change the existing behaviour.
It might be more consistent with existing behavior, but one thing I hadn't
considered is that it might make your proposed feature ineffective when
users are copying files straight into pg_wal. IIUC as long as the files
are present in pg_wal, the source-switch logic won't kick in.
It happens even now, that is, the server will not switch from the
archive to streaming mode after a failure if someone is continuously
copying WAL files into the pg_wal directory. I have not personally seen
anyone or any service doing that, but that doesn't mean it can't happen.
They might do it for some purpose, such as 1) to quickly bring back in
sync a standby that's lagging behind the primary after the archive
connection and/or streaming replication connection is broken but many
WAL files are left over on the primary, or 2) before promoting a standby
that's lagging behind the primary for failover or other purposes.
However, I'm not sure if someone does these things on production
servers.
Unrelated to this patch: the claim that the standby's polling of pg_wal is not
documented or recommended is not true; it is actually documented [2].
Whether or not we change the docs to be something like [3] is a
separate discussion.
I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch. The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.
I don't think it's a good idea to remove that completely. As said
above, it might help someone, we never know.
I think for this feature, we just need to decide on whether or not
we'd allow pg_wal polling before switching to streaming mode. If we
allow it like in the v8 patch, we can document the behavior.
Thoughts?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch. The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.
I don't think it's a good idea to remove that completely. As said
above, it might help someone, we never know.
It would be great to hear whether anyone is using this functionality. If
no one is aware of existing usage and there is no interest in keeping it
around, I don't think it would be unreasonable to remove it in v16.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 11, 2022 at 8:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch. The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.
I don't think it's a good idea to remove that completely. As said
above, it might help someone, we never know.
It would be great to hear whether anyone is using this functionality. If
no one is aware of existing usage and there is no interest in keeping it
around, I don't think it would be unreasonable to remove it in v16.
It seems like exhausting all the WAL in pg_wal before switching to
streaming after failing to fetch from archive is unremovable. I found
this after experimenting with it, here are my findings:
1. The standby has to recover initial WAL files in the pg_wal
directory even for the normal post-restart/first-time-start case, I
mean, in non-crash recovery case.
2. The standby received WAL files from primary (walreceiver just
writes and flushes the received WAL to WAL files under pg_wal)
pretty-fast and/or standby recovery is slow, say both the standby
connection to primary and archive connection are broken for whatever
reasons, then it has WAL files to recover in pg_wal directory.
I think the fundamental behaviour for the standby is that it has to
fully recover to the end of WAL under pg_wal no matter who copies WAL
files there. I fully understand the consequences of manually copying
WAL files into pg_wal, for that matter, manually copying/tinkering any
other files into/under the data directory is something we don't
recommend or encourage.
In summary, the standby state machine in WaitForWALToBecomeAvailable()
exhausts all the WAL in pg_wal before switching to streaming after
failing to fetch from archive. The v8 patch proposed upthread deviates
from this behaviour. Hence, attaching v9 patch that keeps the
behaviour as-is, that means, the standby exhausts all the WAL in
pg_wal before switching to streaming after fetching WAL from archive
for at least streaming_replication_retry_interval milliseconds.
Please review the v9 patch further.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v9-0001-Allow-standby-to-switch-WAL-source-from-archive-t.patch
From 72f64424863575d5bb90af977566b58b33e81aa3 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 17 Oct 2022 15:32:17 +0000
Subject: [PATCH v9] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. However, exhaust all the WAL present in
pg_wal before switching. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 46 +++++
doc/src/sgml/high-availability.sgml | 7 +
src/backend/access/transam/xlogrecovery.c | 158 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++
7 files changed, 337 insertions(+), 17 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 66312b53b8..17dc961300 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4847,6 +4847,52 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). However, exhaust all the WAL present in pg_wal before
+ switching. If the standby fails to switch to stream mode, it falls
+ back to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b2b3129397..e38ce258e7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -636,6 +636,13 @@ protocol to make nodes agree on a serializable transactional order.
<filename>pg_wal</filename> at any time to have them replayed.
</para>
+ <para>
+ The standby server can attempt to switch to streaming replication after
+ reading WAL from archive, see
+ <xref linkend="guc-streaming-replication-retry-interval"/> for more
+ details.
+ </para>
+
<para>
At startup, the standby begins by restoring all WAL available in the
archive location, calling <varname>restore_command</varname>. Once it
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..6faadf4c13 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -298,6 +305,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3416,8 +3430,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3437,6 +3454,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3460,19 +3483,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3601,9 +3625,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG2,
+ (errmsg("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG2,
+ (errmsg("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3611,6 +3650,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3632,13 +3678,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3646,6 +3703,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * We have exhausted the WAL in pg_wal, ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3874,6 +3939,61 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to the recovery
+ * time on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG2,
+ (errmsg("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval),
+ errdetail("However, all the WAL present in pg_wal is exhausted before switching.")));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4199,7 +4319,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4212,7 +4336,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..df09125611 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3065,6 +3065,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..97bc00d5e6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0e3e246bd2..8c5be66946 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..0bc988df19
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Tue, Oct 18, 2022 at 11:02 Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Tue, Oct 11, 2022 at 8:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch. The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.
I don't think it's a good idea to remove that completely. As said
above, it might help someone, we never know.
It would be great to hear whether anyone is using this functionality. If
no one is aware of existing usage and there is no interest in keeping it
around, I don't think it would be unreasonable to remove it in v16.
It seems like exhausting all the WAL in pg_wal before switching to
streaming after failing to fetch from archive is unremovable. I found
this after experimenting with it, here are my findings:
1. The standby has to recover initial WAL files in the pg_wal
directory even for the normal post-restart/first-time-start case, I
mean, in non-crash recovery case.
2. The standby received WAL files from primary (walreceiver just
writes and flushes the received WAL to WAL files under pg_wal)
pretty-fast and/or standby recovery is slow, say both the standby
connection to primary and archive connection are broken for whatever
reasons, then it has WAL files to recover in pg_wal directory.
I think the fundamental behaviour for the standby is that it has to
fully recover to the end of WAL under pg_wal no matter who copies WAL
files there. I fully understand the consequences of manually copying
WAL files into pg_wal, for that matter, manually copying/tinkering any
other files into/under the data directory is something we don't
recommend or encourage.
In summary, the standby state machine in WaitForWALToBecomeAvailable()
exhausts all the WAL in pg_wal before switching to streaming after
failing to fetch from archive. The v8 patch proposed upthread deviates
from this behaviour. Hence, attaching v9 patch that keeps the
behaviour as-is, that means, the standby exhausts all the WAL in
pg_wal before switching to streaming after fetching WAL from archive
for at least streaming_replication_retry_interval milliseconds.
Please review the v9 patch further.
Thanks for the updated patch.
While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.
To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:
'tap': {
'tests': [
't/001_basic.pl',
],
},
For some additional details please see this Wiki article:
https://wiki.postgresql.org/wiki/Meson_for_patch_authors
For more information on the meson build system for PostgreSQL see:
https://wiki.postgresql.org/wiki/Meson
Regards
Ian Barwick
On Wed, Nov 16, 2022 at 9:38 AM Ian Lawrence Barwick <barwick@gmail.com> wrote:
While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.
Thanks for pointing it out. Yeah, the test wasn't picking up on meson
builds. I added the new test file name in
src/test/recovery/meson.build.
I'm attaching the v10 patch for further review.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From 70323a5c96a889a21a5339c531193b6440d2e270 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 16 Nov 2022 05:34:41 +0000
Subject: [PATCH v10] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be efficient and cheaper because network
latencies, disk IO cost might differ on the archive as compared
to primary and often the archive may sit far from standby
impacting recovery performance on the standby. Hence reading WAL
from the primary, by setting this parameter, as opposed to the
archive enables the standby to catch up with the primary sooner
thus reducing replication lag and avoiding WAL files accumulation
on the primary.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication. However, exhaust all the WAL present in
pg_wal before switching. If the standby fails to switch to stream
mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 46 +++++
doc/src/sgml/high-availability.sgml | 7 +
src/backend/access/transam/xlogrecovery.c | 158 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/034_wal_source_switch.pl | 126 ++++++++++++++
8 files changed, 338 insertions(+), 17 deletions(-)
create mode 100644 src/test/recovery/t/034_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e48..8a3a9ff296 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4852,6 +4852,52 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). However, exhaust all the WAL present in pg_wal before
+ switching. If the standby fails to switch to stream mode, it falls
+ back to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b2b3129397..e38ce258e7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -636,6 +636,13 @@ protocol to make nodes agree on a serializable transactional order.
<filename>pg_wal</filename> at any time to have them replayed.
</para>
+ <para>
+ The standby server can attempt to switch to streaming replication after
+ reading WAL from archive, see
+ <xref linkend="guc-streaming-replication-retry-interval"/> for more
+ details.
+ </para>
+
<para>
At startup, the standby begins by restoring all WAL available in the
archive location, calling <varname>restore_command</varname>. Once it
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..6faadf4c13 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -298,6 +305,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3416,8 +3430,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3437,6 +3454,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3460,19 +3483,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3601,9 +3625,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG2,
+ (errmsg("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG2,
+ (errmsg("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3611,6 +3650,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3632,13 +3678,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3646,6 +3703,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * We have exhausted the WAL in pg_wal, ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3874,6 +3939,61 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to the recovery
+ * time on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG2,
+ (errmsg("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval),
+ errdetail("However, all the WAL present in pg_wal is exhausted before switching.")));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4199,7 +4319,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4212,7 +4336,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 836b49484a..6492248d19 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3063,6 +3063,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..97bc00d5e6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 0e3e246bd2..8c5be66946 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b0e398363f..3bd4ec4b37 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -38,6 +38,7 @@ tests += {
't/031_recovery_conflict.pl',
't/032_relfilenode_reuse.pl',
't/033_replay_tsp_drops.pl',
+ 't/034_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/034_wal_source_switch.pl b/src/test/recovery/t/034_wal_source_switch.pl
new file mode 100644
index 0000000000..0bc988df19
--- /dev/null
+++ b/src/test/recovery/t/034_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# Find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
In summary, the standby state machine in WaitForWALToBecomeAvailable()
exhausts all the WAL in pg_wal before switching to streaming after
failing to fetch from archive. The v8 patch proposed upthread deviates
from this behaviour. Hence, attaching v9 patch that keeps the
behaviour as-is, that means, the standby exhausts all the WAL in
pg_wal before switching to streaming after fetching WAL from archive
for at least streaming_replication_retry_interval milliseconds.
I think this is okay. The following comment explains why archives are
preferred over existing files in pg_wal:
* When doing archive recovery, we always prefer an archived log file even
* if a file of the same name exists in XLOGDIR. The reason is that the
* file in XLOGDIR could be an old, un-filled or partly-filled version
* that was copied and restored as part of backing up $PGDATA.
With your patch, we might replay one of these "old" files in pg_wal instead
of the complete version of the file from the archives, but I think that is
still correct. We'll just replay whatever exists in pg_wal (which may be
un-filled or partly-filled) before attempting streaming. If that fails,
we'll go back to trying the archives again.
Would you mind testing this scenario?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
In summary, the standby state machine in WaitForWALToBecomeAvailable()
exhausts all the WAL in pg_wal before switching to streaming after
failing to fetch from archive. The v8 patch proposed upthread deviates
from this behaviour. Hence, attaching v9 patch that keeps the
behaviour as-is, that means, the standby exhausts all the WAL in
pg_wal before switching to streaming after fetching WAL from archive
for at least streaming_replication_retry_interval milliseconds.
I think this is okay. The following comment explains why archives are
preferred over existing files in pg_wal:
* When doing archive recovery, we always prefer an archived log file even
* if a file of the same name exists in XLOGDIR. The reason is that the
* file in XLOGDIR could be an old, un-filled or partly-filled version
* that was copied and restored as part of backing up $PGDATA.
With your patch, we might replay one of these "old" files in pg_wal instead
of the complete version of the file from the archives,
That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?
On HEAD, after failing to read from the archive, exhaust all wal from
pg_wal and then switch to streaming mode. With the patch, after
reading from the archive for at least
streaming_replication_retry_interval milliseconds, exhaust all wal
from pg_wal and then switch to streaming mode.
but I think that is
still correct. We'll just replay whatever exists in pg_wal (which may be
un-filled or partly-filled) before attempting streaming. If that fails,
we'll go back to trying the archives again.
Would you mind testing this scenario?
How about something like below for testing the above scenario? If it
looks okay, I can add it as a new TAP test file.
1. Generate WAL files f1 and f2 and archive them.
2. Check the replay lsn and WAL file name on the standby, when it
replays upto f2, stop the standby.
3. Set recovery to fail on the standby, and stop the standby.
4. Generate f3, f4 (partially filled) on the primary.
5. Manually copy f3, f4 to the standby's pg_wal.
6. Start the standby, since recovery is set to fail, and there're new
WAL files (f3, f4) under its pg_wal, it must replay those WAL files
(check the replay lsn and WAL file name, it must be f4) before
switching to streaming.
7. Generate f5 on the primary.
8. The standby should receive f5 and replay it (check the replay lsn
and WAL file name, it must be f5).
9. Set streaming to fail on the standby and set recovery to succeed.
10. Generate f6 on the primary.
11. The standby should receive f6 via archive and replay it (check the
replay lsn and WAL file name, it must be f6).
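Not the final test, just a minimal sketch of how steps 2, 5, 6 and 8 might be
checked, reusing the scaffolding from the attached TAP test ($node_primary,
$node_standby, Test::More); the segment names are placeholders for f3-f5 and
the failure injection for recovery/streaming (steps 3 and 9) is omitted:

use File::Copy qw(copy);

# Placeholder segment names standing in for f3, f4 and f5 from the steps above.
my ($f3, $f4, $f5) = ('000000010000000000000003',
                      '000000010000000000000004',
                      '000000010000000000000005');

# Step 2: which WAL file has the standby replayed up to so far (should be f2)?
my $replayed = $node_standby->safe_psql('postgres',
    "SELECT pg_walfile_name(pg_last_wal_replay_lsn())");

# Step 5: hand-copy f3 and f4 (f4 partially filled) into the standby's pg_wal.
for my $seg ($f3, $f4)
{
    copy($node_primary->data_dir . "/pg_wal/$seg",
         $node_standby->data_dir . "/pg_wal/$seg")
      or die "copy of $seg failed: $!";
}

# Steps 6 and 8: after a restart the standby must replay up to f4 from its own
# pg_wal, then receive and replay up to f5 via streaming.
$node_standby->start;
$node_primary->wait_for_catchup($node_standby);
is( $node_standby->safe_psql('postgres',
        "SELECT pg_walfile_name(pg_last_wal_replay_lsn())"),
    $f5, 'standby replayed f5 after switching to streaming');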
If needed, we can look out for these messages to confirm it works as expected:
elog(DEBUG2, "switched WAL source from %s to %s after %s",
xlogSourceNames[oldSource], xlogSourceNames[currentSource],
lastSourceFailed ? "failure" : "success");
ereport(LOG,
(errmsg("restored log file \"%s\" from archive",
xlogfname)));
Essentially, it covers what the documentation
https://www.postgresql.org/docs/devel/warm-standby.html says:
"In standby mode, the server continuously applies WAL received from
the primary server. The standby server can read WAL from a WAL archive
(see restore_command) or directly from the primary over a TCP
connection (streaming replication). The standby server will also
attempt to restore any WAL found in the standby cluster's pg_wal
directory. That typically happens after a server restart, when the
standby replays again WAL that was streamed from the primary before
the restart, but you can also manually copy files to pg_wal at any
time to have them replayed."
Thoughts?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
With your patch, we might replay one of these "old" files in pg_wal instead
of the complete version of the file from the archives,
That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?
My point is that on HEAD, we will always prefer a complete archive file.
With your patch, we might instead choose to replay an old file in pg_wal
because we are artificially advancing the state machine. IOW even if
there's a complete archive available, we might not use it. This is a
behavior change, but I think it is okay.
Would you mind testing this scenario?
How about something like below for testing the above scenario? If it
looks okay, I can add it as a new TAP test file.
1. Generate WAL files f1 and f2 and archive them.
2. Check the replay lsn and WAL file name on the standby, when it
replays upto f2, stop the standby.
3. Set recovery to fail on the standby, and stop the standby.
4. Generate f3, f4 (partially filled) on the primary.
5. Manually copy f3, f4 to the standby's pg_wal.
6. Start the standby, since recovery is set to fail, and there're new
WAL files (f3, f4) under its pg_wal, it must replay those WAL files
(check the replay lsn and WAL file name, it must be f4) before
switching to streaming.
7. Generate f5 on the primary.
8. The standby should receive f5 and replay it (check the replay lsn
and WAL file name, it must be f5).
9. Set streaming to fail on the standby and set recovery to succeed.
10. Generate f6 on the primary.
11. The standby should receive f6 via archive and replay it (check the
replay lsn and WAL file name, it must be f6).
I meant testing the scenario where there's an old file in pg_wal, a
complete file in the archives, and your new GUC forces replay of the
former. This might be difficult to do in a TAP test. Ultimately, I just
want to validate the assumptions discussed above.
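To illustrate the setup I have in mind, a rough sketch using the same TAP
scaffolding as your attached test; the low streaming_replication_retry_interval,
the slow restore_command and the log check are assumptions of the idea, not
working code:

use File::Copy qw(copy);

$node_standby->stop;

# Grab the current, still partially-filled segment from the primary's pg_wal
# and plant it in the standby's pg_wal as the "old" copy.
my $seg = $node_primary->safe_psql('postgres',
    "SELECT pg_walfile_name(pg_current_wal_lsn())");
copy($node_primary->data_dir . "/pg_wal/$seg",
     $node_standby->data_dir . "/pg_wal/$seg") or die "copy failed: $!";

# Now let the primary fill and archive the complete version of that segment.
$node_primary->safe_psql('postgres',
    "CREATE TABLE t1 (); DROP TABLE t1; SELECT pg_switch_wal();");

# With a low streaming_replication_retry_interval (and a slow-enough
# restore_command), the standby should replay the stale copy from pg_wal
# rather than restoring the complete segment from the archive.
$node_standby->start;
$node_primary->wait_for_catchup($node_standby);
my $log = PostgreSQL::Test::Utils::slurp_file($node_standby->logfile);
unlike($log, qr/restored log file "$seg" from archive/,
    'stale pg_wal copy was preferred over the archived segment');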
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 19, 2023 at 6:20 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
With your patch, we might replay one of these "old" files in pg_wal instead
of the complete version of the file from the archives,
That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?
My point is that on HEAD, we will always prefer a complete archive file.
With your patch, we might instead choose to replay an old file in pg_wal
because we are artificially advancing the state machine. IOW even if
there's a complete archive available, we might not use it. This is a
behavior change, but I think it is okay.
Oh, yeah, I agree that it's okay, because manually copying WAL files
directly to pg_wal (which eventually get replayed before switching to
streaming) isn't recommended anyway for production-level servers. I
think we covered it in the documentation that the standby exhausts all
the WAL present in pg_wal before switching. Isn't that enough?
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). However, exhaust all the WAL present in pg_wal before
+ switching. If the standby fails to switch to stream mode, it falls
+ back to archive mode.
Would you mind testing this scenario?
I meant testing the scenario where there's an old file in pg_wal, a
complete file in the archives, and your new GUC forces replay of the
former. This might be difficult to do in a TAP test. Ultimately, I just
want to validate the assumptions discussed above.
I think testing the scenario [1] is achievable. I could write a TAP
test for it - https://github.com/BRupireddy/postgres/tree/prefer_archived_wal_v1.
It's a bit flaky and needs a little more work to stabilize: (1) writing
a custom restore_command script that sleeps only after fetching an
existing WAL file from the archive, not for a history file or a
non-existent WAL file, and (2) finding a command-line way to sleep on
Windows (see the sketch after the footnote below). It seems doable,
though. I can spend some more time on it if the test is considered
worth adding to core, perhaps discussing it separately from this thread.
[1]: RestoreArchivedFile():
/*
* When doing archive recovery, we always prefer an archived log file even
* if a file of the same name exists in XLOGDIR. The reason is that the
* file in XLOGDIR could be an old, un-filled or partly-filled version
* that was copied and restored as part of backing up $PGDATA.
*
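For what it's worth, a rough sketch of such a restore_command wrapper; the
script name, the sleep duration and the invocation are arbitrary placeholders,
and doing the sleeping in Perl itself would also sidestep the Windows
command-line sleep problem (e.g. restore_command = 'perl slow_restore.pl
/path/to/archive %f %p'):

#!/usr/bin/perl
# Hypothetical "slow archive" wrapper: copy %f from the archive to %p, but
# sleep only while restoring an actual WAL segment so that
# streaming_replication_retry_interval can elapse mid-archive-recovery.
use strict;
use warnings;
use File::Copy qw(copy);

my ($archive_dir, $fname, $target) = @ARGV;
my $src = "$archive_dir/$fname";

# Not present in the archive: fail fast so recovery can look elsewhere,
# without delaying history-file probes or end-of-archive detection.
exit 1 unless -f $src;

# Delay only for real WAL segment files (24 hex characters), not for
# .history or .backup files.
sleep 2 if $fname =~ /^[0-9A-F]{24}$/;

copy($src, $target) or exit 1;
exit 0;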
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 11:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I'm attaching the v10 patch for further review.
Needed a rebase. I'm attaching the v11 patch for further review.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v11-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/x-patch)
From 6cf6c9bc0a6d6c4325459cddc22fd6fcdc970a03 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 24 Feb 2023 04:38:13 +0000
Subject: [PATCH v11] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (getting WAL
from the primary) only when receiving from the WAL archive finishes
(no more WAL left there) or fails for any reason. Reading WAL from
the archive may not always be as efficient and cheap as streaming it,
because network latencies and disk I/O costs can differ on the archive
compared to the primary, and the archive often sits far from the
standby, impacting recovery performance. Hence reading WAL from the
primary, by setting this parameter, as opposed to the archive, enables
the standby to catch up with the primary sooner, reducing replication
lag and avoiding WAL accumulation on the primary.
This feature adds a new GUC that specifies the amount of time after
which the standby attempts to switch the WAL source from WAL archive
to streaming replication. However, the standby exhausts all the WAL
present in pg_wal before switching. If the standby fails to switch to
stream mode, it falls back to archive mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 46 +++++
doc/src/sgml/high-availability.sgml | 7 +
src/backend/access/transam/xlogrecovery.c | 158 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/035_wal_source_switch.pl | 126 ++++++++++++++
8 files changed, 338 insertions(+), 17 deletions(-)
create mode 100644 src/test/recovery/t/035_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..fd400ac662 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4884,6 +4884,52 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (get WAL from
+ primary). However, exhaust all the WAL present in pg_wal before
+ switching. If the standby fails to switch to stream mode, it falls
+ back to archive mode.
+ If this value is specified without units, it is taken as milliseconds.
+ The default is five minutes (<literal>5min</literal>).
+ With a lower setting of this parameter, the standby makes frequent
+ WAL source switch attempts when the primary is lost for quite longer.
+ To avoid this, set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode, only when receive from
+ WAL archive finishes (no more WAL left there) or fails for any reason.
+ This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server
+ command line.
+ </para>
+ <para>
+ Reading WAL from archive may not always be efficient and cheaper
+ because network latencies, disk IO cost might differ on the archive as
+ compared to primary and often the archive may sit far from standby
+ impacting recovery performance on the standby. Hence reading WAL
+ from the primary, by setting this parameter, as opposed to the archive
+ enables the standby to catch up with the primary sooner thus reducing
+ replication lag and avoiding WAL files accumulation on the primary.
+ </para>
+ <para>
+ Note that the standby may not always attempt to switch source from
+ WAL archive to streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals.
+ For example, if the parameter is set to <literal>1min</literal> and
+ fetching from WAL archive takes <literal>5min</literal>, then the
+ source switch attempt happens for the next WAL after current WAL is
+ fetched from WAL archive and applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f180607528..0323f4cc43 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -636,6 +636,13 @@ protocol to make nodes agree on a serializable transactional order.
<filename>pg_wal</filename> at any time to have them replayed.
</para>
+ <para>
+ The standby server can attempt to switch to streaming replication after
+ reading WAL from archive, see
+ <xref linkend="guc-streaming-replication-retry-interval"/> for more
+ details.
+ </para>
+
<para>
At startup, the standby begins by restoring all WAL available in the
archive location, calling <varname>restore_command</varname>. Once it
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..199b313dc1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,12 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +97,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +304,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +452,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3441,8 +3455,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3462,6 +3479,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3485,19 +3508,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
- * happened outside this function, e.g when a CRC check fails on a
- * record, or within this loop.
+ * First check if we failed to read from the current source or we
+ * intentionally would want to switch the source from archive to
+ * primary, and advance the state machine if so. The failure to read
+ * might've happened outside this function, e.g when a CRC check fails
+ * on a record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3629,9 +3653,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG2,
+ (errmsg("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG2,
+ (errmsg("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3639,6 +3678,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3660,13 +3706,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3674,6 +3731,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * We have exhausted the WAL in pg_wal, ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3904,6 +3969,61 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby should make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from the
+ * archive may not always be efficient and cheaper because network latencies,
+ * disk IO cost might differ on the archive as compared to the primary and
+ * often the archive may sit far from the standby - all adding to recovery
+ * performance on the standby. Hence reading WAL from the primary as opposed to
+ * the archive enables the standby to catch up with the primary sooner thus
+ * reducing replication lag and avoiding WAL files accumulation on the primary.
+ *
+ * We are here for any of the following reasons:
+ * 1) standby in initial recovery after start/restart.
+ * 2) standby stopped streaming from primary because of connectivity issues
+ * with the primary (either due to network issues or crash in the primary or
+ * something else) or walreceiver got killed or crashed for whatever reasons.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG2,
+ (errmsg("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval),
+ errdetail("However, all the WAL present in pg_wal is exhausted before switching.")));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4229,7 +4349,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4242,7 +4366,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..c741b02297 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3128,6 +3128,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..ebb73e14de 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -351,6 +351,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 47c29350f5..dfa0301d61 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 59465b97f3..ad964c4cb0 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -40,6 +40,7 @@ tests += {
't/032_relfilenode_reuse.pl',
't/033_replay_tsp_drops.pl',
't/034_create_database.pl',
+ 't/035_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/035_wal_source_switch.pl b/src/test/recovery/t/035_wal_source_switch.pl
new file mode 100644
index 0000000000..c28fba5d88
--- /dev/null
+++ b/src/test/recovery/t/035_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# Find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Fri, Feb 24, 2023 at 10:26 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Nov 16, 2022 at 11:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I'm attaching the v10 patch for further review.
Needed a rebase. I'm attaching the v11 patch for further review.
Needed a rebase, so attaching the v12 patch. I word-smithed comments
and docs a bit.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v12-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/x-patch)
From 80a982fb1c0d0737c5144d8ef50bb5e3ff845d07 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 25 Apr 2023 08:36:48 +0000
Subject: [PATCH v12] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc., all impacting recovery
performance on standby. And, while standby is reading WAL from
archive, primary accumulates WAL because the standby's replication
slot stays inactive. To avoid these problems, one can use this
parameter to make standby switch to stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 ++++++
doc/src/sgml/high-availability.sgml | 15 +-
src/backend/access/transam/xlogrecovery.c | 144 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/037_wal_source_switch.pl | 126 +++++++++++++++
8 files changed, 328 insertions(+), 20 deletions(-)
create mode 100644 src/test/recovery/t/037_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b56f073a91..8050f981e9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4923,6 +4923,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (getting WAL from
+ primary). However, standby exhausts all the WAL present in pg_wal
+ before switching. If standby fails to switch to stream mode, it falls
+ back to archive mode. If this parameter's value is specified without
+ units, it is taken as milliseconds. Default is five minutes
+ (<literal>5min</literal>). With a lower value for this parameter,
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, standby typically switches to
+ stream mode only when receive from WAL archive finishes (no more WAL
+ left there) or fails for any reason. This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server command
+ line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact <varname>streaming_replication_retry_interval</varname>
+ intervals. For example, if the parameter is set to <literal>1min</literal>
+ and fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc., all impacting recovery performance on
+ standby. And, while standby is reading WAL from archive, primary
+ accumulates WAL because the standby's replication slot stays inactive.
+ To avoid these problems, one can use this parameter to make standby
+ switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index cf61b2ed2a..3b592f48c8 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..dbe2520dcc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,11 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +96,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +303,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +451,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3460,8 +3473,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3481,6 +3497,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3504,19 +3526,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3648,9 +3671,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG2,
+ (errmsg("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG2,
+ (errmsg("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3658,6 +3696,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3679,13 +3724,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3693,6 +3749,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Exhausted all the WAL in pg_wal, now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3923,6 +3987,54 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby can make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from archive
+ * may not always be as efficient and fast as reading from primary. This can be
+ * due to the differences in disk types, IO costs, network latencies etc., all
+ * impacting recovery performance on standby. And, while standby is reading WAL
+ * from archive, primary accumulates WAL because the standby's replication slot
+ * stays inactive. To avoid these problems, we try to make standby switch to
+ * stream mode sooner.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG2,
+ (errmsg("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval)));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4248,7 +4360,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4261,7 +4377,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 2f42cebaf6..a2116501a3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3159,6 +3159,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0609853995..6f6b95a996 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -355,6 +355,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 47c29350f5..dfa0301d61 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 2008958010..aaf71b9323 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -42,6 +42,7 @@ tests += {
't/034_create_database.pl',
't/035_standby_logical_decoding.pl',
't/036_truncated_dropped.pl',
+ 't/037_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/037_wal_source_switch.pl b/src/test/recovery/t/037_wal_source_switch.pl
new file mode 100644
index 0000000000..c28fba5d88
--- /dev/null
+++ b/src/test/recovery/t/037_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# Find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Tue, Apr 25, 2023 at 9:27 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Needed a rebase. I'm attaching the v11 patch for further review.
Needed a rebase, so attaching the v12 patch. I word-smithed comments
and docs a bit.
Needed a rebase. I'm attaching the v13 patch for further consideration.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v13-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/octet-stream)
From 0086efe82c2da6a696cf84cb1dbfdf33c297df74 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 21 Jul 2023 07:06:52 +0000
Subject: [PATCH v13] Allow standby to switch WAL source from archive to
streaming replication
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc., all impacting recovery
performance on standby. And, while standby is reading WAL from
archive, primary accumulates WAL because the standby's replication
slot stays inactive. To avoid these problems, one can use this
parameter to make standby switch to stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 ++++++
doc/src/sgml/high-availability.sgml | 15 +-
src/backend/access/transam/xlogrecovery.c | 144 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/038_wal_source_switch.pl | 126 +++++++++++++++
8 files changed, 328 insertions(+), 20 deletions(-)
create mode 100644 src/test/recovery/t/038_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 11251fa05e..474a215bdc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4854,6 +4854,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (getting WAL from
+ primary). However, standby exhausts all the WAL present in pg_wal
+ before switching. If standby fails to switch to stream mode, it falls
+ back to archive mode. If this parameter's value is specified without
+ units, it is taken as milliseconds. Default is five minutes
+ (<literal>5min</literal>). With a lower value for this parameter,
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, standby typically switches to
+ stream mode only when receive from WAL archive finishes (no more WAL
+ left there) or fails for any reason. This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server command
+ line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact <varname>streaming_replication_retry_interval</varname>
+ intervals. For example, if the parameter is set to <literal>1min</literal>
+ and fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc., all impacting recovery performance on
+ standby. And, while standby is reading WAL from archive, primary
+ accumulates WAL because the standby's replication slot stays inactive.
+ To avoid these problems, one can use this parameter to make standby
+ switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5f9257313a..b04dae7ce0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..49c67723b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,11 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +96,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +303,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +451,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3460,8 +3473,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3481,6 +3497,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3504,19 +3526,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3648,9 +3671,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG2,
+ (errmsg("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG2,
+ (errmsg("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3658,6 +3696,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3679,13 +3724,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3693,6 +3749,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Exhausted all the WAL in pg_wal, now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3923,6 +3987,54 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby can make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from archive
+ * may not always be as efficient and fast as reading from primary. This can be
+ * due to the differences in disk types, IO costs, network latencies etc., all
+ * impacting recovery performance on standby. And, while standby is reading WAL
+ * from archive, primary accumulates WAL because the standby's replication slot
+ * stays inactive. To avoid these problems, we try to make standby switch to
+ * stream mode sooner.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG2,
+ (errmsg("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval)));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4248,7 +4360,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4261,7 +4377,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f9dba43b8c..9322e5d694 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3156,6 +3156,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c768af9a73..87d66ae81e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -354,6 +354,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 47c29350f5..dfa0301d61 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e7328e4894..634fc9a2b2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -43,6 +43,7 @@ tests += {
't/035_standby_logical_decoding.pl',
't/036_truncated_dropped.pl',
't/037_invalid_database.pl',
+ 't/038_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/038_wal_source_switch.pl b/src/test/recovery/t/038_wal_source_switch.pl
new file mode 100644
index 0000000000..c28fba5d88
--- /dev/null
+++ b/src/test/recovery/t/038_wal_source_switch.pl
@@ -0,0 +1,126 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 100ms
+wal_recycle = off
+log_min_messages = 'debug2'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Stop standby
+$node_standby->stop;
+
+# Advance WAL by 100 segments (= 100MB) on primary
+advance_wal($node_primary, 100);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ q|SELECT COUNT(*) >= 100 FROM pg_ls_waldir()|, 't');
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+$node_primary->wait_for_catchup($node_standby);
+
+ok(find_in_log(
+ $node_standby,
+ qr/restored log file ".*" from archive/),
+ 'check that some of the WAL segments were fetched from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/trying to switch WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby tried to switch WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/switched WAL source to stream after fetching WAL from archive for at least .* milliseconds/),
+ 'check that standby actually switched WAL source to primary from archive');
+
+ok(find_in_log(
+ $node_standby,
+ qr/started streaming WAL from primary at .* on timeline .*/),
+ 'check that standby started streaming from primary');
+
+# Stop standby
+$node_standby->stop;
+
+# Stop primary
+$node_primary->stop;
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+# Find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+ my ($node, $pat, $off) = @_;
+
+ $off = 0 unless defined $off;
+ my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+ return 0 if (length($log) <= $off);
+
+ $log = substr($log, $off);
+
+ return $log =~ m/$pat/;
+}
+
+done_testing();
--
2.34.1
On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Needed a rebase. I'm attaching the v13 patch for further consideration.
Needed a rebase. I'm attaching the v14 patch. It also has the following changes:
- Ran pgindent on the new source code.
- Ran pgperltidy on the new TAP test.
- Improved the newly added TAP test a bit. Used the new wait_for_log
core TAP function in place of custom find_in_log.
Thoughts?
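As a side note for anyone skimming the diff: the switch from the custom
find_in_log helper to the core wait_for_log method boils down to the sketch
below. It assumes a TAP test that already has $node_standby running and, for
the first form, the find_in_log sub defined in the earlier test; the log
pattern is just the one used there, so treat this as illustration rather
than the exact patch hunk.
# Custom per-test helper from 038_wal_source_switch.pl: read the log file
# once and grep it for the pattern.
ok( find_in_log(
        $node_standby,
        qr/restored log file ".*" from archive/),
    'some WAL segments were fetched from archive');
# Core PostgreSQL::Test::Cluster method: note the current log size, then
# poll until the pattern appears past that offset, so the test waits for
# the message instead of checking just once.
my $offset = -s $node_standby->logfile;
$node_standby->wait_for_log(
    qr/restored log file ".*" from archive/, $offset);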
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From e6010e3b2e4c52d32a5d2f3eb2d59954617b221b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 21 Oct 2023 13:41:46 +0000
Subject: [PATCH v14] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc., all impacting recovery
performance on standby. And, while standby is reading WAL from
archive, primary accumulates WAL because the standby's replication
slot stays inactive. To avoid these problems, one can use this
parameter to make standby switch to stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Reported-by: SATYANARAYANA NARLAPURAM
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 45 ++++++
doc/src/sgml/high-availability.sgml | 15 +-
src/backend/access/transam/xlogrecovery.c | 146 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/040_wal_source_switch.pl | 130 ++++++++++++++++
8 files changed, 333 insertions(+), 21 deletions(-)
create mode 100644 src/test/recovery/t/040_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3839c72c86..3a18ba9b26 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4791,6 +4791,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from WAL archive to streaming replication (getting WAL from
+ primary). However, standby exhausts all the WAL present in pg_wal
+ before switching. If standby fails to switch to stream mode, it falls
+ back to archive mode. If this parameter's value is specified without
+ units, it is taken as milliseconds. Default is five minutes
+ (<literal>5min</literal>). With a lower value for this parameter,
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, standby typically switches to
+ stream mode only when receive from WAL archive finishes (no more WAL
+ left there) or fails for any reason. This parameter can only be set in
+ the <filename>postgresql.conf</filename> file or on the server command
+ line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact <varname>streaming_replication_retry_interval</varname>
+ intervals. For example, if the parameter is set to <literal>1min</literal>
+ and fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc., all impacting recovery performance on
+ standby. And, while standby is reading WAL from archive, primary
+ accumulates WAL because the standby's replication slot stays inactive.
+ To avoid these problems, one can use this parameter to make standby
+ switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5f9257313a..b04dae7ce0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index d6f2bb8286..c1a2a83a0b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -68,6 +68,11 @@
#define RECOVERY_COMMAND_FILE "recovery.conf"
#define RECOVERY_COMMAND_DONE "recovery.done"
+#define SwitchFromArchiveToStreamEnabled() \
+ (streaming_replication_retry_interval > 0 && \
+ StandbyMode && \
+ currentSource == XLOG_FROM_ARCHIVE)
+
/*
* GUC support
*/
@@ -91,6 +96,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +303,11 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine
+ * switches to XLOG_FROM_ARCHIVE.
+ */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +451,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool ShouldSwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3471,8 +3484,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3492,6 +3508,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3515,19 +3537,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3659,9 +3682,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we're switching to archive. */
+ if (SwitchFromArchiveToStreamEnabled())
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ if (switchSource)
+ ereport(DEBUG1,
+ (errmsg_internal("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[currentSource],
+ xlogSourceNames[oldSource],
+ streaming_replication_retry_interval)));
+ else
+ ereport(DEBUG1,
+ (errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ lastSourceFailed ? "failure" : "success")));
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3669,6 +3707,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
lastSourceFailed = false;
+ if (switchSource)
+ {
+ Assert(canSwitchSource == true);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
@@ -3690,13 +3735,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /*
+ * See if we can switch the source to streaming from archive.
+ */
+ if (!canSwitchSource)
+ canSwitchSource = ShouldSwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching the source to
+ * stream mode, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3704,6 +3760,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Exhausted all the WAL in pg_wal, now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3934,6 +3998,54 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * This function tells if a standby can make an attempt to read WAL from
+ * primary after reading from archive for at least
+ * streaming_replication_retry_interval milliseconds. Reading WAL from archive
+ * may not always be as efficient and fast as reading from primary. This can be
+ * due to the differences in disk types, IO costs, network latencies etc., all
+ * impacting recovery performance on standby. And, while standby is reading WAL
+ * from archive, primary accumulates WAL because the standby's replication slot
+ * stays inactive. To avoid these problems, we try to make standby switch to
+ * stream mode sooner.
+ */
+static bool
+ShouldSwitchWALSourceToPrimary(void)
+{
+ bool shouldSwitchSource = false;
+
+ if (!SwitchFromArchiveToStreamEnabled())
+ return shouldSwitchSource;
+
+ if (switched_to_archive_at > 0)
+ {
+ TimestampTz curr_time;
+
+ curr_time = GetCurrentTimestamp();
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time,
+ streaming_replication_retry_interval))
+ {
+ ereport(DEBUG1,
+ (errmsg_internal("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds",
+ xlogSourceNames[XLOG_FROM_STREAM],
+ xlogSourceNames[currentSource],
+ streaming_replication_retry_interval)));
+
+ shouldSwitchSource = true;
+ }
+ }
+ else if (switched_to_archive_at == 0)
+ {
+ /*
+ * Save the timestamp if we're about to fetch WAL from archive for the
+ * first time.
+ */
+ switched_to_archive_at = GetCurrentTimestamp();
+ }
+
+ return shouldSwitchSource;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
@@ -4259,7 +4371,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
continue;
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
+ /*
+ * When failed to read from archive, try reading from pg_wal, see
+ * below.
+ */
+ if (source == XLOG_FROM_ARCHIVE)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_ARCHIVE, true);
@@ -4272,7 +4388,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
}
}
- if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
+ if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL)
{
fd = XLogFileRead(segno, emode, tli,
XLOG_FROM_PG_WAL, true);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..b6dfa74e3f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3168,6 +3168,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..0e03887683 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -352,6 +352,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 47c29350f5..dfa0301d61 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 9d8039684a..06041a5f74 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -45,6 +45,7 @@ tests += {
't/037_invalid_database.pl',
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
+ 't/040_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/040_wal_source_switch.pl b/src/test/recovery/t/040_wal_source_switch.pl
new file mode 100644
index 0000000000..9388fa3a39
--- /dev/null
+++ b/src/test/recovery/t/040_wal_source_switch.pl
@@ -0,0 +1,130 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test for WAL source switch feature
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ wal_recycle = off
+));
+$node_primary->start;
+
+# Create an inactive replication slot to keep the WAL
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('rep1')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a standby linking to it using the replication slot
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'rep1'
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+streaming_replication_retry_interval = 10ms
+wal_recycle = off
+log_min_messages = 'debug1'
+));
+
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->stop;
+
+# Advance WAL by 10 segments (= 10MB) on primary
+advance_wal($node_primary, 10);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 10 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $node_primary->archive_dir;
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 10 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,10) AS a;");
+
+my $offset = -s $node_standby->logfile;
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? restored log file ".*" from archive/, $offset);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? trying to switch WAL source to stream after fetching WAL from archive for at least 10 milliseconds/,
+ $offset);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source to stream after fetching WAL from archive for at least 10 milliseconds/,
+ $offset);
+
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check the streamed data on the standby
+my $result =
+ $node_standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+
+is($result, '10', 'check streamed data on standby');
+
+$node_standby->stop;
+$node_primary->stop;
+
+#####################################
+# Advance WAL of $node by $n segments
+sub advance_wal
+{
+ my ($node, $n) = @_;
+
+ # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary.
+ for (my $i = 0; $i < $n; $i++)
+ {
+ $node->safe_psql('postgres',
+ "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");
+ }
+ return;
+}
+
+done_testing();
--
2.34.1
On Sat, Oct 21, 2023 at 11:59 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Needed a rebase. I'm attaching the v13 patch for further consideration.
Needed a rebase. I'm attaching the v14 patch. It also has the following changes:
- Ran pgindent on the new source code.
- Ran pgperltidy on the new TAP test.
- Improved the newly added TAP test a bit. Used the new wait_for_log
core TAP function in place of custom find_in_log.
Thoughts?
I took a closer look at v14 and came up with the following changes:
1. Used advance_wal introduced by commit c161ab74f7.
2. Simplified the core logic and new TAP tests.
3. Reworded the comments and docs.
4. Simplified new DEBUG messages.
I've attached the v15 patch for further review.
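As a usage note (not part of the patch text itself): with c161ab74f7 in the
tree, the hand-rolled advance_wal sub from v14 goes away and the test calls
the cluster method directly, roughly as below. Under the hood it is
essentially the same pg_switch_wal() loop as the v14 helper, just shared
infrastructure; the segment count here is arbitrary.
# Assumes a running $node_primary created by PostgreSQL::Test::Cluster.
# Switch WAL 5 times so the archive ends up with several segments for the
# standby to restore before the retry interval kicks in.
$node_primary->advance_wal(5);
# Then, as before, wait until the requested number of segments exists.
$node_primary->poll_query_until('postgres',
    "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
  or die "Timed out while waiting for primary to generate WAL";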
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From 8bdd3b999343c02ae01a7d15ef7fdd0be25623bd Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 28 Dec 2023 11:14:05 +0000
Subject: [PATCH v15] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 47 +++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/040_wal_source_switch.pl | 93 ++++++++++++++
8 files changed, 269 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/040_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5624ca884..04aa2fa8d2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4866,6 +4866,53 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in pg_wal
+ before switching. If the standby fails to switch to stream mode, it
+ falls back to archive mode. If this parameter value is specified
+ without units, it is taken as milliseconds. Default is
+ <literal>5min</literal>. With a lower value for this parameter, the
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, the standby typically switches to
+ stream mode only after receiving WAL from archive finishes (i.e., no
+ more WAL left there) or fails for any reason. This parameter can only
+ be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 9dd52ff275..35926e1df3 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6f4f81f992..e59057558e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3492,8 +3497,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3513,6 +3521,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3536,19 +3550,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3680,9 +3695,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3711,13 +3744,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3725,6 +3768,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3955,6 +4006,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..127029e3e4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3200,6 +3200,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b2809c711a..4dd27554aa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc74278..2e9fb8dfe6 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 9d8039684a..06041a5f74 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -45,6 +45,7 @@ tests += {
't/037_invalid_database.pl',
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
+ 't/040_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/040_wal_source_switch.pl b/src/test/recovery/t/040_wal_source_switch.pl
new file mode 100644
index 0000000000..5586019eae
--- /dev/null
+++ b/src/test/recovery/t/040_wal_source_switch.pl
@@ -0,0 +1,93 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf('postgresql.conf', qq(
+ primary_slot_name = 'standby_slot'
+ streaming_replication_retry_interval = 1ms
+ log_min_messages = 'debug1'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$node_primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $node_primary->archive_dir;
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $node_standby->logfile;
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $node_standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
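An aside for anyone reproducing this by hand rather than through the TAP
framework: besides matching the DEBUG line, the switch back to streaming can
also be observed through the standard pg_stat_wal_receiver view. A minimal
Perl sketch, assuming the patched standby from the test above ($node_standby)
is running:

    $node_standby->poll_query_until('postgres',
        "SELECT status = 'streaming' FROM pg_stat_wal_receiver;")
      or die "Timed out while waiting for the standby to resume streaming";

This only shows that a walreceiver is active again, not why the source
changed, so the log-based check in the test remains the stronger assertion.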
On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I took a closer look at v14 and came up with the following changes:
1. Used advance_wal introduced by commit c161ab74f7.
2. Simplified the core logic and new TAP tests.
3. Reworded the comments and docs.
4. Simplified new DEBUG messages.
I've attached the v15 patch for further review.
Per a recent commit c538592, FATAL-ized perl warnings in the newly
added TAP test and attached the v16 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v16-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/octet-stream)
From e1464a4c8f5c7b7baf65e3856aa262bee10c86f2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 3 Jan 2024 11:24:34 +0000
Subject: [PATCH v16] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 47 +++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/040_wal_source_switch.pl | 93 ++++++++++++++
8 files changed, 269 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/040_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f323bba018..5d507029d5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4866,6 +4866,53 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in pg_wal
+ before switching. If the standby fails to switch to stream mode, it
+ falls back to archive mode. If this parameter value is specified
+ without units, it is taken as milliseconds. Default is
+ <literal>5min</literal>. With a lower value for this parameter, the
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, the standby typically switches to
+ stream mode only after receiving WAL from archive finishes (i.e., no
+ more WAL left there) or fails for any reason. This parameter can only
+ be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 9dd52ff275..35926e1df3 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6f4f81f992..e59057558e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3492,8 +3497,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3513,6 +3521,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3536,19 +3550,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3680,9 +3695,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3711,13 +3744,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3725,6 +3768,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3955,6 +4006,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3945a92ddd..74f58fcae4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3211,6 +3211,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b2809c711a..4dd27554aa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc74278..2e9fb8dfe6 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 9d8039684a..06041a5f74 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -45,6 +45,7 @@ tests += {
't/037_invalid_database.pl',
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
+ 't/040_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/040_wal_source_switch.pl b/src/test/recovery/t/040_wal_source_switch.pl
new file mode 100644
index 0000000000..b1ee9d6242
--- /dev/null
+++ b/src/test/recovery/t/040_wal_source_switch.pl
@@ -0,0 +1,93 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf('postgresql.conf', qq(
+ primary_slot_name = 'standby_slot'
+ streaming_replication_retry_interval = 1ms
+ log_min_messages = 'debug1'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$node_primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $node_primary->archive_dir;
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $node_standby->logfile;
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $node_standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
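To put the commit message above in more concrete terms: the configuration the
feature targets is a standby that has both a restore_command (archive
recovery) and a primary_conninfo (streaming) available, with the new GUC set
to something larger than the 1ms used in the test. A hedged sketch in the same
TAP style, with illustrative names and an illustrative 30s interval (the GUC
only exists with this patch applied):

    my $standby = PostgreSQL::Test::Cluster->new('standby_example');
    $standby->init_from_backup($node_primary, $backup_name,
        has_streaming => 1,    # streaming from $node_primary
        has_restoring => 1);   # restore_command pointing at its archive
    $standby->append_conf('postgresql.conf', qq(
        primary_slot_name = 'standby_slot'
        streaming_replication_retry_interval = '30s'
    ));
    $standby->start;

With that in place the standby keeps restoring from the archive as usual, but
roughly every 30 seconds it first drains whatever is already in pg_wal and
then tries the primary, falling back to the archive if the attempt fails.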
On Wed, Jan 3, 2024 at 4:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
I took a closer look at v14 and came up with the following changes:
1. Used advance_wal introduced by commit c161ab74f7.
2. Simplified the core logic and new TAP tests.
3. Reworded the comments and docs.
4. Simplified new DEBUG messages.
I've attached the v15 patch for further review.
Per a recent commit c538592, FATAL-ized perl warnings in the newly
added TAP test and attached the v16 patch.
Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v17 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/x-patch)
From 95a6f6c484dcff68f01ff90d9011abff2f15ad89 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 31 Jan 2024 11:59:02 +0000
Subject: [PATCH v17] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 47 +++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/041_wal_source_switch.pl | 93 ++++++++++++++
8 files changed, 269 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/041_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..f0e45cf49d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4867,6 +4867,53 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in pg_wal
+ before switching. If the standby fails to switch to stream mode, it
+ falls back to archive mode. If this parameter value is specified
+ without units, it is taken as milliseconds. Default is
+ <literal>5min</literal>. With a lower value for this parameter, the
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, the standby typically switches to
+ stream mode only after receiving WAL from archive finishes (i.e., no
+ more WAL left there) or fails for any reason. This parameter can only
+ be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..7f83ea22a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3526,8 +3531,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3547,6 +3555,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3570,19 +3584,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3714,9 +3729,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3745,13 +3778,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3759,6 +3802,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3989,6 +4040,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7fe58518d7..6179107371 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3221,6 +3221,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index da10b43dac..19e8f8f5be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..d891199462 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,7 @@ tests += {
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
+ 't/041_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
new file mode 100644
index 0000000000..b1ee9d6242
--- /dev/null
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -0,0 +1,93 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+$node_standby->append_conf('postgresql.conf', qq(
+ primary_slot_name = 'standby_slot'
+ streaming_replication_retry_interval = 1ms
+ log_min_messages = 'debug1'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$node_primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $node_primary->archive_dir;
+$node_primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $node_standby->logfile;
+
+# Standby now connects to primary during initial recovery after fetching WAL
+# from archive for about streaming_replication_retry_interval milliseconds.
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $node_standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v17 patch.
Strengthened tests a bit by using recovery_min_apply_delay to mimic
standby spending some time fetching from archive. PSA v18 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v18-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/octet-stream)
From 6eb23bc2527a4457fe5b7272240b5e3d7f8a3949 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 19 Feb 2024 09:09:33 +0000
Subject: [PATCH v18] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 47 +++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/041_wal_source_switch.pl | 108 ++++++++++++++++
8 files changed, 284 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/041_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffd711b7f2..e9a4f3062c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4872,6 +4872,53 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in pg_wal
+ before switching. If the standby fails to switch to stream mode, it
+ falls back to archive mode. If this parameter value is specified
+ without units, it is taken as milliseconds. Default is
+ <literal>5min</literal>. With a lower value for this parameter, the
+ standby makes frequent WAL source switch attempts. To avoid this, it is
+ recommended to set a reasonable value. A setting of <literal>0</literal>
+ disables the feature. When disabled, the standby typically switches to
+ stream mode only after receiving WAL from archive finishes (i.e., no
+ more WAL left there) or fails for any reason. This parameter can only
+ be set in the <filename>postgresql.conf</filename> file or on the
+ server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..7f83ea22a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3526,8 +3531,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3547,6 +3555,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3570,19 +3584,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3714,9 +3729,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3745,13 +3778,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3759,6 +3802,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3989,6 +4040,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 70652f0a3f..8c835a37a3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3232,6 +3232,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e10755972a..1aec929dbc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..d891199462 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,7 @@ tests += {
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
+ 't/041_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
new file mode 100644
index 0000000000..a664547948
--- /dev/null
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -0,0 +1,108 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
+$primary->start;
+
+$primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$primary->backup($backup_name);
+
+# Create a streaming standby
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup(
+ $primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$standby->append_conf('postgresql.conf', qq(
+ primary_slot_name = 'standby_slot'
+ streaming_replication_retry_interval = '${retry_interval}ms'
+ log_min_messages = 'debug1'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $primary->archive_dir;
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $apply_delay = $retry_interval * 5;
+$standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${apply_delay}ms'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+$standby->stop;
+$primary->stop;
+
+done_testing();
--
2.34.1
On Mon, 19 Feb 2024 at 18:36, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v17 patch.
Strengthened tests a bit by using recovery_min_apply_delay to mimic
standby spending some time fetching from archive. PSA v18 patch.
Here are some minor comments:
[1]:
+ primary). However, the standby exhausts all the WAL present in pg_wal
+ primary). However, the standby exhausts all the WAL present in pg_wal
s|pg_wal|<filename>pg_wal</filename>|g
[2]
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
Keeping the same indentation might be better.
On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
Strengthened tests a bit by using recovery_min_apply_delay to mimic
standby spending some time fetching from archive. PSA v18 patch.
Here are some minor comments:
Thanks for taking a look at it.
[1]
+ primary). However, the standby exhausts all the WAL present in pg_wal
s|pg_wal|<filename>pg_wal</filename>|g
Done.
[2]
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
Keeping the same indentation might be better.
The autovacuum line looks mis-indented in the patch file. However, I
now ran src/tools/pgindent/perltidyrc
src/test/recovery/t/041_wal_source_switch.pl on it.
Please see the attached v19 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v19-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From 910220fcd8e186c2e79990158bff9972cc9c604a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 05:37:24 +0000
Subject: [PATCH v19] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 48 ++++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/041_wal_source_switch.pl | 110 +++++++++++++++++
8 files changed, 287 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/041_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffd711b7f2..2fbf1ad6e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4872,6 +4872,54 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as
+ milliseconds. Default is <literal>5min</literal>. With a lower value
+ for this parameter, the standby makes frequent WAL source switch
+ attempts. To avoid this, it is recommended to set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode only after receiving WAL
+ from archive finishes (i.e., no more WAL left there) or fails for any
+ reason. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command
+ line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..7f83ea22a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3526,8 +3531,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3547,6 +3555,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3570,19 +3584,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3714,9 +3729,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3745,13 +3778,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3759,6 +3802,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3989,6 +4040,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 70652f0a3f..8c835a37a3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3232,6 +3232,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e10755972a..1aec929dbc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..d891199462 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,7 @@ tests += {
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
+ 't/041_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
new file mode 100644
index 0000000000..082680bf4a
--- /dev/null
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -0,0 +1,110 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf(
+ 'postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
+$primary->start;
+
+$primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$primary->backup($backup_name);
+
+# Create a streaming standby
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup(
+ $primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$standby->append_conf(
+ 'postgresql.conf', qq(
+ primary_slot_name = 'standby_slot'
+ streaming_replication_retry_interval = '${retry_interval}ms'
+ log_min_messages = 'debug1'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $primary->archive_dir;
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $apply_delay = $retry_interval * 5;
+$standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${apply_delay}ms'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+$standby->stop;
+$primary->stop;
+
+done_testing();
--
2.34.1
On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
[2]
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
Keeping the same indentation might be better.
The autovacuum line looks mis-indented in the patch file. However, I
now ran src/tools/pgindent/perltidyrc
src/test/recovery/t/041_wal_source_switch.pl on it.
Thanks for updating the patch. It seems the indentation is still wrong.
Attachments:
fix-indent.patch
diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
index 082680bf4a..b5eddba1d5 100644
--- a/src/test/recovery/t/041_wal_source_switch.pl
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -18,9 +18,9 @@ $primary->init(
# Ensure checkpoint doesn't come in our way
$primary->append_conf(
'postgresql.conf', qq(
- min_wal_size = 2MB
- max_wal_size = 1GB
- checkpoint_timeout = 1h
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
autovacuum = off
));
$primary->start;
@@ -85,7 +85,7 @@ my $offset = -s $standby->logfile;
my $apply_delay = $retry_interval * 5;
$standby->append_conf(
'postgresql.conf', qq(
-recovery_min_apply_delay = '${apply_delay}ms'
+ recovery_min_apply_delay = '${apply_delay}ms'
));
$standby->start;
On Tue, Feb 20, 2024 at 11:54 AM Japin Li <japinli@hotmail.com> wrote:
On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
[2]
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+ min_wal_size = 2MB
+ max_wal_size = 1GB
+ checkpoint_timeout = 1h
+ autovacuum = off
+));
Keeping the same indentation might be better.
The autovacuum line looks mis-indented in the patch file. However, I
now ran src/tools/pgindent/perltidyrc
src/test/recovery/t/041_wal_source_switch.pl on it.
Thanks for updating the patch. It seems the indentation is still wrong.
Thanks. perltidy (using src/tools/pgindent/perltidyrc) didn't complain
about anything in v19. However, I kept the alignment the same as in
other TAP tests for multi-line append_conf.
If that's not correct, I'll leave it to the committer to decide. PSA
v20 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v20-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From beea9f0b8bbc76bb48dd1a5d64a5b52bafd09e6f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 07:31:29 +0000
Subject: [PATCH v20] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 48 ++++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 115 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 4 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/041_wal_source_switch.pl | 110 +++++++++++++++++
8 files changed, 287 insertions(+), 19 deletions(-)
create mode 100644 src/test/recovery/t/041_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffd711b7f2..2fbf1ad6e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4872,6 +4872,54 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as
+ milliseconds. Default is <literal>5min</literal>. With a lower value
+ for this parameter, the standby makes frequent WAL source switch
+ attempts. To avoid this, it is recommended to set a reasonable value.
+ A setting of <literal>0</literal> disables the feature. When disabled,
+ the standby typically switches to stream mode only after receiving WAL
+ from archive finishes (i.e., no more WAL left there) or fails for any
+ reason. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command
+ line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc.. All of these can impact the recovery
+ performance on standby and increase the replication lag on primary. In
+ addition, the primary keeps accumulating WAL needed for the standby
+ while the standby reads WAL from archive because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..7f83ea22a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 300000;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3526,8 +3531,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3547,6 +3555,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval milliseconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3570,19 +3584,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
- * advance the state machine if so. The failure to read might've
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to primary,
+ * and advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
switch (currentSource)
@@ -3714,9 +3729,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3745,13 +3778,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3759,6 +3802,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -3989,6 +4040,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc.. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 70652f0a3f..8c835a37a3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3232,6 +3232,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_MS
+ },
+ &streaming_replication_retry_interval,
+ 300000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e10755972a..1aec929dbc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -360,6 +360,10 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..d891199462 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,7 @@ tests += {
't/038_save_logical_slots_shutdown.pl',
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
+ 't/041_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
new file mode 100644
index 0000000000..1a9637ec86
--- /dev/null
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -0,0 +1,110 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+# Test for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Initialize primary node, setting wal-segsize to 1MB
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(
+ allows_streaming => 1,
+ has_archiving => 1,
+ extra => ['--wal-segsize=1']);
+
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf(
+ 'postgresql.conf', qq(
+min_wal_size = 2MB
+max_wal_size = 1GB
+checkpoint_timeout = 1h
+autovacuum = off
+));
+$primary->start;
+
+$primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# Take backup
+my $backup_name = 'my_backup';
+$primary->backup($backup_name);
+
+# Create a streaming standby
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup(
+ $primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'standby_slot'
+streaming_replication_retry_interval = '${retry_interval}ms'
+log_min_messages = 'debug1'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->stop;
+
+# Advance WAL by 5 segments (= 5MB) on primary
+$primary->advance_wal(5);
+
+# Wait for primary to generate requested WAL files
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_waldir();")
+ or die "Timed out while waiting for primary to generate WAL";
+
+# Wait until generated WAL files have been stored on the archives of the
+# primary. This ensures that the standby created below will be able to restore
+# the WAL files.
+my $primary_archive = $primary->archive_dir;
+$primary->poll_query_until('postgres',
+ "SELECT COUNT(*) >= 5 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+# Generate some data on the primary
+$primary->safe_psql('postgres',
+ "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,5) AS a;");
+
+my $offset = -s $standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $apply_delay = $retry_interval * 5;
+$standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${apply_delay}ms'
+));
+$standby->start;
+
+# Wait until standby has replayed enough data
+$primary->wait_for_catchup($standby);
+
+$standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $result =
+ $standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;");
+is($result, '5', 'data from primary is streamed to standby');
+
+$standby->stop;
+$primary->stop;
+
+done_testing();
--
2.34.1
cfbot claims that this one needs another rebase.
I've spent some time thinking about this one. I'll admit I'm a bit worried
about adding more complexity to this state machine, but I also haven't
thought of any other viable approaches, and this still seems like a useful
feature. So, for now, I think we should continue with the current
approach.
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as
+ milliseconds. Default is <literal>5min</literal>. With a lower value
Does this really need to be milliseconds? I would think that any
reasonable setting would be at least on the order of seconds.
+ attempts. To avoid this, it is recommended to set a reasonable value.
I think we might want to suggest what a "reasonable value" is.
+ static bool canSwitchSource = false;
+ bool switchSource = false;
IIUC "canSwitchSource" indicates that we are trying to force a switch to
streaming, but we are currently exhausting anything that's present in the
pg_wal directory, while "switchSource" indicates that we should force a
switch to streaming right now. Furthermore, "canSwitchSource" is static
while "switchSource" is not. Is there any way to simplify this? For
example, would it be possible to make an enum that tracks the
streaming_replication_retry_interval state?
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
Why do we skip this when "switchSource" is set?
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE? Is
there no way it could be XLOG_FROM_PG_WAL?
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
I think we might want to turn this feature off by default, at least for the
first release.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
cfbot claims that this one needs another rebase.
Yeah, the conflict was with the new TAP test file name in
src/test/recovery/meson.build.
I've spent some time thinking about this one. I'll admit I'm a bit worried
about adding more complexity to this state machine, but I also haven't
thought of any other viable approaches,
Right. I understand that WaitForWALToBecomeAvailable()'s state
machine is a complex piece of code.
and this still seems like a useful
feature. So, for now, I think we should continue with the current
approach.
Yes, the feature is useful, as mentioned in the docs below:
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc. All of these can impact the recovery
+ performance on standby, and can increase the replication lag on
+ primary. In addition, the primary keeps accumulating WAL needed for the
+ standby while the standby reads WAL from archive, because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as
+ milliseconds. Default is <literal>5min</literal>. With a lower value
Does this really need to be milliseconds? I would think that any
reasonable setting would be at least on the order of seconds.
Agreed. Done that way.
+ attempts. To avoid this, it is recommended to set a reasonable value.
I think we might want to suggest what a "reasonable value" is.
It really depends on the WAL generation rate on the primary. If WAL
files accumulate quickly, the disk runs out of space sooner, so a
lower value that makes more frequent WAL source switch attempts can
help. It's hard to suggest a one-size-fits-all value, so I've tweaked
the docs a bit to say that it depends on the WAL generation rate.
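For illustration only (not part of the patch), one rough way to gauge that rate is to sample pg_current_wal_lsn() on the primary twice, a known interval apart, and diff the two; the LSN literal below is just a placeholder for the first sample's result:
-- Purely illustrative: estimate the WAL generation rate on the primary.
SELECT pg_current_wal_lsn();   -- first sample, note the value returned
-- ... wait a known interval, say 60 seconds ...
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/4D963F58'::pg_lsn))
       AS wal_generated_in_interval;   -- '0/4D963F58' is the placeholder first sample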
+ static bool canSwitchSource = false;
+ bool switchSource = false;
IIUC "canSwitchSource" indicates that we are trying to force a switch to
streaming, but we are currently exhausting anything that's present in the
pg_wal directory,
Right.
while "switchSource" indicates that we should force a
switch to streaming right now.
It's not indicating a forced switch; it means "I was previously asked
to switch the source via canSwitchSource, and now that I've exhausted
all the WAL in the pg_wal directory, I'll make the source switch
attempt".
Furthermore, "canSwitchSource" is static
while "switchSource" is not.
This is because WaitForWALToBecomeAvailable() has to remember the
decision (that streaming_replication_retry_interval has elapsed)
across calls, whereas switchSource is decided afresh within
WaitForWALToBecomeAvailable() on every call.
Is there any way to simplify this? For
example, would it be possible to make an enum that tracks the
streaming_replication_retry_interval state?
IMHO the way it is right now is simple enough. If the suggestion is to
have an enum like the one below, it looks like overkill for just two states.
typedef enum
{
CAN_SWITCH_SOURCE,
SWITCH_SOURCE
} XLogSourceSwitchState;
/*
* Don't allow any retry loops to occur during nonblocking
- * readahead. Let the caller process everything that has been
- * decoded already first.
+ * readahead if we failed to read from the current source. Let the
+ * caller process everything that has been decoded already first.
*/
- if (nonblocking)
+ if (nonblocking && lastSourceFailed)
return XLREAD_WOULDBLOCK;
Why do we skip this when "switchSource" is set?
It was a leftover from the initial version of the patch; I was
encountering an issue back then and had that piece there. Removed it now.
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE? Is
there no way it could be XLOG_FROM_PG_WAL?
No. switchSource is set to true only when canSwitchSource is set to
true, which happens only when currentSource is XLOG_FROM_ARCHIVE (see
SwitchWALSourceToPrimary()).
+#streaming_replication_retry_interval = 5min # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication
+ # in milliseconds; 0 disables
I think we might want to turn this feature off by default, at least for the
first release.
Agreed. Done that way.
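For illustration only, with the default now off, a standby that wants this behaviour could opt in with something like the following (run on the standby; the '1min' value is just an example, and the GUC exists only with this patch applied):
-- Enable the WAL source switch; the GUC is PGC_SIGHUP, so a reload suffices.
ALTER SYSTEM SET streaming_replication_retry_interval = '1min';
SELECT pg_reload_conf();
-- Later, confirm the standby has actually switched to streaming:
SELECT status, sender_host, slot_name FROM pg_stat_wal_receiver;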
Please see the attached v21 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v21-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From 82eb49593a563295a7370aa1d87db94c8aa313db Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 5 Mar 2024 17:37:50 +0000
Subject: [PATCH v21] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 49 ++++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 107 +++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/042_wal_source_switch.pl | 113 ++++++++++++++++++
8 files changed, 286 insertions(+), 15 deletions(-)
create mode 100644 src/test/recovery/t/042_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b38cbd714a..02e79f32fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5011,6 +5011,55 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as seconds.
+ With a lower value for this parameter, the standby makes frequent WAL
+ source switch attempts. To avoid this, it is recommended to set a
+ value depending on the rate of WAL generation on the primary. If the
+ WAL files grow faster, the disk runs out of space sooner, so setting a
+ value to make frequent WAL source switch attempts can help. The default
+ is zero, disabling this feature. When disabled, the standby typically
+ switches to stream mode only after receiving WAL from archive finishes
+ (i.e., no more WAL left there) or fails for any reason. This parameter
+ can only be set in the <filename>postgresql.conf</filename> file or on
+ the server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc. All of these can impact the recovery
+ performance on standby, and can increase the replication lag on
+ primary. In addition, the primary keeps accumulating WAL needed for the
+ standby while the standby reads WAL from archive, because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 853b540945..ca73234695 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 0;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,8 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +443,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static bool SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3541,8 +3546,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool nonblocking)
{
static TimestampTz last_fail_time = 0;
+ static bool canSwitchSource = false;
+ bool switchSource = false;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3562,6 +3570,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval seconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3585,12 +3599,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to stream, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed || switchSource)
{
/*
* Don't allow any retry loops to occur during nonblocking
@@ -3729,9 +3744,27 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ (switchSource ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (switchSource)
+ {
+ Assert(canSwitchSource);
+ Assert(currentSource == XLOG_FROM_STREAM);
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ switchSource = false;
+ canSwitchSource = false;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3760,13 +3793,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (!canSwitchSource)
+ canSwitchSource = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (canSwitchSource)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3774,6 +3817,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (canSwitchSource)
+ switchSource = true;
+
break;
case XLOG_FROM_STREAM:
@@ -4004,6 +4055,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static bool
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return false;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return false;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval * 1000))
+ return true;
+
+ return false;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 45013582a7..e54d82dd1c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3273,6 +3273,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_S
+ },
+ &streaming_replication_retry_interval,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index edcc0282b2..6f87209d62 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -369,6 +369,9 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
#sync_replication_slots = off # enables slot synchronization on the physical standby from the primary
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e..3a8ecd5e54 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,7 @@ tests += {
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
't/041_checkpoint_at_promote.pl',
+ 't/042_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/042_wal_source_switch.pl b/src/test/recovery/t/042_wal_source_switch.pl
new file mode 100644
index 0000000000..b00ed29f73
--- /dev/null
+++ b/src/test/recovery/t/042_wal_source_switch.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Checks for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+checkpoint_timeout = 1h
+autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# And some content
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'standby_slot'
+streaming_replication_retry_interval = '${retry_interval}s'
+log_min_messages = 'debug2'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate some data on the primary while the standby is down
+$node_standby->stop;
+for my $i (1 .. 10)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(11, 20));");
+ $node_primary->safe_psql('postgres', "SELECT pg_switch_wal();");
+}
+
+# Now wait for replay to complete on standby. We're done waiting when the
+# standby has replayed up to the previously saved primary LSN.
+my $cur_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Generate 1 more WAL file so that we wait predictably for the archiving of
+# all WAL files.
+$node_primary->advance_wal(1);
+
+my $walfile_name =
+ $node_primary->safe_psql('postgres', "SELECT pg_walfile_name('$cur_lsn')");
+
+$node_primary->poll_query_until('postgres',
+ "SELECT count(*) = 1 FROM pg_stat_archiver WHERE last_archived_wal = '$walfile_name';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+my $offset = -s $node_standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $delay = $retry_interval * 5;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${delay}s'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $row_cnt1 =
+ $node_primary->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+
+my $row_cnt2 =
+ $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+is($row_cnt1, $row_cnt2, 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
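The core timing decision in the patch above reduces to an elapsed-time check. Below is a minimal standalone sketch of that check, assuming PostgreSQL's timestamp helpers; note that TimestampDifferenceExceeds() takes its threshold in milliseconds, hence the * 1000 for a GUC stored in seconds. The function name RetryIntervalElapsed is illustrative only and is not part of the patch.

#include "utils/timestamp.h"

/*
 * Sketch: has at least retry_interval_secs seconds passed since the standby
 * switched its WAL source to archive?  This mirrors the check done by the
 * patch's SwitchWALSourceToPrimary(); the helper name here is hypothetical.
 */
static bool
RetryIntervalElapsed(TimestampTz switched_to_archive_at, int retry_interval_secs)
{
	TimestampTz now = GetCurrentTimestamp();

	/* Feature disabled, or no baseline timestamp recorded yet */
	if (retry_interval_secs <= 0 || switched_to_archive_at == 0)
		return false;

	/* TimestampDifferenceExceeds() expects the threshold in milliseconds */
	return TimestampDifferenceExceeds(switched_to_archive_at, now,
									  retry_interval_secs * 1000);
}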
On Tue, Mar 05, 2024 at 11:38:37PM +0530, Bharath Rupireddy wrote:
On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Is there any way to simplify this? For
example, would it be possible to make an enum that tracks the
streaming_replication_retry_interval state?

I guess the way it is right now looks simple IMHO. If the suggestion
is to have an enum like below; it looks overkill for just two states.

typedef enum
{
CAN_SWITCH_SOURCE,
SWITCH_SOURCE
} XLogSourceSwitchState;
I was thinking of something more like
typedef enum
{
NO_FORCE_SWITCH_TO_STREAMING, /* no switch necessary */
FORCE_SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
FORCE_SWITCH_TO_STREAMING, /* switch to streaming now */
} WALSourceSwitchState;
At least, that illustrates my mental model of the process here. IMHO
that's easier to follow than two similarly-named bool variables.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
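A condensed sketch of the state flow described in the message above, using the enum names proposed there; this only illustrates the mental model and is not code from any posted patch (in the patches the transitions happen inline in WaitForWALToBecomeAvailable(), and AdvanceSwitchState is a hypothetical helper).

typedef enum
{
	NO_FORCE_SWITCH_TO_STREAMING,		/* no switch necessary */
	FORCE_SWITCH_TO_STREAMING_PENDING,	/* interval elapsed; exhausting pg_wal */
	FORCE_SWITCH_TO_STREAMING,			/* pg_wal drained; switch to streaming now */
} WALSourceSwitchState;

/*
 * Sketch of the transitions.  The v21 booleans map onto this roughly as
 * canSwitchSource ~ (state != NO_FORCE_SWITCH_TO_STREAMING) and
 * switchSource ~ (state == FORCE_SWITCH_TO_STREAMING).
 */
static WALSourceSwitchState
AdvanceSwitchState(WALSourceSwitchState state,
				   bool interval_elapsed, bool pg_wal_exhausted)
{
	switch (state)
	{
		case NO_FORCE_SWITCH_TO_STREAMING:
			return interval_elapsed ?
				FORCE_SWITCH_TO_STREAMING_PENDING : state;
		case FORCE_SWITCH_TO_STREAMING_PENDING:
			return pg_wal_exhausted ?
				FORCE_SWITCH_TO_STREAMING : state;
		case FORCE_SWITCH_TO_STREAMING:
			/* reset once the switch to streaming has been made */
			return NO_FORCE_SWITCH_TO_STREAMING;
	}
	return state;
}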
On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I was thinking of something more like
typedef enum
{
NO_FORCE_SWITCH_TO_STREAMING, /* no switch necessary */
FORCE_SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
FORCE_SWITCH_TO_STREAMING, /* switch to streaming now */
} WALSourceSwitchState;

At least, that illustrates my mental model of the process here. IMHO
that's easier to follow than two similarly-named bool variables.
I played with that idea and it came out very nice. Please see the
attached v22 patch. Note that personally I didn't like "FORCE" being
there in the names, so I've simplified them a bit.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v22-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/x-patch)
From f2c9d1a170ce4cc536ce9139d7a2e0fdc268be69 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 6 Mar 2024 04:12:26 +0000
Subject: [PATCH v22] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 49 ++++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 117 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/042_wal_source_switch.pl | 113 +++++++++++++++++
8 files changed, 296 insertions(+), 15 deletions(-)
create mode 100644 src/test/recovery/t/042_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b38cbd714a..02e79f32fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5011,6 +5011,55 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as seconds.
+ With a lower value for this parameter, the standby makes frequent WAL
+ source switch attempts. To avoid this, it is recommended to set a
+ value depending on the rate of WAL generation on the primary. If the
+ WAL files grow faster, the disk runs out of space sooner, so setting a
+ value to make frequent WAL source switch attempts can help. The default
+ is zero, disabling this feature. When disabled, the standby typically
+ switches to stream mode only after receiving WAL from archive finishes
+ (i.e., no more WAL left there) or fails for any reason. This parameter
+ can only be set in the <filename>postgresql.conf</filename> file or on
+ the server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc. All of these can impact the recovery
+ performance on standby, and can increase the replication lag on
+ primary. In addition, the primary keeps accumulating WAL needed for the
+ standby while the standby reads WAL from archive, because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 236c0af65f..ab2e4293bf 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 853b540945..14e846883e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 0;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,20 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Represents WAL source switch states.
+ */
+typedef enum
+{
+ SWITCH_TO_STREAMING_NONE, /* no switch necessary */
+ SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
+ SWITCH_TO_STREAMING, /* switch to streaming now */
+} WALSourceSwitchState;
+
+static WALSourceSwitchState wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +455,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static WALSourceSwitchState SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3543,6 +3560,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static TimestampTz last_fail_time = 0;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3562,6 +3580,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval seconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3585,12 +3609,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to stream, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed ||
+ wal_source_switch_state == SWITCH_TO_STREAMING)
{
/*
* Don't allow any retry loops to occur during nonblocking
@@ -3729,9 +3755,26 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ ((wal_source_switch_state == SWITCH_TO_STREAMING) ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING)
+ {
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ Assert(currentSource == XLOG_FROM_STREAM);
+
+ wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3760,13 +3803,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+ wal_source_switch_state = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3774,6 +3827,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ wal_source_switch_state = SWITCH_TO_STREAMING;
+
break;
case XLOG_FROM_STREAM:
@@ -4004,6 +4065,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static WALSourceSwitchState
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return SWITCH_TO_STREAMING_NONE;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return SWITCH_TO_STREAMING_NONE;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval * 1000))
+ return SWITCH_TO_STREAMING_PENDING;
+
+ return SWITCH_TO_STREAMING_NONE;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 45013582a7..e54d82dd1c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3273,6 +3273,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_S
+ },
+ &streaming_replication_retry_interval,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index edcc0282b2..6f87209d62 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -369,6 +369,9 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
#sync_replication_slots = off # enables slot synchronization on the physical standby from the primary
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e..3a8ecd5e54 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,7 @@ tests += {
't/039_end_of_wal.pl',
't/040_standby_failover_slots_sync.pl',
't/041_checkpoint_at_promote.pl',
+ 't/042_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/042_wal_source_switch.pl b/src/test/recovery/t/042_wal_source_switch.pl
new file mode 100644
index 0000000000..b00ed29f73
--- /dev/null
+++ b/src/test/recovery/t/042_wal_source_switch.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Checks for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+checkpoint_timeout = 1h
+autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# And some content
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'standby_slot'
+streaming_replication_retry_interval = '${retry_interval}s'
+log_min_messages = 'debug2'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate some data on the primary while the standby is down
+$node_standby->stop;
+for my $i (1 .. 10)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(11, 20));");
+ $node_primary->safe_psql('postgres', "SELECT pg_switch_wal();");
+}
+
+# Now wait for replay to complete on standby. We're done waiting when the
+# standby has replayed up to the previously saved primary LSN.
+my $cur_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Generate 1 more WAL file so that we wait predictably for the archiving of
+# all WAL files.
+$node_primary->advance_wal(1);
+
+my $walfile_name =
+ $node_primary->safe_psql('postgres', "SELECT pg_walfile_name('$cur_lsn')");
+
+$node_primary->poll_query_until('postgres',
+ "SELECT count(*) = 1 FROM pg_stat_archiver WHERE last_archived_wal = '$walfile_name';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+my $offset = -s $node_standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $delay = $retry_interval * 5;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${delay}s'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $row_cnt1 =
+ $node_primary->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+
+my $row_cnt2 =
+ $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+is($row_cnt1, $row_cnt2, 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
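To summarize the read-source selection the v22 patch performs while the state machine is in XLOG_FROM_ARCHIVE, here is the logic condensed into a standalone helper. This is a sketch only; the patch keeps the logic inline, and ChooseReadSource is not a real function.

/*
 * Sketch: which locations to consult for the next WAL segment while in
 * XLOG_FROM_ARCHIVE (state names follow the v22 patch).
 */
static XLogSource
ChooseReadSource(WALSourceSwitchState state, XLogSource currentSource)
{
	/* Before switching to streaming, drain whatever is left in pg_wal */
	if (state == SWITCH_TO_STREAMING_PENDING)
		return XLOG_FROM_PG_WAL;

	/* Otherwise behave as before: archive readers also look at pg_wal */
	return (currentSource == XLOG_FROM_ARCHIVE) ?
		XLOG_FROM_ANY : currentSource;
}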
On Wed, Mar 06, 2024 at 10:02:43AM +0530, Bharath Rupireddy wrote:
On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I was thinking of something more like
typedef enum
{
NO_FORCE_SWITCH_TO_STREAMING, /* no switch necessary */
FORCE_SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
FORCE_SWITCH_TO_STREAMING, /* switch to streaming now */
} WALSourceSwitchState;

At least, that illustrates my mental model of the process here. IMHO
that's easier to follow than two similarly-named bool variables.

I played with that idea and it came out very nice. Please see the
attached v22 patch. Note that personally I didn't like "FORCE" being
there in the names, so I've simplified them a bit.
Thanks. I'd like to spend some time testing this, but from a glance, the
code appears to be in decent shape.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 9:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I played with that idea and it came out very nice. Please see the
attached v22 patch. Note that personally I didn't like "FORCE" being
there in the names, so I've simplified them a bit.

Thanks. I'd like to spend some time testing this, but from a glance, the
code appears to be in decent shape.
Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
conflict in meson.build. Please see the attached v23 patch.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v23-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch (application/x-patch)
From b451e18b7b1f91eb3a6857efe552e7fe97cc7f39 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sun, 17 Mar 2024 05:48:38 +0000
Subject: [PATCH v23] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 49 ++++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 117 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/043_wal_source_switch.pl | 113 +++++++++++++++++
8 files changed, 296 insertions(+), 15 deletions(-)
create mode 100644 src/test/recovery/t/043_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..40c2ae93d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5050,6 +5050,55 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as seconds.
+ With a lower value for this parameter, the standby makes frequent WAL
+ source switch attempts. To avoid this, it is recommended to set a
+ value depending on the rate of WAL generation on the primary. If the
+ WAL files grow faster, the disk runs out of space sooner, so setting a
+ value to make frequent WAL source switch attempts can help. The default
+ is zero, disabling this feature. When disabled, the standby typically
+ switches to stream mode only after receiving WAL from archive finishes
+ (i.e., no more WAL left there) or fails for any reason. This parameter
+ can only be set in the <filename>postgresql.conf</filename> file or on
+ the server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc. All of these can impact the recovery
+ performance on standby, and can increase the replication lag on
+ primary. In addition, the primary keeps accumulating WAL needed for the
+ standby while the standby reads WAL from archive, because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b48209fc2f..a4e555c6a1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..6f696b3f7b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 0;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,20 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Represents WAL source switch states.
+ */
+typedef enum
+{
+ SWITCH_TO_STREAMING_NONE, /* no switch necessary */
+ SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
+ SWITCH_TO_STREAMING, /* switch to streaming now */
+} WALSourceSwitchState;
+
+static WALSourceSwitchState wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +455,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static WALSourceSwitchState SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3546,6 +3563,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static TimestampTz last_fail_time = 0;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3565,6 +3583,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval seconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3588,12 +3612,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to stream, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed ||
+ wal_source_switch_state == SWITCH_TO_STREAMING)
{
/*
* Don't allow any retry loops to occur during nonblocking
@@ -3732,9 +3758,26 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ ((wal_source_switch_state == SWITCH_TO_STREAMING) ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING)
+ {
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ Assert(currentSource == XLOG_FROM_STREAM);
+
+ wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3763,13 +3806,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+ wal_source_switch_state = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3777,6 +3830,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ wal_source_switch_state = SWITCH_TO_STREAMING;
+
break;
case XLOG_FROM_STREAM:
@@ -4007,6 +4068,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static WALSourceSwitchState
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return SWITCH_TO_STREAMING_NONE;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return SWITCH_TO_STREAMING_NONE;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval * 1000))
+ return SWITCH_TO_STREAMING_PENDING;
+
+ return SWITCH_TO_STREAMING_NONE;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..d206cb0849 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3273,6 +3273,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_S
+ },
+ &streaming_replication_retry_interval,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..deb4be7809 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,9 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
#sync_replication_slots = off # enables slot synchronization on the physical standby from the primary
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..fdf5814a22 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
't/040_standby_failover_slots_sync.pl',
't/041_checkpoint_at_promote.pl',
't/042_low_level_backup.pl',
+ 't/043_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/043_wal_source_switch.pl b/src/test/recovery/t/043_wal_source_switch.pl
new file mode 100644
index 0000000000..b00ed29f73
--- /dev/null
+++ b/src/test/recovery/t/043_wal_source_switch.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Checks for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+checkpoint_timeout = 1h
+autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# And some content
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'standby_slot'
+streaming_replication_retry_interval = '${retry_interval}s'
+log_min_messages = 'debug2'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate some data on the primary while the standby is down
+$node_standby->stop;
+for my $i (1 .. 10)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(11, 20));");
+ $node_primary->safe_psql('postgres', "SELECT pg_switch_wal();");
+}
+
+# Now wait for replay to complete on standby. We're done waiting when the
+# standby has replayed up to the previously saved primary LSN.
+my $cur_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Generate 1 more WAL file so that we wait predictably for the archiving of
+# all WAL files.
+$node_primary->advance_wal(1);
+
+my $walfile_name =
+ $node_primary->safe_psql('postgres', "SELECT pg_walfile_name('$cur_lsn')");
+
+$node_primary->poll_query_until('postgres',
+ "SELECT count(*) = 1 FROM pg_stat_archiver WHERE last_archived_wal = '$walfile_name';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+my $offset = -s $node_standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $delay = $retry_interval * 5;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${delay}s'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $row_cnt1 =
+ $node_primary->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+
+my $row_cnt2 =
+ $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+is($row_cnt1, $row_cnt2, 'data from primary is streamed to standby');
+
+done_testing();
--
2.34.1
On Sun, Mar 17, 2024 at 11:37:58AM +0530, Bharath Rupireddy wrote:
Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
conflict in meson.build. Please see the attached v23 patch.
I've been reading this patch, and this is a very tricky one. Please
be *very* cautious.
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables
This stuff allows a minimal retry interval of 1s. Could it be useful
to have more responsiveness here and allow lower values than that?
Why not switch the units to milliseconds?
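For illustration, that would be little more than swapping the unit flag
in the GUC entry the patch adds (a sketch only, reusing the entry from
the patch; with GUC_UNIT_MS the value handed to
TimestampDifferenceExceeds() in SwitchWALSourceToPrimary() would be used
as-is instead of being multiplied by 1000):

    {
        {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
            gettext_noop("Sets the time after which standby attempts to switch WAL "
                         "source from archive to streaming replication."),
            gettext_noop("0 turns this feature off."),
            GUC_UNIT_MS     /* instead of GUC_UNIT_S */
        },
        &streaming_replication_retry_interval,
        0, 0, INT_MAX,
        NULL, NULL, NULL
    },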
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return SWITCH_TO_STREAMING_NONE;
Hmm. Perhaps this should mention why we don't care about the
consistent point.
+ /* See if we can switch WAL source to streaming */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+ wal_source_switch_state = SwitchWALSourceToPrimary();
Rather than a routine that returns the value to use for the
GUC, I'd suggest letting this routine set the GUC, as there is only one
caller of SwitchWALSourceToPrimary(). This can also include a check
on SWITCH_TO_STREAMING_NONE, based on what I'm reading there.
- if (lastSourceFailed)
+ if (lastSourceFailed ||
+ wal_source_switch_state == SWITCH_TO_STREAMING)
Hmm. This one may be tricky. I'd recommend a separation between the
failure in reading from a source and the switch to a new "forced"
source.
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
WALSourceSwitchState looks confusing here, and are you sure that this
is actually correct? Shouldn't we still try a READ_FROM_ANY or a read
from the archives even with a streaming switch pending?
By the way, I am not convinced that what you have is the best
interface ever. This assumes that we'd always want to switch to
streaming more aggressively. Could there be a point in also
controlling if we should switch to pg_wal/ or just to archiving more
aggressively as well, aka be able to do the opposite switch of WAL
source? This design looks somewhat limited to me. The origin of the
issue is that we don't have a way to control the order of the sources
consumed by WAL replay. Perhaps something like a replay_source_order
that uses a list would be better, with elements settable to archive,
pg_wal and streaming?
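For the sake of discussion, a rough sketch of such a GUC;
replay_source_order, its storage variable and its check hook are all
hypothetical names, nothing that exists in the patch:

    /* Hypothetical sketch only, not part of the proposed patch. */
    static char *replay_source_order_string;

    /* In guc_tables.c, ConfigureNamesString[]: */
    {
        {"replay_source_order", PGC_SIGHUP, REPLICATION_STANDBY,
            gettext_noop("Sets the order in which WAL sources are tried by a standby."),
            NULL,
            GUC_LIST_INPUT
        },
        &replay_source_order_string,
        "archive, pg_wal, stream",              /* roughly today's behaviour */
        check_replay_source_order, NULL, NULL   /* hook validating the list */
    },

The check hook would split the list (say with SplitIdentifierString())
and map each element to an XLogSource, and WaitForWALToBecomeAvailable()
would walk that order instead of its current hard-coded
archive/pg_wal-then-stream sequence.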
--
Michael
On Mon, Mar 18, 2024 at 11:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Mar 17, 2024 at 11:37:58AM +0530, Bharath Rupireddy wrote:
Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
conflict in meson.build. Please see the attached v23 patch.

I've been reading this patch, and this is a very tricky one. Please
be *very* cautious.
Thanks for looking into this.
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables

This stuff allows a minimal retry interval of 1s. Could it be useful
to have more responsiveness here and allow lower values than that?
Why not switch the units to milliseconds?
Nathan had a different view on this, preferring to have it on the order
of seconds - /messages/by-id/20240305020452.GA3373526@nathanxps13.
If set to too low a value, the frequency of the standby trying to connect
to the primary increases. IMO, the order of seconds seems fine.
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return SWITCH_TO_STREAMING_NONE;

Hmm. Perhaps this should mention why we don't care about the
consistent point.
Are you asking why we don't care whether the standby reached a
consistent point when switching to streaming mode due to this new
parameter? If this is the ask, the same applies when a standby
typically switches to streaming replication (getting WAL
from the primary) today, that is, when receiving from the WAL archive
finishes (no more WAL left there) or fails for any reason. The standby
doesn't care about the consistent point even today; it just trusts the
WAL source and makes the switch.
+ /* See if we can switch WAL source to streaming */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+ wal_source_switch_state = SwitchWALSourceToPrimary();

Rather than a routine that returns the value to use for the
GUC, I'd suggest letting this routine set the GUC, as there is only one
caller of SwitchWALSourceToPrimary(). This can also include a check
on SWITCH_TO_STREAMING_NONE, based on what I'm reading there.
Firstly, wal_source_switch_state is not a GUC; it's a static variable
used across WaitForWALToBecomeAvailable calls. And, if you are
suggesting changing SwitchWALSourceToPrimary so that it sets
wal_source_switch_state directly, I'd not do that, because then the
function gets called unnecessarily even when wal_source_switch_state is
not SWITCH_TO_STREAMING_NONE.
- if (lastSourceFailed)
+ if (lastSourceFailed ||
+ wal_source_switch_state == SWITCH_TO_STREAMING)

Hmm. This one may be tricky. I'd recommend a separation between the
failure in reading from a source and the switch to a new "forced"
source.
Separation would just add duplicate code. Moreover, the code wrapped
within if (lastSourceFailed) doesn't do any error handling or the like; it
just resets a few things from the previous source and sets the next
source.
FWIW, please check [1] (and the discussion thereon) for how the
lastSourceFailed flag is being used to consume all the streamed WAL in
pg_wal directly upon detecting the promotion trigger file. Therefore, I
see no problem with the way it is right now for this new feature.

[1]:
/*
* Data not here yet. Check for trigger, then wait for
* walreceiver to wake us up when new WAL arrives.
*/
if (CheckForStandbyTrigger())
{
/*
* Note that we don't return XLREAD_FAIL immediately
* here. After being triggered, we still want to
* replay all the WAL that was already streamed. It's
* in pg_wal now, so we just treat this as a failure,
* and the state machine will move on to replay the
* streamed WAL from pg_wal, and then recheck the
* trigger and exit replay.
*/
lastSourceFailed = true;
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ readFrom = XLOG_FROM_PG_WAL;
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;

WALSourceSwitchState looks confusing here, and are you sure that this
is actually correct? Shouldn't we still try a READ_FROM_ANY or a read
from the archives even with a streaming switch pending?
Please see the discussion starting from
/messages/by-id/20221008215221.GA894639@nathanxps13.
We wanted to keep the existing behaviour the same when we
intentionally switch source to streaming from archive due to the
timeout. The existing behaviour is to exhaust WAL in pg_wal before
switching the source to streaming after failure to fetch from archive.
When wal_source_switch_state is SWITCH_TO_STREAMING_PENDING, the
currentSource is already XLOG_FROM_ARCHIVE (To clear the dust off
here, I've added an assert now in the attached new v24 patch). And, we
don't want to pass XLOG_FROM_ANY to XLogFileReadAnyTLI to again fetch
from the archive. Hence, we choose readFrom = XLOG_FROM_PG_WAL to
specifically tell XLogFileReadAnyTLI to read from pg_wal directly.
By the way, I am not convinced that what you have is the best
interface ever. This assumes that we'd always want to switch to
streaming more aggressively. Could there be a point in also
controlling if we should switch to pg_wal/ or just to archiving more
aggressively as well, aka be able to do the opposite switch of WAL
source? This design looks somewhat limited to me. The origin of the
issue is that we don't have a way to control the order of the sources
consumed by WAL replay. Perhaps something like a replay_source_order
that uses a list would be better, with elements settable to archive,
pg_wal and streaming?
The intention of this feature is to provide a way for the streaming
standby to quickly detect when the primary is up and running, without
having to wait until either all the WAL in the archive is exhausted or a
failure to fetch from the archive happens. Advantages of this feature are:
1) it can make recovery a bit faster (if fetching from the archive
adds costs due to different storage types, IO costs and network
delays), thus reducing the replication lag; 2) the primary (if using a
replication slot based streaming replication setup) doesn't have to
keep the required WAL for the standby for long durations, thus
reducing the risk of no-space-left-on-disk issues.
IMHO, it makes sense to have something like replay_source_order if
there's any use case that arises in future requiring the standby to
intentionally switch to pg_wal or archive. But not as part of this
feature.
Please see the attached v24 patch. I've added an assertion that the
current source is archive before calling XLogFileReadAnyTLI if
wal_source_switch_state is SWITCH_TO_STREAMING_PENDING. I've also
added the new enum WALSourceSwitchState to typedefs.list to make
pgindent adjust it correctly.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v24-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
From 2fb59e8edfec6c35322f2abfd31c5cb967234093 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 23 Mar 2024 09:21:50 +0000
Subject: [PATCH v24] Allow standby to switch WAL source from archive to
streaming
A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.
This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.
Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
doc/src/sgml/config.sgml | 49 +++++++
doc/src/sgml/high-availability.sgml | 15 ++-
src/backend/access/transam/xlogrecovery.c | 124 ++++++++++++++++--
src/backend/utils/misc/guc_tables.c | 12 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +
src/include/access/xlogrecovery.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/043_wal_source_switch.pl | 113 ++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 304 insertions(+), 15 deletions(-)
create mode 100644 src/test/recovery/t/043_wal_source_switch.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..40c2ae93d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5050,6 +5050,55 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval">
+ <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies amount of time after which standby attempts to switch WAL
+ source from archive to streaming replication (i.e., getting WAL from
+ primary). However, the standby exhausts all the WAL present in
+ <filename>pg_wal</filename> directory before switching. If the standby
+ fails to switch to stream mode, it falls back to archive mode. If this
+ parameter value is specified without units, it is taken as seconds.
+ With a lower value for this parameter, the standby makes frequent WAL
+ source switch attempts. To avoid this, it is recommended to set a
+ value depending on the rate of WAL generation on the primary. If the
+ WAL files grow faster, the disk runs out of space sooner, so setting a
+ value to make frequent WAL source switch attempts can help. The default
+ is zero, disabling this feature. When disabled, the standby typically
+ switches to stream mode only after receiving WAL from archive finishes
+ (i.e., no more WAL left there) or fails for any reason. This parameter
+ can only be set in the <filename>postgresql.conf</filename> file or on
+ the server command line.
+ </para>
+ <note>
+ <para>
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ <varname>streaming_replication_retry_interval</varname> intervals. For
+ example, if the parameter is set to <literal>1min</literal> and
+ fetching WAL file from archive takes about <literal>2min</literal>,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+ </para>
+ </note>
+ <para>
+ Reading WAL from archive may not always be as efficient and fast as
+ reading from primary. This can be due to the differences in disk types,
+ IO costs, network latencies etc. All of these can impact the recovery
+ performance on standby, and can increase the replication lag on
+ primary. In addition, the primary keeps accumulating WAL needed for the
+ standby while the standby reads WAL from archive, because the standby
+ replication slot stays inactive. To avoid these problems, one can use
+ this parameter to make standby switch to stream mode sooner.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay">
<term><varname>recovery_min_apply_delay</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b48209fc2f..a4e555c6a1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see <xref linkend="guc-restore-command"/>) or directly from the primary
- over a TCP connection (streaming replication). The standby server will
- also attempt to restore any WAL found in the standby cluster's
- <filename>pg_wal</filename> directory. That typically happens after a server
- restart, when the standby replays again WAL that was streamed from the
- primary before the restart, but you can also manually copy files to
- <filename>pg_wal</filename> at any time to have them replayed.
+ over a TCP connection (streaming replication) or attempt to switch to
+ streaming replication after reading from archive when
+ <xref linkend="guc-streaming-replication-retry-interval"/> parameter is
+ set. The standby server will also attempt to restore any WAL found in the
+ standby cluster's <filename>pg_wal</filename> directory. That typically
+ happens after a server restart, when the standby replays again WAL that was
+ streamed from the primary before the restart, but you can also manually
+ copy files to <filename>pg_wal</filename> at any time to have them
+ replayed.
</para>
<para>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..013a28503d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -91,6 +91,7 @@ TimestampTz recoveryTargetTime;
const char *recoveryTargetName;
XLogRecPtr recoveryTargetLSN;
int recovery_min_apply_delay = 0;
+int streaming_replication_retry_interval = 0;
/* options formerly taken from recovery.conf for XLOG streaming */
char *PrimaryConnInfo = NULL;
@@ -297,6 +298,20 @@ bool reachedConsistency = false;
static char *replay_image_masked = NULL;
static char *primary_image_masked = NULL;
+/*
+ * Represents WAL source switch states.
+ */
+typedef enum
+{
+ SWITCH_TO_STREAMING_NONE, /* no switch necessary */
+ SWITCH_TO_STREAMING_PENDING, /* exhausting pg_wal */
+ SWITCH_TO_STREAMING, /* switch to streaming now */
+} WALSourceSwitchState;
+
+static WALSourceSwitchState wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+
+/* Holds the timestamp at which standby switched WAL source to archive */
+static TimestampTz switched_to_archive_at = 0;
/*
* Shared-memory state for WAL recovery.
@@ -440,6 +455,8 @@ static bool HotStandbyActiveInReplay(void);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void SetLatestXTime(TimestampTz xtime);
+static WALSourceSwitchState SwitchWALSourceToPrimary(void);
+
/*
* Initialization of shared memory for WAL recovery
*/
@@ -3546,6 +3563,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
static TimestampTz last_fail_time = 0;
TimestampTz now;
bool streaming_reply_sent = false;
+ XLogSource readFrom;
/*-------
* Standby mode is implemented by a state machine:
@@ -3565,6 +3583,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*
+ * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for
+ * at least streaming_replication_retry_interval seconds. However,
+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.
+ *
* If standby mode is turned off while reading WAL from stream, we move
* to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching
* the files (which would be required at end of recovery, e.g., timeline
@@ -3588,12 +3612,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool startWalReceiver = false;
/*
- * First check if we failed to read from the current source, and
+ * First check if we failed to read from the current source or we
+ * intentionally want to switch the source from archive to stream, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (lastSourceFailed ||
+ wal_source_switch_state == SWITCH_TO_STREAMING)
{
/*
* Don't allow any retry loops to occur during nonblocking
@@ -3732,9 +3758,26 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
if (currentSource != oldSource)
- elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ {
+ /* Save the timestamp at which we are switching to archive */
+ if (currentSource == XLOG_FROM_ARCHIVE)
+ switched_to_archive_at = GetCurrentTimestamp();
+
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],
+ xlogSourceNames[currentSource],
+ ((wal_source_switch_state == SWITCH_TO_STREAMING) ? "timeout" : (lastSourceFailed ? "failure" : "success"))));
+
+ /* Reset the WAL source switch state */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING)
+ {
+ Assert(oldSource == XLOG_FROM_ARCHIVE);
+ Assert(currentSource == XLOG_FROM_STREAM);
+
+ wal_source_switch_state = SWITCH_TO_STREAMING_NONE;
+ }
+ }
/*
* We've now handled possible failure. Try to read from the chosen
@@ -3763,13 +3806,30 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* See if we can switch WAL source to streaming */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+ wal_source_switch_state = SwitchWALSourceToPrimary();
+
/*
* Try to restore the file from archive, or read an existing
- * file from pg_wal.
+ * file from pg_wal. However, before switching WAL source to
+ * streaming, give it a chance to read all the WAL from
+ * pg_wal.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
- currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
- currentSource);
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ {
+ /*
+ * We already are in archive when we are trying to switch
+ * to streaming, see SwitchWALSourceToPrimary.
+ */
+ Assert(currentSource == XLOG_FROM_ARCHIVE);
+ readFrom = XLOG_FROM_PG_WAL;
+ }
+ else
+ readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+ XLOG_FROM_ANY : currentSource;
+
+ readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
if (readFile >= 0)
return XLREAD_SUCCESS; /* success! */
@@ -3777,6 +3837,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
+
+ /*
+ * Read all the WAL in pg_wal. Now ready to switch to
+ * streaming.
+ */
+ if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+ wal_source_switch_state = SWITCH_TO_STREAMING;
+
break;
case XLOG_FROM_STREAM:
@@ -4007,6 +4075,44 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return XLREAD_FAIL; /* not reached */
}
+/*
+ * Check if standby can make an attempt to read WAL from primary after reading
+ * from archive for at least a configurable duration.
+ *
+ * Reading WAL from archive may not always be as efficient and fast as reading
+ * from primary. This can be due to the differences in disk types, IO costs,
+ * network latencies etc. All of these can impact the recovery performance on
+ * standby and increase the replication lag on primary. In addition, the
+ * primary keeps accumulating WAL needed for the standby while the standby
+ * reads WAL from archive because the standby replication slot stays inactive.
+ * To avoid these problems, the standby will try to switch to stream mode
+ * sooner.
+ */
+static WALSourceSwitchState
+SwitchWALSourceToPrimary(void)
+{
+ TimestampTz now;
+
+ if (streaming_replication_retry_interval <= 0 ||
+ !StandbyMode ||
+ currentSource != XLOG_FROM_ARCHIVE)
+ return SWITCH_TO_STREAMING_NONE;
+
+ now = GetCurrentTimestamp();
+
+ /* First time through */
+ if (switched_to_archive_at == 0)
+ {
+ switched_to_archive_at = now;
+ return SWITCH_TO_STREAMING_NONE;
+ }
+
+ if (TimestampDifferenceExceeds(switched_to_archive_at, now,
+ streaming_replication_retry_interval * 1000))
+ return SWITCH_TO_STREAMING_PENDING;
+
+ return SWITCH_TO_STREAMING_NONE;
+}
/*
* Determine what log level should be used to report a corrupt WAL record
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a..7bd7f4674f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3273,6 +3273,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the time after which standby attempts to switch WAL "
+ "source from archive to streaming replication."),
+ gettext_noop("0 turns this feature off."),
+ GUC_UNIT_S
+ },
+ &streaming_replication_retry_interval,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows the size of write ahead log segments."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..deb4be7809 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -371,6 +371,9 @@
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
+#streaming_replication_retry_interval = 0 # time after which standby
+ # attempts to switch WAL source from archive to
+ # streaming replication in seconds; 0 disables
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
#sync_replication_slots = off # enables slot synchronization on the physical standby from the primary
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..73c5a86f4c 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName;
extern PGDLLIMPORT char *recoveryRestoreCommand;
extern PGDLLIMPORT char *recoveryEndCommand;
extern PGDLLIMPORT char *archiveCleanupCommand;
+extern PGDLLIMPORT int streaming_replication_retry_interval;
/* indirectly set via GUC system */
extern PGDLLIMPORT TransactionId recoveryTargetXid;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..fdf5814a22 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
't/040_standby_failover_slots_sync.pl',
't/041_checkpoint_at_promote.pl',
't/042_low_level_backup.pl',
+ 't/043_wal_source_switch.pl',
],
},
}
diff --git a/src/test/recovery/t/043_wal_source_switch.pl b/src/test/recovery/t/043_wal_source_switch.pl
new file mode 100644
index 0000000000..b00ed29f73
--- /dev/null
+++ b/src/test/recovery/t/043_wal_source_switch.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Checks for WAL source switch feature.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+
+# Ensure checkpoint doesn't come in our way
+$node_primary->append_conf(
+ 'postgresql.conf', qq(
+checkpoint_timeout = 1h
+autovacuum = off
+));
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+ "SELECT pg_create_physical_replication_slot('standby_slot')");
+
+# And some content
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup(
+ $node_primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
+
+my $retry_interval = 1;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+primary_slot_name = 'standby_slot'
+streaming_replication_retry_interval = '${retry_interval}s'
+log_min_messages = 'debug2'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+# Generate some data on the primary while the standby is down
+$node_standby->stop;
+for my $i (1 .. 10)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO tab_int VALUES (generate_series(11, 20));");
+ $node_primary->safe_psql('postgres', "SELECT pg_switch_wal();");
+}
+
+# Now wait for replay to complete on standby. We're done waiting when the
+# standby has replayed up to the previously saved primary LSN.
+my $cur_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Generate 1 more WAL file so that we wait predictably for the archiving of
+# all WAL files.
+$node_primary->advance_wal(1);
+
+my $walfile_name =
+ $node_primary->safe_psql('postgres', "SELECT pg_walfile_name('$cur_lsn')");
+
+$node_primary->poll_query_until('postgres',
+ "SELECT count(*) = 1 FROM pg_stat_archiver WHERE last_archived_wal = '$walfile_name';"
+) or die "Timed out while waiting for archiving of WAL by primary";
+
+my $offset = -s $node_standby->logfile;
+
+# Standby initially fetches WAL from archive after the restart. Since it is
+# asked to retry fetching from primary after retry interval
+# (i.e. streaming_replication_retry_interval), it will do so. To mimic the
+# standby spending some time fetching from archive, we use apply delay
+# (i.e. recovery_min_apply_delay) greater than the retry interval, so that for
+# fetching the next WAL file the standby honours retry interval and fetches it
+# from primary.
+my $delay = $retry_interval * 5;
+$node_standby->append_conf(
+ 'postgresql.conf', qq(
+recovery_min_apply_delay = '${delay}s'
+));
+$node_standby->start;
+
+# Wait until standby has replayed enough data
+$node_primary->wait_for_catchup($node_standby);
+
+$node_standby->wait_for_log(
+ qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source from archive to stream after timeout/,
+ $offset);
+$node_standby->wait_for_log(
+ qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/,
+ $offset);
+
+# Check that the data from primary is streamed to standby
+my $row_cnt1 =
+ $node_primary->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+
+my $row_cnt2 =
+ $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int;");
+is($row_cnt1, $row_cnt2, 'data from primary is streamed to standby');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e2a0525dd4..3a710dcb90 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3031,6 +3031,7 @@ WALReadError
WALSegmentCloseCB
WALSegmentContext
WALSegmentOpenCB
+WALSourceSwitchState
WCHAR
WCOKind
WFW_WaitOption
--
2.34.1
Hi,
I took a brief look at the patch.
From a motivation aspect, I can see this being useful for
synchronous replicas if you have commit set to flush mode.
So +1 on the feature and the easier configurability, although thinking
about it more, you could probably have the restore script be smarter and
return non-zero exit codes periodically.
The patch needs to be rebased but I tested this against an older 17 build.
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],

Not sure if you're intentionally changing to DEBUG1 from DEBUG2.
* standby and increase the replication lag on primary.
Do you mean "increase replication lag on standby"?
nit: reading from the archive *could* be faster since, in theory, it's
not single-processed/threaded.
However,

+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.

I think I'm missing how this happens. Or what "successful" means. If I'm reading
it right, no matter what happens we will always move to
XLOG_FROM_STREAM based on how
the state machine works?
I tested this in a basic RR setup without replication slots (e.g. log
shipping) where the
WAL is available in the archive but the primary always has the WAL
rotated out and
'streaming_replication_retry_interval = 1'. This leads the RR to
become stuck where it stops fetching from
archive and loops between XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.
When 'streaming_replication_retry_interval' is breached, we transition
from {currentSource, wal_source_switch_state}
{XLOG_FROM_ARCHIVE, SWITCH_TO_STREAMING_NONE} -> {XLOG_FROM_ARCHIVE,
SWITCH_TO_STREAMING_PENDING} with readFrom = XLOG_FROM_PG_WAL.
That reads the last record successfully in pg_wal and then fails to
read the next one because it doesn't exist, transitioning to
{XLOG_FROM_STREAM, SWITCH_TO_STREAMING_PENDING}.
XLOG_FROM_STREAM fails because the WAL is no longer there on the primary,
so it sets it back to {XLOG_FROM_ARCHIVE, SWITCH_TO_STREAMING_PENDING}.
last_fail_time = now;
currentSource = XLOG_FROM_ARCHIVE;
break;
Since the state is still SWITCH_TO_STREAMING_PENDING from the previous
loops, it forces
Assert(currentSource == XLOG_FROM_ARCHIVE);
readFrom = XLOG_FROM_PG_WAL;
...
readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);
And this readFile call seems to always succeed since it can read the
latest WAL record but not the next one, which is in the archive, leading
to a transition back to XLOG_FROM_STREAM, and this repeats.
/*
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
I don't think this gets triggered for the XLOG_FROM_PG_WAL case, which
means the safety check you added doesn't actually kick in.
if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
{
    wal_source_switch_state = SWITCH_TO_STREAMING;
    elog(LOG, "SWITCH_TO_STREAMING_PENDING TO SWITCH_TO_STREAMING");
}
Thanks
--
John Hsu - Amazon Web Services
Hi,
Thanks for looking into this.
On Fri, Aug 23, 2024 at 5:03 AM John H <johnhyvr@gmail.com> wrote:
From a motivation aspect, I can see this being useful for
synchronous replicas if you have commit set to flush mode.
In a synchronous replication setup, until the standby finishes fetching
WAL from the archive, commits on the primary have to wait, which can
increase query latency. If the standby can connect to the primary
as soon as the broken connection is restored, it can fetch the WAL
sooner and transaction commits can continue on the primary. Is my
understanding correct? Is there anything more to this?
I talked to Michael Paquier at PGConf.Dev 2024 and got some concerns
about this feature for dealing with changing timelines. I can't think
of them right now.
And, there were some cautions raised upthread -
/messages/by-id/20240305020452.GA3373526@nathanxps13
and /messages/by-id/ZffaQt7UbM2Q9kYh@paquier.xyz.
So +1 on the feature and the easier configurability, although thinking
about it more, you could probably have the restore script be smarter and
return non-zero exit codes periodically.
Interesting. Yes, the restore script has to be smarter to detect the
broken connections and distinguish whether the server is performing
just archive recovery/PITR or streaming as a standby. Not doing it
right could, perhaps, cause data loss (?).
The patch needs to be rebased but I tested this against an older 17 build.
Will rebase soon.
+ ereport(DEBUG1,
+ errmsg_internal("switched WAL source from %s to %s after %s",
+ xlogSourceNames[oldSource],

Not sure if you're intentionally changing to DEBUG1 from DEBUG2.
Will change.
* standby and increase the replication lag on primary.
Do you mean "increase replication lag on standby"?
nit: reading from the archive *could* be faster since, in theory, it's
not single-processed/threaded.
Yes. I think we can just say "All of these can impact the recovery
performance on standby and increase the replication lag."
However,

+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.

I think I'm missing how this happens. Or what "successful" means. If I'm reading
it right, no matter what happens we will always move to
XLOG_FROM_STREAM based on how
the state machine works?
Please have a look at some discussion upthread on exhausting pg_wal
before switching -
/messages/by-id/20230119005014.GA3838170@nathanxps13.
Even today, the standby exhausts pg_wal before switching to streaming
from the archive.
I tested this in a basic RR setup without replication slots (e.g. log
shipping) where the
WAL is available in the archive but the primary always has the WAL
rotated out and
'streaming_replication_retry_interval = 1'. This leads the RR to
become stuck where it stops fetching from
archive and loops between XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.
Nice catch. This is a problem. One idea is to disable the
streaming_replication_retry_interval feature for slot-less streaming
replication - either disallow the GUC to be set (in an assign_hook) when
primary_slot_name isn't specified, or check for the slot when deciding
to switch the WAL source. Thoughts?
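If we go with the latter, a minimal sketch would be an extra condition in
SwitchWALSourceToPrimary(), roughly as below (whether PrimarySlotName is
the right thing to test here is exactly the open question, so treat this
as illustrative only):

    static WALSourceSwitchState
    SwitchWALSourceToPrimary(void)
    {
        TimestampTz now;

        /*
         * Sketch: additionally require a configured primary_slot_name, so
         * that slot-less setups (where the primary may have already recycled
         * the WAL the standby still needs) never force the switch.
         */
        if (streaming_replication_retry_interval <= 0 ||
            !StandbyMode ||
            currentSource != XLOG_FROM_ARCHIVE ||
            PrimarySlotName == NULL || PrimarySlotName[0] == '\0')
            return SWITCH_TO_STREAMING_NONE;

        now = GetCurrentTimestamp();

        /* First time through */
        if (switched_to_archive_at == 0)
        {
            switched_to_archive_at = now;
            return SWITCH_TO_STREAMING_NONE;
        }

        if (TimestampDifferenceExceeds(switched_to_archive_at, now,
                                       streaming_replication_retry_interval * 1000))
            return SWITCH_TO_STREAMING_PENDING;

        return SWITCH_TO_STREAMING_NONE;
    }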
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On Thu, Aug 29, 2024 at 6:32 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
In a synchronous replication setup, until the standby finishes fetching
WAL from the archive, commits on the primary have to wait, which can
increase query latency. If the standby can connect to the primary
as soon as the broken connection is restored, it can fetch the WAL
sooner and transaction commits can continue on the primary. Is my
understanding correct? Is there anything more to this?
Yup, if you're running with synchronous_commit = 'on' with
synchronous_replicas, then you can
have the replica continue streaming changes into pg_wal faster than
WAL replay so commits
may be unblocked faster.
I talked to Michael Paquier at PGConf.Dev 2024 and got some concerns
about this feature for dealing with changing timelines. I can't think
of them right now.
I'm not sure what the risk would be if the WAL/history files we sync
from streaming are the same as what we replay from the archive.
And, there were some cautions raised upthread -
/messages/by-id/20240305020452.GA3373526@nathanxps13
and /messages/by-id/ZffaQt7UbM2Q9kYh@paquier.xyz.
Yup agreed. I need to understand this area a lot better before I can
do a more in-depth review.
Interesting. Yes, the restore script has to be smarter to detect the
broken connections and distinguish whether the server is performing
just archive recovery/PITR or streaming as a standby. Not doing it
right could, perhaps, cause data loss (?).
I don't think there would be data loss, only that replay is stuck or
slowed down. It wouldn't be any different from today if the restore
script returned a non-zero exit status.
The end-user could configure their restore-script to return a non-zero
status, based on some
condition, to move to streaming.
However,

+ * exhaust all the WAL present in pg_wal before switching. If successful,
+ * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
+ * back to XLOG_FROM_ARCHIVE state.

I think I'm missing how this happens. Or what "successful" means. If I'm reading
it right, no matter what happens we will always move to
XLOG_FROM_STREAM based on how
the state machine works?

Please have a look at some discussion upthread on exhausting pg_wal
before switching -
/messages/by-id/20230119005014.GA3838170@nathanxps13.
Even today, the standby exhausts pg_wal before switching to streaming
from the archive.
I'm getting caught on the word "successful". My rough understanding of
WaitForWALToBecomeAvailable is that once you're in XLOG_FROM_PG_WAL, if it was
unsuccessful for whatever reason, it will still transition to
XLOG_FROM_STREAM.
It does not loop back to XLOG_FROM_ARCHIVE if XLOG_FROM_PG_WAL fails.
Nice catch. This is a problem. One idea is to disable the
streaming_replication_retry_interval feature for slot-less streaming
replication - either disallow the GUC to be set (in an assign_hook) when
primary_slot_name isn't specified, or check for the slot when deciding
to switch the WAL source. Thoughts?
I don't think it's dependent on slot-less streaming. You would also run into the
issue if the WAL is no longer there on the primary, which can occur
with 'max_slot_wal_keep_size'
as well.
IMO the guarantee we need to make is that when we transition from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE for a "fresh start", we should
attempt to restore from the archive at least once.
I think this means that wal_source_switch_state should be reset back
to SWITCH_TO_STREAMING_NONE whenever we transition to XLOG_FROM_ARCHIVE.
We've attempted the switch to streaming once, so let's not continually
re-try if it failed.
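A minimal sketch of that idea, assuming the reset belongs in the failure
path where the state machine falls back from XLOG_FROM_STREAM to
XLOG_FROM_ARCHIVE in WaitForWALToBecomeAvailable() (the exact placement
is a guess on my part):

    /* Sketch: where streaming fails and we fall back to the archive */

    /*
     * Forget any forced switch when falling back from streaming, so the
     * standby retries the archive (and pg_wal) at least once before
     * attempting to stream again.
     */
    wal_source_switch_state = SWITCH_TO_STREAMING_NONE;

    last_fail_time = now;
    currentSource = XLOG_FROM_ARCHIVE;
    break;

That way the archive gets at least one full attempt per fallback before
the retry-interval logic can force streaming again.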
Thanks,
--
John Hsu - Amazon Web Services
On 23 Mar 2024, at 14:22, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
IMHO, it makes sense to have something like replay_source_order if
there's any use case that arises in future requiring the standby to
intentionally switch to pg_wal or archive. But not as part of this
feature.
IMO, it's a vital part of the feature.
In my observation, restore from archive is many orders of magnitude faster than streaming replication. Advanced archive tools employ compression (x6 speedup), download parallelism (x4), are not constrained by the primary's network limits (x3) or disk limits, and do not depend on the complicated FEBE protocol, etc.
When I have to cope with a lagging replica, I almost always kill the walreceiver and tweak server readahead.
But there might be cases where you still have to attach the replica ASAP. I can think of releasing a replication slot, or a transiently failed archive network or storage.
Finally, one might want to have many primary connections: a cascading replica might want to stream from any available host in a group of HA hosts.
Best regards, Andrey Borodin.
On Thu, Jan 02, 2025 at 11:12:37PM +0500, Andrey M. Borodin wrote:
In my observation, restore from archive is many orders of magnitude
faster than streaming replication. Advanced archive tools employ
compression (x6 speedup), download parallelism (x4), are not
constrained by the primary's network limits (x3) or disk limits, and do
not depend on the complicated FEBE protocol, etc.
This is a fair argument in terms of the flexibility of what can be
achieved on a file basis, yes, because you are not bottlenecked by
the existing replication protocol, can request files ahead of time
if necessary, and can decide what you want within a single
restore_command or archive_command (or a module for the latter).
It may be relevant to think in terms of what could be done at the
protocol level to retrieve batches of WAL segments, so that the backend
has better control over how each segment is handled in a batch, or to
provide better in-core tools to achieve that with the existing two
command GUCs for restore and archiving. Nathan has also proposed
restore modules a couple of months ago, because relying on commands can
be very fancy in terms of error handling. And we already have the
archive module part.
--
Michael