Unnecessary delay in streaming replication due to replay lag

Started by Asim R P, almost 6 years ago, 31 messages
#1 Asim R P
apraveen@pivotal.io
3 attachment(s)

Hi

The standby does not start the walreceiver process until the startup
process finishes WAL replay. The more WAL there is to replay, the
longer the delay in starting streaming replication. If the replication
connection is temporarily disconnected, this delay becomes a major
problem, and we are proposing a solution to avoid it.
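
This is easy to observe on a lagging standby.  As a rough sketch
(assuming PostgreSQL 10 or later, where this view and function exist):

    -- On the standby: no row here means walreceiver has not been
    -- started yet, even though replay keeps advancing.
    SELECT pid, status FROM pg_stat_wal_receiver;
    SELECT pg_last_wal_replay_lsn();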

WAL replay is likely to fall behind when the master is processing a
write-heavy workload, because WAL is generated by concurrently running
backends on the master while a single startup process on the standby
replays WAL records in sequence as new WAL is received from the master.
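
The lag itself can be measured on the standby, for example (again
assuming PostgreSQL 10 or later):

    -- Bytes of WAL received but not yet replayed by the startup process.
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_lag_bytes;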

The replication connection between walsender and walreceiver may break
for reasons such as a transient network issue or the standby going
through a restart. The delay in resuming the replication connection
reduces high availability: only one copy of the WAL is available
during this period.

The problem worsens when the replication is configured to be
synchronous. Commits on the master must wait until WAL replay is
finished on the standby, the walreceiver is then started, and it
confirms flush of WAL up to the commit LSN. If the synchronous_commit
GUC is set to remote_write, this behavior is equivalent to tacitly
changing it to remote_apply until the replication connection is
re-established!
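
While the walreceiver is down, the stuck commits are visible on the
master; a quick check (wait event names as in current releases):

    -- Backends blocked in synchronous commit wait on the SyncRep event.
    SELECT pid, state, wait_event_type, wait_event
      FROM pg_stat_activity
     WHERE wait_event = 'SyncRep';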

Has anyone encountered such a problem with streaming replication?

We propose to address this by starting the walreceiver without waiting
for the startup process to finish replay of WAL. Please see the
attached patchset. It can be summarized as follows:

0001 - TAP test to demonstrate the problem.

0002 - The standby startup sequence is changed such that
walreceiver is started by startup process before it begins
to replay WAL.

0003 - Postmaster starts walreceiver if it finds that a
walreceiver process is no longer running and the state
indicates that it is operating as a standby.

This is a POC; we are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

Hao and Asim

Attachments:

0001-Test-that-replay-of-WAL-logs-on-standby-does-not-aff.patch
From 5a1b1062dae7ab5f28b14907f493df106607c5d8 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Thu, 16 Jan 2020 17:57:51 +0530
Subject: [PATCH 1/3] Test that replay of WAL logs on standby does not affect
 syncrep

A new debug GUC is introduced to slow down WAL replay on standby.  The
test sets up synchronous replication and sets the GUC, causing startup
process on standby to sleep for a few seconds before replaying each WAL
record.  The test then inserts data on master to build replay lag.

The replication connection is broken and new commits are made on master.
The test expects the commits to not block, in spite of the replay lag.
It fails if the commits take longer than a timeout.  The value of this
timeout is much less than the total replay lag.  If the commits do not
block, it is confirmed that the WAL streaming is re-established without
waiting for the startup process to finish replaying WAL already
available in pg_wal directory.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c             |   3 +
 src/backend/utils/misc/guc.c                  |  11 ++
 src/include/access/xlog.h                     |   1 +
 src/test/recovery/t/018_replay_lag_syncrep.pl | 169 ++++++++++++++++++++++++++
 4 files changed, 184 insertions(+)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..05d72bfd4b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -87,6 +87,7 @@ int			max_wal_size_mb = 1024; /* 1 GB */
 int			min_wal_size_mb = 80;	/* 80 MB */
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
+int			debug_replay_delay = 0;
 int			XLogArchiveTimeout = 0;
 int			XLogArchiveMode = ARCHIVE_MODE_OFF;
 char	   *XLogArchiveCommand = NULL;
@@ -7198,6 +7199,8 @@ StartupXLOG(void)
 
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
+				if (debug_replay_delay > 0)
+					pg_usleep(debug_replay_delay * 1000 * 1000);
 
 				/*
 				 * After redo, check whether the backup pages associated with
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e5f8a1301f..9d4e97d3d3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1987,6 +1987,17 @@ static struct config_bool ConfigureNamesBool[] =
 
 static struct config_int ConfigureNamesInt[] =
 {
+	{
+		{"debug_replay_delay", PGC_SUSET, WAL_ARCHIVING,
+			gettext_noop("Slow down replay process by sleeping for N seconds "
+						 "before starting to replay a WAL record."),
+			NULL,
+			GUC_UNIT_S
+		},
+		&debug_replay_delay,
+		0, 0, INT_MAX / 2,
+		NULL, NULL, NULL
+	},
 	{
 		{"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING,
 			gettext_noop("Forces a switch to the next WAL file if a "
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..3c8046f8a2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -110,6 +110,7 @@ extern int	max_wal_size_mb;
 extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
+extern int  debug_replay_delay;
 extern int	wal_retrieve_retry_interval;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 0000000000..7218e7e8d3
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,169 @@
+# This test demonstrates that synchronous replication is affected by
+# replay process on standby.  If the replay process lags far behind
+# and the replication connection is broken (e.g. temporary network
+# problem) the connection is not established again until the replay
+# process finishes replaying all WAL.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# each WAL record.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set debug_replay_delay TO 10;');
+$node_standby->reload;
+
+# Load data on master and induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test select i from generate_series(1,100) i;');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established because
+# postmaster will restart WAL receiver in its main loop.  Try to
+# commit a transaction with a timeout of 2 seconds.  The test expects
+# that the commit does not timeout.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 2,
+	timed_out => \$timed_out);
+
+# The insert should not timeout because synchronous replication is
+# re-established, even when startup process is still replaying
+# WAL already fetched in pg_wal/.
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# The replication connection should be re-established upon standby
+# restart.  Try to commit a transaction with a 2 second timeout.  The
+# timeout should not be hit because synchronous replication should be
+# re-established before startup process finishes the replay of WAL
+# already available in pg_wal/.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Reset the debug GUC, so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set debug_replay_delay TO 0;');
+$node_standby->reload;
+
+# Ideally, the insert after standby restart should not
+# timeout but it currently does, causing the test to fail.
+is($timed_out, 0, 'insert after standby restart');
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Standby should also have identical content, now that we've reset the
+# replay delay.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[103];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[104];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.14.3 (Apple Git-98)

0003-Start-WAL-receiver-when-it-is-found-not-running.patch
From 389db04863dd6f9964fbf71bac599294805d2f88 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Thu, 16 Jan 2020 17:57:52 +0530
Subject: [PATCH 3/3] Start WAL receiver when it is found not running

Postmaster now starts WAL receiver as soon as it is found not running
from ServerLoop.  This helps to resume streaming replication sooner
when a temporary network disruption causes WAL receiver process to
exit.

As a consequence, the race condition addressed in e5d494d78cf is
eliminated.  Postmaster may start WAL receiver in states that allow a
standby to operate, except PM_STARTUP.  It is not possible to
distinguish whether the postmaster is operating as master or standby in
PM_STARTUP state.

Postmaster attempts to start WAL receiver as long as a promote
request is not received and the state permits.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/postmaster/postmaster.c   | 43 +++++++++++++++++++++--------------
 src/backend/replication/walreceiver.c | 20 ++++++++++++----
 2 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a92dac525..5d0f8b0ebb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -360,8 +360,8 @@ static volatile sig_atomic_t start_autovac_launcher = false;
 /* the launcher needs to be signalled to communicate some condition */
 static volatile bool avlauncher_needs_signal = false;
 
-/* received START_WALRECEIVER signal */
-static volatile sig_atomic_t WalReceiverRequested = false;
+/* attempt to start WAL receiver, if not undergoing promotion */
+static volatile sig_atomic_t ReceivedPromoteRequest = false;
 
 /* set when there's a worker that needs to be started up */
 static volatile bool StartWorkerNeeded = true;
@@ -1795,8 +1795,11 @@ ServerLoop(void)
 				kill(AutoVacPID, SIGUSR2);
 		}
 
-		/* If we need to start a WAL receiver, try to do that now */
-		if (WalReceiverRequested)
+		/*
+		 * Start WAL receiver if it is not already running and standby mode
+		 * (or archive recovery) is enabled.
+		 */
+		if (!ReceivedPromoteRequest)
 			MaybeStartWalReceiver();
 
 		/* Get other worker processes running, if needed */
@@ -5282,8 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
 	if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
 	{
 		/* Startup Process wants us to start the walreceiver process. */
-		/* Start immediately if possible, else remember request for later. */
-		WalReceiverRequested = true;
 		MaybeStartWalReceiver();
 	}
 
@@ -5309,6 +5310,11 @@ sigusr1_handler(SIGNAL_ARGS)
 	{
 		/* Tell startup process to finish recovery */
 		signal_child(StartupPID, SIGUSR2);
+		/*
+		 * Do not attempt to restart wal receiver from now on.  Note that this
+		 * flag remains unchanged once set.
+		 */
+		ReceivedPromoteRequest = true;
 	}
 
 #ifdef WIN32
@@ -5604,26 +5610,29 @@ StartAutovacuumWorker(void)
  * MaybeStartWalReceiver
  *		Start the WAL receiver process, if not running and our state allows.
  *
- * Note: if WalReceiverPID is already nonzero, it might seem that we should
- * clear WalReceiverRequested.  However, there's a race condition if the
- * walreceiver terminates and the startup process immediately requests a new
- * one: it's quite possible to get the signal for the request before reaping
- * the dead walreceiver process.  Better to risk launching an extra
- * walreceiver than to miss launching one we need.  (The walreceiver code
- * has logic to recognize that it should go away if not needed.)
+ * Note: there is a race condition if the walreceiver terminates and the
+ * startup process immediately requests a new one: it's quite possible to get
+ * the signal for the request before reaping the dead walreceiver process.  It
+ * is alright to risk launching an extra walreceiver because the walreceiver
+ * code has logic to recognize that it should go away if not needed.
  */
 static void
 MaybeStartWalReceiver(void)
 {
 	if (WalReceiverPID == 0 &&
-		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+		/*
+		 * Cannot include PM_STARTUP here because it leads to starting WAL
+		 * receiver even after a standby is promoted.  The objective is to
+		 * start WAL receiver only when standby mode is enabled.  However,
+		 * pmState is set to PM_RECOVERY when standby mode as well as archive
+		 * recovery is enabled.  That means, postmaster cannot distinguish
+		 * between the two.  TODO: if this is a problem, address it somehow!
+		 */
+		(pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
 		Shutdown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
-		if (WalReceiverPID != 0)
-			WalReceiverRequested = false;
-		/* else leave the flag set, so we'll try again later */
 	}
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e9147d8322..64b0321227 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -241,21 +241,33 @@ WalReceiverMain(void)
 	 * waiting for us to start up, until it times out.
 	 */
 	SpinLockAcquire(&walrcv->mutex);
+	/*
+	 * TODO: postmaster may start walreceiver from ServerLoop.  Startup
+	 * process requests postmaster to start walreceiver on a couple of
+	 * occasions.  The requests from startup process are handled inside
+	 * SIGUSR1 handler.  It is possible that more than one walreceiver
+	 * processes attempt to start nearly simultaneously.  If this is indeed a
+	 * possibility, one solution seems to not start walreceiver from the
+	 * signal handler but only set a flag to do so.
+	 */
 	Assert(walrcv->pid == 0);
 	switch (walrcv->walRcvState)
 	{
 		case WALRCV_STOPPING:
 			/* If we've already been requested to stop, don't start up. */
 			walrcv->walRcvState = WALRCV_STOPPED;
-			/* fall through */
-
-		case WALRCV_STOPPED:
 			SpinLockRelease(&walrcv->mutex);
 			proc_exit(1);
 			break;
 
+		case WALRCV_STOPPED:
+			/*
+			 * Postmaster, upon noticing that WAL receiver is not running,
+			 * starts us from ServerLoop.
+			 */
+			/* fall through */
 		case WALRCV_STARTING:
-			/* The usual case */
+			/* The usual case - startup process requests WAL streaming. */
 			break;
 
 		case WALRCV_WAITING:
-- 
2.14.3 (Apple Git-98)

0002-Start-WAL-receiver-before-startup-process-replays-ex.patch
From bfe2c1ef7abf372d19b7ea4ef581963513e5db8b Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Thu, 16 Jan 2020 17:57:52 +0530
Subject: [PATCH 2/3] Start WAL receiver before startup process replays
 existing WAL

If the WAL receiver is started only after the startup process finishes
replaying WAL already available in pg_wal, synchronous replication is
adversely impacted.  Consider a temporary network outage causing the
streaming replication connection to break.  This leads to exit of the
WAL receiver process.  If the startup process has fallen behind, it may
take a long time to finish replaying WAL before the walreceiver is
started again to re-establish streaming replication.  Commits on the
master will have to wait all this while for the standby to flush WAL up
to the commit LSN.

This experience can be alleviated if the replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to make this happen by starting the WAL receiver from the
postmaster's ServerLoop as well as from the startup process, even
before WAL replay begins.

The start point to request streaming from is remembered in pg_control.
Before creating a new WAL segment file, WAL receiver records the new WAL
segment number in pg_control.  If the WAL receiver process exits and
must restart, the recorded segment number is used to generate a start
point, that is the first offset in the segment file to re-establish
streaming replication.

Alternatives we thought of (but did not implement) for persisting the
starting point: (1) postgresql.auto.conf file, similar to how
primary_conninfo is remembered.  This option requires creating a new GUC
that represents the starting point.  Start point is never set by a user,
so using a GUC to represent it does not seem appropriate.  (2) introduce
a new flat file.  This incurs the overhead to maintain an additional
flat file.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c          | 28 ++++++++++
 src/backend/replication/walreceiver.c      | 90 ++++++++++++++++++++++++++++++
 src/backend/replication/walreceiverfuncs.c | 20 +++++--
 src/bin/pg_controldata/pg_controldata.c    |  4 ++
 src/include/catalog/pg_control.h           |  7 +++
 5 files changed, 144 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 05d72bfd4b..da11129b09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6974,6 +6974,34 @@ StartupXLOG(void)
 			}
 		}
 
+		/*
+		 * Start WAL receiver without waiting for startup process to finish
+		 * replay, so that streaming replication is established at the
+		 * earliest.  When the replication is configured to be synchronous
+		 * this would unblock commits waiting for WAL to be written and/or
+		 * flushed by synchronous standby.
+		 */
+		if (StandbyModeRequested)
+		{
+			XLogRecPtr startpoint;
+			XLogSegNo startseg;
+			TimeLineID startpointTLI;
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			startseg = ControlFile->lastFlushedSeg;
+			startpointTLI = ControlFile->lastFlushedSegTLI;
+			LWLockRelease(ControlFileLock);
+			if (startpointTLI > 0)
+			{
+				elog(LOG, "found last flushed segment %lu on time line %d, starting WAL receiver",
+					 startseg, startpointTLI);
+				XLogSegNoOffsetToRecPtr(startseg, 0, wal_segment_size, startpoint);
+				RequestXLogStreaming(startpointTLI,
+									 startpoint,
+									 PrimaryConnInfo,
+									 PrimarySlotName);
+			}
+		}
+
 		/* Initialize resource managers */
 		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a5e85d32f3..e9147d8322 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
 #include "funcapi.h"
@@ -82,6 +83,8 @@ bool		hot_standby_feedback;
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
+static ControlFileData *ControlFile = NULL;
+
 #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
 
 /*
@@ -163,6 +166,45 @@ ProcessWalRcvInterrupts(void)
 }
 
 
+/*
+ * Persist startpoint to pg_control file.  This is used to start replication
+ * without waiting for startup process to let us know where to start streaming
+ * from.
+ */
+static void
+SaveStartPoint(XLogRecPtr startpoint, TimeLineID startpointTLI)
+{
+	XLogSegNo oldseg, startseg;
+	TimeLineID oldTLI;
+
+	XLByteToSeg(startpoint, startseg, wal_segment_size);
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+#ifdef USE_ASSERT_CHECKING
+	/*
+	 * On a given timeline, the WAL segment to start streaming from should
+	 * never move backwards.
+	 */
+	if (ControlFile->lastFlushedSegTLI == startpointTLI)
+		Assert(ControlFile->lastFlushedSeg <= startseg);
+#endif
+
+	oldseg = ControlFile->lastFlushedSeg;
+	oldTLI = ControlFile->lastFlushedSegTLI;
+	if (oldseg < startseg || oldTLI != startpointTLI)
+	{
+		ControlFile->lastFlushedSeg = startseg;
+		ControlFile->lastFlushedSegTLI = startpointTLI;
+		UpdateControlFile();
+		elog(DEBUG3,
+			 "lastFlushedSeg (seg, TLI) old: (%lu, %u), new: (%lu, %u)",
+			 oldseg, oldTLI, startseg, startpointTLI);
+	}
+
+	LWLockRelease(ControlFileLock);
+}
+
 /* Main entry point for walreceiver process */
 void
 WalReceiverMain(void)
@@ -304,6 +346,30 @@ WalReceiverMain(void)
 	if (sender_host)
 		pfree(sender_host);
 
+	bool found;
+	ControlFile = ShmemInitStruct("Control File", sizeof(ControlFileData), &found);
+	Assert(found);
+
+	XLogSegNo startseg;
+	XLByteToSeg(startpoint, startseg, wal_segment_size);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	if (startpointTLI == ControlFile->lastFlushedSegTLI &&
+		startseg < ControlFile->lastFlushedSeg)
+	{
+		/*
+		 * Advance startpoint to the flush point in control file.  The
+		 * startpoint may be behind like this when WAL receiver is started by
+		 * postmaster upon noticing that an existing WAL receiver child
+		 * process exited.  Postmaster does not update WalRcv->startpoint,
+		 * similar to how it's done in RequestXLogStreaming, because it should
+		 * refrain from touching shared memory.
+		 */
+		XLogSegNoOffsetToRecPtr(
+			ControlFile->lastFlushedSeg, 0, wal_segment_size, startpoint);
+	}
+	LWLockRelease(ControlFileLock);
+
 	first_stream = true;
 	for (;;)
 	{
@@ -407,10 +473,13 @@ WalReceiverMain(void)
 		if (walrcv_startstreaming(wrconn, &options))
 		{
 			if (first_stream)
+			{
 				ereport(LOG,
 						(errmsg("started streaming WAL from primary at %X/%X on timeline %u",
 								(uint32) (startpoint >> 32), (uint32) startpoint,
 								startpointTLI)));
+				SaveStartPoint(startpoint, startpointTLI);
+			}
 			else
 				ereport(LOG,
 						(errmsg("restarted WAL streaming at %X/%X on timeline %u",
@@ -1055,6 +1124,27 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
+			/*
+			 * When a WAL segment file is completely filled,
+			 * LogstreamResult.Flush points to the beginning of the new WAL
+			 * segment file that will be created shortly.  Before sending a
+			 * reply with a LSN from the new WAL segment for the first time,
+			 * remember the LSN in pg_control.  The LSN is used as the
+			 * startpoint to start streaming again if the WAL receiver process
+			 * exits and starts again.
+			 *
+			 * It is important to update the LSN's segment number in
+			 * pg_control before including it in a replay back to the WAL
+			 * sender.  Once WAL sender receives the flush LSN from standby
+			 * reply, any older WAL segments that do not contain the flush LSN
+			 * may be cleaned up.  If the WAL receiver dies after sending a
+			 * reply but before updating pg_control, it is possible that the
+			 * starting segment saved in pg_control is no longer available on
+			 * master when it attempts to resume streaming.
+			 */
+			if (XLogSegmentOffset(LogstreamResult.Flush, wal_segment_size) == 0)
+				SaveStartPoint(LogstreamResult.Flush, ThisTimeLineID);
+
 			XLogWalRcvSendReply(false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..955b8fcf83 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -239,10 +239,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -253,12 +249,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	else
 		walrcv->slotname[0] = '\0';
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in pg_xlog directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 19e21ab491..f98f36ffe5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -234,6 +234,10 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified:             %s\n"),
 		   pgctime_str);
+	printf(_("Latest flushed WAL segment number:    %lu\n"),
+		   ControlFile->lastFlushedSeg);
+	printf(_("Latest flushed TimeLineID:            %u\n"),
+		   ControlFile->lastFlushedSegTLI);
 	printf(_("Latest checkpoint location:           %X/%X\n"),
 		   (uint32) (ControlFile->checkPoint >> 32),
 		   (uint32) ControlFile->checkPoint);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..27260bbea5 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -143,6 +143,11 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * lastFlushedSeg is the WAL segment number of the most recently flushed
+	 * WAL file by walreceiver.  It is updated by walreceiver when a received
+	 * WAL record falls on a new WAL segment file.  This is used as the start
+	 * point to resume WAL streaming if it is stopped.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -165,6 +170,8 @@ typedef struct ControlFileData
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogSegNo	lastFlushedSeg;
+	TimeLineID	lastFlushedSegTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
-- 
2.14.3 (Apple Git-98)

#2 Michael Paquier
michael@paquier.xyz
In reply to: Asim R P (#1)
Re: Unnecessary delay in streaming replication due to replay lag

On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:

The standby does not start the walreceiver process until the startup
process finishes WAL replay. The more WAL there is to replay, the
longer the delay in starting streaming replication. If the replication
connection is temporarily disconnected, this delay becomes a major
problem, and we are proposing a solution to avoid it.

Yeah, that's documented:
/messages/by-id/20190910062325.GD11737@paquier.xyz

We propose to address this by starting the walreceiver without waiting
for the startup process to finish replay of WAL. Please see the
attached patchset. It can be summarized as follows:

0001 - TAP test to demonstrate the problem.

There is no real need for debug_replay_delay because we already have
recovery_min_apply_delay, no? That would count only after consistency
has been reached, and only for COMMIT records, but your test would be
enough with that.
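
For example, inducing the lag on the standby would just be (delay value
picked arbitrarily):

    ALTER SYSTEM SET recovery_min_apply_delay = '10s';
    SELECT pg_reload_conf();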

0002 - The standby startup sequence is changed such that
walreceiver is started by startup process before it begins
to replay WAL.

See below.

0003 - Postmaster starts walreceiver if it finds that a
walreceiver process is no longer running and the state
indicates that it is operating as a standby.

I have not checked in detail, but I smell some race conditions
between the postmaster and the startup process here.

This is a POC; we are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

You are not the first person interested in this problem, we have a
patch registered in this CF to control the timing when a WAL receiver
is started at recovery:
https://commitfest.postgresql.org/26/1995/
/messages/by-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

I am pretty sure that we should not change the default behavior, which
is to start the WAL receiver after replaying everything from the
archives, so as to avoid copying some WAL segments for nothing; being
able to use a GUC switch should be the way to go, and Konstantin's
latest patch was using this approach. Your patch 0002 visibly adds a
third mode, start immediately, on top of the two already proposed:
- Start after replaying all WAL available locally and in the
archives.
- Start after reaching a consistent point.
--
Michael

#3 Asim R P
apraveen@pivotal.io
In reply to: Michael Paquier (#2)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

On Fri, Jan 17, 2020 at 11:08 AM Michael Paquier <michael@paquier.xyz>
wrote:

On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:

0001 - TAP test to demonstrate the problem.

There is no real need for debug_replay_delay because we already have
recovery_min_apply_delay, no? That would count only after consistency
has been reached, and only for COMMIT records, but your test would be
enough with that.

Indeed, we didn't know about recovery_min_apply_delay. Thank you for
the suggestion, the updated test is attached.

This is a POC; we are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

You are not the first person interested in this problem, we have a
patch registered in this CF to control the timing when a WAL receiver
is started at recovery:
https://commitfest.postgresql.org/26/1995/

/messages/by-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

Great to know about this patch and the discussion. The test case and
the part of our patch that saves the next start point in the control
file can be combined with Konstantin's patch to solve this problem.
Let me work on that.

I am pretty sure that we should not change the default behavior, which
is to start the WAL receiver after replaying everything from the
archives, so as to avoid copying some WAL segments for nothing; being
able to use a GUC switch should be the way to go, and Konstantin's
latest patch was using this approach. Your patch 0002 visibly adds a
third mode, start immediately, on top of the two already proposed:
- Start after replaying all WAL available locally and in the
archives.
- Start after reaching a consistent point.

A consistent point should be reached fairly quickly, in spite of a
large replay lag. The minimum recovery point is updated during XLOG
flush, and that happens when a commit record is replayed. Commits
should occur frequently in the WAL stream. So I do not see much value
in starting the WAL receiver immediately as compared to starting it
after reaching a consistent point. Does that make sense?
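
For example, how far the consistent point has advanced can be checked
on the standby (assuming 9.6 or later):

    -- min_recovery_end_lsn advances as WAL is flushed during replay,
    -- e.g. when commit records are replayed.
    SELECT min_recovery_end_lsn, min_recovery_end_timeline
      FROM pg_control_recovery();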

That said, is there anything obviously wrong with starting the WAL
receiver immediately, even before reaching a consistent state? A
consequence is that the WAL receiver may overwrite a WAL segment while
the startup process is reading and replaying WAL from it. But that
doesn't appear to be a problem because the overwrite should happen with
content identical to what was there before.

Asim

Attachments:

v1-0001-Test-that-replay-of-WAL-logs-on-standby-does-not-.patch
From 233ad435986bb8c91b6948d66fdd840f434ae24b Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Fri, 17 Jan 2020 18:14:41 +0530
Subject: [PATCH v1 1/3] Test that replay of WAL logs on standby does not
 affect syncrep

The test sets up synchronous replication and induces replay lag.  The
replication connection is broken and new commits are made on master.
The test expects the commits to not block, in spite of the replay lag.
It fails if the commits take longer than a timeout.  The value of this
timeout is much less than the total replay lag.  If the commits do not
block, it is confirmed that the WAL streaming is re-established without
waiting for the startup process to finish replaying WAL already
available in pg_wal directory.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/test/recovery/t/018_replay_lag_syncrep.pl | 188 ++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 0000000000..9cd79fdc89
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,188 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Switch to a new WAL file after standby is created.  This gives the
+# standby a chance to save the new WAL file's beginning as replication
+# start point.
+$node_master->safe_psql('postgres',	'create table dummy(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of 2 seconds.  The timeout should not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 2,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# As in the previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with 2 second timeout.  The timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 2,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.14.3 (Apple Git-98)

v1-0003-Start-WAL-receiver-when-it-is-found-not-running.patch
From da18ad25c2b0308ff6ac87a1a63933acda1907cc Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Fri, 17 Jan 2020 18:14:44 +0530
Subject: [PATCH v1 3/3] Start WAL receiver when it is found not running

Postmaster now starts WAL receiver as soon as it is found not running
from ServerLoop.  This helps to resume streaming replication sooner
when a temporary network disruption causes WAL receiver process to
exit.

As a consequence, the race condition addressed in e5d494d78cf is
eliminated.  Postmaster may start WAL receiver in states that allow a
standby to operate, except PM_STARTUP.  It is not possible to
distinguish whether the postmaster is operating as master or standby in
PM_STARTUP state.

Postmaster attempts to start WAL receiver as long as a promote
request is not received and the state permits.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/postmaster/postmaster.c   | 43 +++++++++++++++++++++--------------
 src/backend/replication/walreceiver.c | 40 ++++++++++++++++++++++++++++----
 2 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a92dac525..5d0f8b0ebb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -360,8 +360,8 @@ static volatile sig_atomic_t start_autovac_launcher = false;
 /* the launcher needs to be signalled to communicate some condition */
 static volatile bool avlauncher_needs_signal = false;
 
-/* received START_WALRECEIVER signal */
-static volatile sig_atomic_t WalReceiverRequested = false;
+/* attempt to start WAL receiver, if not undergoing promotion */
+static volatile sig_atomic_t ReceivedPromoteRequest = false;
 
 /* set when there's a worker that needs to be started up */
 static volatile bool StartWorkerNeeded = true;
@@ -1795,8 +1795,11 @@ ServerLoop(void)
 				kill(AutoVacPID, SIGUSR2);
 		}
 
-		/* If we need to start a WAL receiver, try to do that now */
-		if (WalReceiverRequested)
+		/*
+		 * Start WAL receiver if it is not already running and standby mode
+		 * (or archive recovery) is enabled.
+		 */
+		if (!ReceivedPromoteRequest)
 			MaybeStartWalReceiver();
 
 		/* Get other worker processes running, if needed */
@@ -5282,8 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
 	if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
 	{
 		/* Startup Process wants us to start the walreceiver process. */
-		/* Start immediately if possible, else remember request for later. */
-		WalReceiverRequested = true;
 		MaybeStartWalReceiver();
 	}
 
@@ -5309,6 +5310,11 @@ sigusr1_handler(SIGNAL_ARGS)
 	{
 		/* Tell startup process to finish recovery */
 		signal_child(StartupPID, SIGUSR2);
+		/*
+		 * Do not attempt to restart wal receiver from now on.  Note that this
+		 * flag remains unchanged once set.
+		 */
+		ReceivedPromoteRequest = true;
 	}
 
 #ifdef WIN32
@@ -5604,26 +5610,29 @@ StartAutovacuumWorker(void)
  * MaybeStartWalReceiver
  *		Start the WAL receiver process, if not running and our state allows.
  *
- * Note: if WalReceiverPID is already nonzero, it might seem that we should
- * clear WalReceiverRequested.  However, there's a race condition if the
- * walreceiver terminates and the startup process immediately requests a new
- * one: it's quite possible to get the signal for the request before reaping
- * the dead walreceiver process.  Better to risk launching an extra
- * walreceiver than to miss launching one we need.  (The walreceiver code
- * has logic to recognize that it should go away if not needed.)
+ * Note: there is a race condition if the walreceiver terminates and the
+ * startup process immediately requests a new one: it's quite possible to get
+ * the signal for the request before reaping the dead walreceiver process.  It
+ * is alright to risk launching an extra walreceiver because the walreceiver
+ * code has logic to recognize that it should go away if not needed.
  */
 static void
 MaybeStartWalReceiver(void)
 {
 	if (WalReceiverPID == 0 &&
-		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+		/*
+		 * Cannot include PM_STARTUP here because it leads to starting WAL
+		 * receiver even after a standby is promoted.  The objective is to
+		 * start WAL receiver only when standby mode is enabled.  However,
+		 * pmState is set to PM_RECOVERY when standby mode as well as archive
+		 * recovery is enabled.  That means, postmaster cannot distinguish
+		 * between the two.  TODO: if this is a problem, address it somehow!
+		 */
+		(pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
 		Shutdown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
-		if (WalReceiverPID != 0)
-			WalReceiverRequested = false;
-		/* else leave the flag set, so we'll try again later */
 	}
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c862b65cae..9d91d9ca28 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -241,21 +241,33 @@ WalReceiverMain(void)
 	 * waiting for us to start up, until it times out.
 	 */
 	SpinLockAcquire(&walrcv->mutex);
+	/*
+	 * TODO: postmaster may start walreceiver from ServerLoop.  Startup
+	 * process requests postmaster to start walreceiver on a couple of
+	 * occasions.  The requests from startup process are handled inside
+	 * SIGUSR1 handler.  It is possible that more than one walreceiver
+	 * processes attempt to start nearly simultaneously.  If this is indeed a
+	 * possibility, one solution seems to not start walreceiver from the
+	 * signal handler but only set a flag to do so.
+	 */
 	Assert(walrcv->pid == 0);
 	switch (walrcv->walRcvState)
 	{
 		case WALRCV_STOPPING:
 			/* If we've already been requested to stop, don't start up. */
 			walrcv->walRcvState = WALRCV_STOPPED;
-			/* fall through */
-
-		case WALRCV_STOPPED:
 			SpinLockRelease(&walrcv->mutex);
 			proc_exit(1);
 			break;
 
+		case WALRCV_STOPPED:
+			/*
+			 * Postmaster, upon noticing that WAL receiver is not running,
+			 * starts us from ServerLoop.
+			 */
+			/* fall through */
 		case WALRCV_STARTING:
-			/* The usual case */
+			/* The usual case - startup process requests WAL streaming. */
 			break;
 
 		case WALRCV_WAITING:
@@ -350,6 +362,26 @@ WalReceiverMain(void)
 	ControlFile = ShmemInitStruct("Control File", sizeof(ControlFileData), &found);
 	Assert(found);
 
+	XLogSegNo startseg;
+	XLByteToSeg(startpoint, startseg, wal_segment_size);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	if (startpointTLI == ControlFile->lastFlushedSegTLI &&
+		startseg < ControlFile->lastFlushedSeg)
+	{
+		/*
+		 * Advance startpoint to the flush point in control file.  The
+		 * startpoint may be behind like this when WAL receiver is started by
+		 * postmaster upon noticing that an existing WAL receiver child
+		 * process exited.  Postmaster does not update WalRcv->startpoint,
+		 * similar to how it's done in RequestXLogStreaming, because it should
+		 * refrain from touching shared memory.
+		 */
+		XLogSegNoOffsetToRecPtr(
+			ControlFile->lastFlushedSeg, 0, wal_segment_size, startpoint);
+	}
+	LWLockRelease(ControlFileLock);
+
 	first_stream = true;
 	for (;;)
 	{
-- 
2.14.3 (Apple Git-98)

v1-0002-Start-WAL-receiver-before-startup-process-replays.patch
From edd12ca55a7454ccaaba5b7bf7d4a34a15ef4707 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Fri, 17 Jan 2020 18:14:43 +0530
Subject: [PATCH v1 2/3] Start WAL receiver before startup process replays
 existing WAL

If the WAL receiver is started only after the startup process finishes
replaying WAL already available in pg_wal, synchronous replication is
adversely impacted.  Consider a temporary network outage causing the
streaming replication connection to break.  This leads to exit of the
WAL receiver process.  If the startup process has fallen behind, it may
take a long time to finish replaying WAL before the walreceiver is
started again to re-establish streaming replication.  Commits on the
master will have to wait all this while for the standby to flush WAL up
to the commit LSN.

This experience can be alleviated if the replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to make this happen by starting the WAL receiver from the
postmaster's ServerLoop as well as from the startup process, even
before WAL replay begins.

The start point to request streaming from is remembered in pg_control.
Before creating a new WAL segment file, WAL receiver records the new WAL
segment number in pg_control.  If the WAL receiver process exits and
must restart, the recorded segment number is used to generate a start
point, that is the first offset in the segment file to re-establish
streaming replication.

Alternatives we thought of (but did not implement) for persisting the
starting point: (1) postgresql.auto.conf file, similar to how
primary_conninfo is remembered.  This option requires creating a new GUC
that represents the starting point.  Start point is never set by a user,
so using a GUC to represent it does not seem appropriate.  (2) introduce
a new flat file.  This incurs the overhead to maintain an additional
flat file.

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c          | 28 +++++++++++++
 src/backend/replication/walreceiver.c      | 67 ++++++++++++++++++++++++++++++
 src/backend/replication/walreceiverfuncs.c | 20 ++++++---
 src/bin/pg_controldata/pg_controldata.c    |  4 ++
 src/include/catalog/pg_control.h           |  7 ++++
 5 files changed, 121 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..a87bd78f96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6973,6 +6973,34 @@ StartupXLOG(void)
 			}
 		}
 
+		/*
+		 * Start WAL receiver without waiting for startup process to finish
+		 * replay, so that streaming replication is established at the
+		 * earliest.  When the replication is configured to be synchronous
+		 * this would unblock commits waiting for WAL to be written and/or
+		 * flushed by synchronous standby.
+		 */
+		if (StandbyModeRequested)
+		{
+			XLogRecPtr startpoint;
+			XLogSegNo startseg;
+			TimeLineID startpointTLI;
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			startseg = ControlFile->lastFlushedSeg;
+			startpointTLI = ControlFile->lastFlushedSegTLI;
+			LWLockRelease(ControlFileLock);
+			if (startpointTLI > 0)
+			{
+				elog(LOG, "found last flushed segment %lu on time line %d, starting WAL receiver",
+					 startseg, startpointTLI);
+				XLogSegNoOffsetToRecPtr(startseg, 0, wal_segment_size, startpoint);
+				RequestXLogStreaming(startpointTLI,
+									 startpoint,
+									 PrimaryConnInfo,
+									 PrimarySlotName);
+			}
+		}
+
 		/* Initialize resource managers */
 		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
 		{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a5e85d32f3..c862b65cae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
 #include "funcapi.h"
@@ -82,6 +83,8 @@ bool		hot_standby_feedback;
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
+static ControlFileData *ControlFile = NULL;
+
 #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
 
 /*
@@ -163,6 +166,45 @@ ProcessWalRcvInterrupts(void)
 }
 
 
+/*
+ * Persist startpoint to pg_control file.  This is used to start replication
+ * without waiting for startup process to let us know where to start streaming
+ * from.
+ */
+static void
+SaveStartPoint(XLogRecPtr startpoint, TimeLineID startpointTLI)
+{
+	XLogSegNo oldseg, startseg;
+	TimeLineID oldTLI;
+
+	XLByteToSeg(startpoint, startseg, wal_segment_size);
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+#ifdef USE_ASSERT_CHECKING
+	/*
+	 * On a given timeline, the WAL segment to start streaming from should
+	 * never move backwards.
+	 */
+	if (ControlFile->lastFlushedSegTLI == startpointTLI)
+		Assert(ControlFile->lastFlushedSeg <= startseg);
+#endif
+
+	oldseg = ControlFile->lastFlushedSeg;
+	oldTLI = ControlFile->lastFlushedSegTLI;
+	if (oldseg < startseg || oldTLI != startpointTLI)
+	{
+		ControlFile->lastFlushedSeg = startseg;
+		ControlFile->lastFlushedSegTLI = startpointTLI;
+		UpdateControlFile();
+		elog(DEBUG3,
+			 "lastFlushedSeg (seg, TLI) old: (%lu, %u), new: (%lu, %u)",
+			 oldseg, oldTLI, startseg, startpointTLI);
+	}
+
+	LWLockRelease(ControlFileLock);
+}
+
 /* Main entry point for walreceiver process */
 void
 WalReceiverMain(void)
@@ -304,6 +346,10 @@ WalReceiverMain(void)
 	if (sender_host)
 		pfree(sender_host);
 
+	bool found;
+	ControlFile = ShmemInitStruct("Control File", sizeof(ControlFileData), &found);
+	Assert(found);
+
 	first_stream = true;
 	for (;;)
 	{
@@ -1055,6 +1101,27 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
+			/*
+			 * When a WAL segment file is completely filled,
+			 * LogstreamResult.Flush points to the beginning of the new WAL
+			 * segment file that will be created shortly.  Before sending a
+			 * reply with a LSN from the new WAL segment for the first time,
+			 * remember the LSN in pg_control.  The LSN is used as the
+			 * startpoint to start streaming again if the WAL receiver process
+			 * exits and starts again.
+			 *
+			 * It is important to update the LSN's segment number in
+			 * pg_control before including it in a reply back to the WAL
+			 * sender.  Once WAL sender receives the flush LSN from standby
+			 * reply, any older WAL segments that do not contain the flush LSN
+			 * may be cleaned up.  If the WAL receiver dies after sending a
+			 * reply but before updating pg_control, it is possible that the
+			 * starting segment saved in pg_control is no longer available on
+			 * master when it attempts to resume streaming.
+			 */
+			if (XLogSegmentOffset(LogstreamResult.Flush, wal_segment_size) == 0)
+				SaveStartPoint(LogstreamResult.Flush, ThisTimeLineID);
+
 			XLogWalRcvSendReply(false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..955b8fcf83 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -239,10 +239,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -253,12 +249,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	else
 		walrcv->slotname[0] = '\0';
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in pg_xlog directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 19e21ab491..f98f36ffe5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -234,6 +234,10 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified:             %s\n"),
 		   pgctime_str);
+	printf(_("Latest flushed WAL segment number:    %lu\n"),
+		   ControlFile->lastFlushedSeg);
+	printf(_("Latest flushed TimeLineID:            %u\n"),
+		   ControlFile->lastFlushedSegTLI);
 	printf(_("Latest checkpoint location:           %X/%X\n"),
 		   (uint32) (ControlFile->checkPoint >> 32),
 		   (uint32) ControlFile->checkPoint);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..27260bbea5 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -143,6 +143,11 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * lastFlushedSeg is the WAL segment number of the most recently flushed
+	 * WAL file by walreceiver.  It is updated by walreceiver when a received
+	 * WAL record falls on a new WAL segment file.  This is used as the start
+	 * point to resume WAL streaming if it is stopped.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -165,6 +170,8 @@ typedef struct ControlFileData
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogSegNo	lastFlushedSeg;
+	TimeLineID	lastFlushedSegTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
-- 
2.14.3 (Apple Git-98)

#4Asim Praveen
pasim@vmware.com
In reply to: Asim R P (#3)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

I would like to revive this thread by submitting a rebased patch to start streaming replication without waiting for the startup process to finish replaying all WAL. The start LSN for streaming is determined to be the LSN that points to the beginning of the most recently flushed WAL segment.

The patch passes tests under src/test/recovery and top level “make check”.

Attachments:

v2-0001-Start-WAL-receiver-before-startup-process-replays.patchapplication/octet-stream; name=v2-0001-Start-WAL-receiver-before-startup-process-replays.patchDownload
From df131729f4c7e86f5b441d0861d7195515939855 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Sun, 9 Aug 2020 10:53:58 +0530
Subject: [PATCH v2] Start WAL receiver before startup process replays existing
 WAL

If the WAL receiver is started only after the startup process finishes
replaying WAL already available in pg_wal, synchronous replication is
impacted adversely.  Consider a temporary network outage causing the
streaming replication connection to break.  This leads to exit of the
WAL receiver process.  If the startup process has fallen behind, it may
take a long time to finish replaying WAL and then start walreceiver
again to re-establish streaming replication.  Commits on master will
have to wait all this while for the standby to flush WAL up to the
commit LSN.

This experience can be alleviated if the replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to do so by starting the WAL receiver as soon as a consistent
state is reached.

The start point to request streaming from is set to the beginning of the
most recently flushed WAL segment.  To determine this, the startup
process scans the first page of segments, starting from the segment
currently being read, one file at a time.

A new GUC, wal_receiver_start_condition, controls the new behavior.
When set to 'consistency', the new behavior takes effect.  The default
value is 'replay', which keeps the current behavior.

A TAP test is added to demonstrate the problem and validate the fix.

Discussion:
https://www.postgresql.org/message-id/CANXE4TewY1WNgu5J5ek38RD%2B2m9F2K%3DfgbWubjv9yG0BeyFxRQ%40mail.gmail.com
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c             | 119 ++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/replication/walreceiverfuncs.c    |  20 +-
 src/backend/utils/misc/guc.c                  |  17 ++
 src/include/replication/walreceiver.h         |   7 +
 src/test/recovery/t/018_replay_lag_syncrep.pl | 192 ++++++++++++++++++
 6 files changed, 345 insertions(+), 11 deletions(-)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 756b838e6a5..6192f5be347 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5642,6 +5642,98 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+static int
+XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)
+{
+	int fd;
+	XLogSegNo segno;
+	char xlogfname[MAXFNAMELEN];
+
+	XLByteToSeg(targetPagePtr, segno, wal_segment_size);
+	elog(DEBUG3, "reading first page of segment %lu", segno);
+	fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+	if (fd == -1)
+		return -1;
+
+	/* Seek to the beginning, we want to check if the first page is valid */
+	if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+	{
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		close(fd);
+		elog(ERROR, "could not seek XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	{
+		close(fd);
+		elog(ERROR, "could not read from XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	close(fd);
+	return XLOG_BLCKSZ;
+}
+
+/*
+ * Find the LSN that points to the beginning of the segment file most recently
+ * flushed by WAL receiver.  It will be used as start point by new instance of
+ * WAL receiver.
+ *
+ * The XLogReaderState abstraction is not suited for this purpose.  The
+ * interface it offers is XLogReadRecord, which is not suited to read a
+ * specific page from WAL.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogSegNo lastValidSegNo;
+	char readBuf[XLOG_BLCKSZ];
+
+	XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+	/*
+	 * We know that lsn falls in a valid segment.  Start searching from the
+	 * next segment.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+
+	elog(LOG, "scanning WAL for last valid segment, starting from %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+
+	while (XLogReadFirstPage(lsn, readBuf) == XLOG_BLCKSZ)
+	{
+		/*
+		 * Validate page header, it must be a long header because we are
+		 * inspecting the first page in a segment file.  The big if condition
+		 * is modelled according to XLogReaderValidatePageHeader.
+		 */
+		XLogLongPageHeader longhdr = (XLogLongPageHeader) readBuf;
+		if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+			(longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+			((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+			(longhdr->xlp_sysid != ControlFile->system_identifier) ||
+			(longhdr->xlp_seg_size != wal_segment_size) ||
+			(longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+			(longhdr->std.xlp_pageaddr != lsn) ||
+			(longhdr->std.xlp_tli != ThisTimeLineID))
+		{
+			break;
+		}
+		XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+		XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+	}
+
+	/*
+	 * The last valid segment number is previous to the one that was just
+	 * found to be invalid.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo, 0, wal_segment_size, lsn);
+
+	elog(LOG, "last valid segment number = %lu", lastValidSegNo);
+
+	return lsn;
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7205,6 +7297,27 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Start WAL receiver without waiting for startup process to
+				 * finish replay, so that streaming replication is established
+				 * at the earliest.  When the replication is configured to be
+				 * synchronous this would unblock commits waiting for WAL to
+				 * be written and/or flushed by synchronous standby.
+				 */
+				if (StandbyModeRequested &&
+					reachedConsistency &&
+					wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+					!WalRcvStreaming())
+				{
+					XLogRecPtr startpoint = GetLastLSN(record->xl_prev);
+					elog(LOG, "starting WAL receiver, startpoint %X/%X",
+						 (uint32) (startpoint >> 32), (uint32) startpoint);
+					RequestXLogStreaming(ThisTimeLineID,
+										 startpoint,
+										 PrimaryConnInfo,
+										 PrimarySlotName);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -12259,12 +12372,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a68..a1a144d7fd6 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int 		wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e6757573010..904327d8302 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -245,10 +245,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -271,12 +267,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 		walrcv->is_temp_slot = create_temp_slot;
 	}
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in pg_xlog directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef70..e7ce9a4e87a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -237,6 +237,12 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
  * NOTE! Option values may not contain double quotes!
  */
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry bytea_output_options[] = {
 	{"escape", BYTEA_OUTPUT_ESCAPE, false},
 	{"hex", BYTEA_OUTPUT_HEX, false},
@@ -4784,6 +4790,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee549..db5aeed74ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,10 +23,17 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int  wal_receiver_start_condition;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 00000000000..e82d8a0a64b
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,192 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+# Enable early WAL receiver startup
+$node_standby->append_conf('postgresql.conf',
+						   q[wal_receiver_start_condition = 'consistency']);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of recovery_min_apply_delay + 2 seconds.  The timeout should
+# not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+my $replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after WAL receiver restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication resumes in spite of replay lag');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# Like in previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with recovery_min_apply_delay + 2 second timeout.  The
+# timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+$replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after standby restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication starts in spite of replay lag');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.24.2 (Apple Git-127)

#5Michael Paquier
michael@paquier.xyz
In reply to: Asim Praveen (#4)
Re: Unnecessary delay in streaming replication due to replay lag

On Sun, Aug 09, 2020 at 05:54:32AM +0000, Asim Praveen wrote:

I would like to revive this thready by submitting a rebased patch to
start streaming replication without waiting for startup process to
finish replaying all WAL. The start LSN for streaming is determined
to be the LSN that points to the beginning of the most recently
flushed WAL segment.

The patch passes tests under src/test/recovery and top level “make check”.

I have not really looked at the proposed patch, but it would be good
to have some documentation.
--
Michael

#6Asim Praveen
pasim@vmware.com
In reply to: Michael Paquier (#5)
Re: Unnecessary delay in streaming replication due to replay lag

On 09-Aug-2020, at 2:11 PM, Michael Paquier <michael@paquier.xyz> wrote:

I have not really looked at the proposed patch, but it would be good
to have some documentation.

Ah, right. The basic idea is to reuse the logic that allows read-only connections (i.e. reaching a consistent state) to also start WAL streaming. The patch borrows a new GUC, “wal_receiver_start_condition”, introduced by another patch alluded to upthread. It controls when the WAL receiver process is started on a standby. By default the GUC is set to “replay”, which means no change from the current behavior: the WAL receiver is started only after replaying all WAL already available in pg_wal. When set to “consistency”, the WAL receiver process is started earlier, as soon as a consistent state is reached during WAL replay.

The LSN to start streaming from is determined as the LSN that points at the beginning of the WAL segment file most recently flushed in pg_wal. To find that segment, the first block of each WAL segment file in pg_wal is inspected, starting from the segment that contains the currently replayed record. The search stops at the first page without a valid header.

The benefits of starting the WAL receiver early are mentioned upthread, but allow me to reiterate: as soon as WAL streaming starts, any commits waiting for synchronous replication on the master are unblocked. The benefit is most apparent when significant replay lag has built up and the replication is configured to be synchronous.
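
As a usage sketch (illustrative only: the connection string and slot name below are placeholders, and because the GUC is PGC_POSTMASTER a server restart is needed for it to take effect), a standby would enable the new behavior with something like:

# standby's postgresql.conf
wal_receiver_start_condition = 'consistency'
primary_conninfo = 'host=primary.example.com user=replicator'
primary_slot_name = 'phys_slot'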

Asim

#7Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Asim Praveen (#4)
Re: Unnecessary delay in streaming replication due to replay lag

On Sun, 9 Aug 2020 at 14:54, Asim Praveen <pasim@vmware.com> wrote:

I would like to revive this thready by submitting a rebased patch to start streaming replication without waiting for startup process to finish replaying all WAL. The start LSN for streaming is determined to be the LSN that points to the beginning of the most recently flushed WAL segment.

The patch passes tests under src/test/recovery and top level “make check”.

The patch can be applied cleanly to the current HEAD but I got the
error on building the code with this patch:

xlog.c: In function 'StartupXLOG':
xlog.c:7315:6: error: too few arguments to function 'RequestXLogStreaming'
7315 | RequestXLogStreaming(ThisTimeLineID,
| ^~~~~~~~~~~~~~~~~~~~
In file included from xlog.c:59:
../../../../src/include/replication/walreceiver.h:463:13: note: declared here
463 | extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
| ^~~~~~~~~~~~~~~~~~~~

cfbot also complaints this.

Could you please update the patch?

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Asim Praveen
pasim@vmware.com
In reply to: Masahiko Sawada (#7)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

On 10-Aug-2020, at 12:27 PM, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:

The patch can be applied cleanly to the current HEAD but I got the
error on building the code with this patch:

xlog.c: In function 'StartupXLOG':
xlog.c:7315:6: error: too few arguments to function 'RequestXLogStreaming'
7315 | RequestXLogStreaming(ThisTimeLineID,
| ^~~~~~~~~~~~~~~~~~~~
In file included from xlog.c:59:
../../../../src/include/replication/walreceiver.h:463:13: note: declared here
463 | extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
| ^~~~~~~~~~~~~~~~~~~~

cfbot also complaints this.

Could you please update the patch?

Thank you for trying the patch, and apologies for the compiler error. I missed adding a hunk earlier; it should be fixed in the version attached here.

Attachments:

v3-0001-Start-WAL-receiver-before-startup-process-replays.patchapplication/octet-stream; name=v3-0001-Start-WAL-receiver-before-startup-process-replays.patchDownload
From ff8ff45693d4b938c6ba49342f92185be4427b91 Mon Sep 17 00:00:00 2001
From: Wu Hao <hawu@pivotal.io>
Date: Mon, 10 Aug 2020 14:19:54 +0530
Subject: [PATCH v3] Start WAL receiver before startup process replays existing
 WAL

If the WAL receiver is started only after the startup process finishes
replaying WAL already available in pg_wal, synchronous replication is
impacted adversely.  Consider a temporary network outage causing the
streaming replication connection to break.  This leads to exit of the
WAL receiver process.  If the startup process has fallen behind, it may
take a long time to finish replaying WAL and then start walreceiver
again to re-establish streaming replication.  Commits on master will
have to wait all this while for the standby to flush WAL up to the
commit LSN.

This experience can be alleviated if the replication connection is
re-established as soon as it is found to be disconnected.  The patch
attempts to do so by starting the WAL receiver as soon as a consistent
state is reached.

The start point to request streaming from is set to the beginning of the
most recently flushed WAL segment.  To determine this, the startup
process scans the first page of segments, starting from the segment
currently being read, one file at a time.

A new GUC, wal_receiver_start_condition, controls the new behavior.
When set to 'consistency', the new behavior takes effect.  The default
value is 'replay', which keeps the current behavior.

A TAP test is added to demonstrate the problem and validate the fix.

Discussion:
https://www.postgresql.org/message-id/CANXE4TewY1WNgu5J5ek38RD%2B2m9F2K%3DfgbWubjv9yG0BeyFxRQ%40mail.gmail.com
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

Co-authored-by: Asim R P <apraveen@pivotal.io>
---
 src/backend/access/transam/xlog.c             | 120 ++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/replication/walreceiverfuncs.c    |  20 +-
 src/backend/utils/misc/guc.c                  |  17 ++
 src/include/replication/walreceiver.h         |   7 +
 src/test/recovery/t/018_replay_lag_syncrep.pl | 192 ++++++++++++++++++
 6 files changed, 346 insertions(+), 11 deletions(-)
 create mode 100644 src/test/recovery/t/018_replay_lag_syncrep.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 756b838e6a5..a6f5d595ea8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5642,6 +5642,98 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 			(errmsg("archive recovery complete")));
 }
 
+static int
+XLogReadFirstPage(XLogRecPtr targetPagePtr, char *readBuf)
+{
+	int fd;
+	XLogSegNo segno;
+	char xlogfname[MAXFNAMELEN];
+
+	XLByteToSeg(targetPagePtr, segno, wal_segment_size);
+	elog(DEBUG3, "reading first page of segment %lu", segno);
+	fd = XLogFileReadAnyTLI(segno, LOG, XLOG_FROM_PG_WAL);
+	if (fd == -1)
+		return -1;
+
+	/* Seek to the beginning, we want to check if the first page is valid */
+	if (lseek(fd, (off_t) 0, SEEK_SET) < 0)
+	{
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		close(fd);
+		elog(ERROR, "could not seek XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	if (read(fd, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	{
+		close(fd);
+		elog(ERROR, "could not read from XLOG file %s, segment %lu: %m",
+			 xlogfname, segno);
+	}
+
+	close(fd);
+	return XLOG_BLCKSZ;
+}
+
+/*
+ * Find the LSN that points to the beginning of the segment file most recently
+ * flushed by WAL receiver.  It will be used as start point by new instance of
+ * WAL receiver.
+ *
+ * The XLogReaderState abstraction is not suited for this purpose.  The
+ * interface it offers is XLogReadRecord, which is not suited to read a
+ * specific page from WAL.
+ */
+static XLogRecPtr
+GetLastLSN(XLogRecPtr lsn)
+{
+	XLogSegNo lastValidSegNo;
+	char readBuf[XLOG_BLCKSZ];
+
+	XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+	/*
+	 * We know that lsn falls in a valid segment.  Start searching from the
+	 * next segment.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+
+	elog(LOG, "scanning WAL for last valid segment, starting from %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+
+	while (XLogReadFirstPage(lsn, readBuf) == XLOG_BLCKSZ)
+	{
+		/*
+		 * Validate page header, it must be a long header because we are
+		 * inspecting the first page in a segment file.  The big if condition
+		 * is modelled according to XLogReaderValidatePageHeader.
+		 */
+		XLogLongPageHeader longhdr = (XLogLongPageHeader) readBuf;
+		if ((longhdr->std.xlp_info & XLP_LONG_HEADER) == 0 ||
+			(longhdr->std.xlp_magic != XLOG_PAGE_MAGIC) ||
+			((longhdr->std.xlp_info & ~XLP_ALL_FLAGS) != 0) ||
+			(longhdr->xlp_sysid != ControlFile->system_identifier) ||
+			(longhdr->xlp_seg_size != wal_segment_size) ||
+			(longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ) ||
+			(longhdr->std.xlp_pageaddr != lsn) ||
+			(longhdr->std.xlp_tli != ThisTimeLineID))
+		{
+			break;
+		}
+		XLByteToSeg(lsn, lastValidSegNo, wal_segment_size);
+		XLogSegNoOffsetToRecPtr(lastValidSegNo+1, 0, wal_segment_size, lsn);
+	}
+
+	/*
+	 * The last valid segment number is previous to the one that was just
+	 * found to be invalid.
+	 */
+	XLogSegNoOffsetToRecPtr(lastValidSegNo, 0, wal_segment_size, lsn);
+
+	elog(LOG, "last valid segment number = %lu", lastValidSegNo);
+
+	return lsn;
+}
+
 /*
  * Extract timestamp from WAL record.
  *
@@ -7205,6 +7297,28 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Start WAL receiver without waiting for startup process to
+				 * finish replay, so that streaming replication is established
+				 * at the earliest.  When the replication is configured to be
+				 * synchronous this would unblock commits waiting for WAL to
+				 * be written and/or flushed by synchronous standby.
+				 */
+				if (StandbyModeRequested &&
+					reachedConsistency &&
+					wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+					!WalRcvStreaming())
+				{
+					XLogRecPtr startpoint = GetLastLSN(record->xl_prev);
+					elog(LOG, "starting WAL receiver, startpoint %X/%X",
+						 (uint32) (startpoint >> 32), (uint32) startpoint);
+					RequestXLogStreaming(ThisTimeLineID,
+										 startpoint,
+										 PrimaryConnInfo,
+										 PrimarySlotName,
+										 wal_receiver_create_temp_slot);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -12259,12 +12373,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a68..a1a144d7fd6 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int 		wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e6757573010..904327d8302 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -245,10 +245,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	SpinLockAcquire(&walrcv->mutex);
 
-	/* It better be stopped if we try to restart it */
-	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
-		   walrcv->walRcvState == WALRCV_WAITING);
-
 	if (conninfo != NULL)
 		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
 	else
@@ -271,12 +267,26 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 		walrcv->is_temp_slot = create_temp_slot;
 	}
 
+	/*
+	 * We used to assert that the WAL receiver is either in WALRCV_STOPPED or
+	 * in WALRCV_WAITING state.
+	 *
+	 * Such an assertion is not possible, now that this function is called by
+	 * startup process on two occasions.  One is just before starting to
+	 * replay WAL when starting up.  And the other is when it has finished
+	 * replaying all WAL in pg_xlog directory.  If the standby is starting up
+	 * after clean shutdown, there is not much WAL to be replayed and both
+	 * calls to this function can occur in quick succession.  By the time the
+	 * second request to start streaming is made, the WAL receiver can be in
+	 * any state.  We therefore cannot make any assertion on the state here.
+	 */
+
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
 		walrcv->walRcvState = WALRCV_STARTING;
 	}
-	else
+	else if (walrcv->walRcvState == WALRCV_WAITING)
 		walrcv->walRcvState = WALRCV_RESTARTING;
 	walrcv->startTime = now;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef70..e7ce9a4e87a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -237,6 +237,12 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
  * NOTE! Option values may not contain double quotes!
  */
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"catchup", WAL_RCV_START_AT_CATCHUP, true},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry bytea_output_options[] = {
 	{"escape", BYTEA_OUTPUT_ESCAPE, false},
 	{"hex", BYTEA_OUTPUT_HEX, false},
@@ -4784,6 +4790,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_CATCHUP,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c2d5dbee549..db5aeed74ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,10 +23,17 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_CATCHUP, /* start a WAL receiver  after replaying all WAL files */
+	WAL_RCV_START_AT_CONSISTENCY /* start a WAL receiver once consistency has been reached */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int  wal_receiver_start_condition;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
diff --git a/src/test/recovery/t/018_replay_lag_syncrep.pl b/src/test/recovery/t/018_replay_lag_syncrep.pl
new file mode 100644
index 00000000000..e82d8a0a64b
--- /dev/null
+++ b/src/test/recovery/t/018_replay_lag_syncrep.pl
@@ -0,0 +1,192 @@
+# Test impact of replay lag on synchronous replication.
+#
+# Replay lag is induced using recovery_min_apply_delay GUC.  Two ways
+# of breaking replication connection are covered - killing walsender
+# and restarting standby.  The test expects that replication
+# connection is restored without being affected due to replay lag.
+# This is validated by performing commits on master after replication
+# connection is disconnected and checking that they finish within a
+# few seconds.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Query checking sync_priority and sync_state of each standby
+my $check_sql =
+  "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Check that sync_state of a standby is expected (waiting till it is).
+# If $setting is given, synchronous_standby_names is set to it and
+# the configuration file is reloaded before the test.
+sub test_sync_state
+{
+	my ($self, $expected, $msg, $setting) = @_;
+
+	if (defined($setting))
+	{
+		$self->safe_psql('postgres',
+						 "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
+		$self->reload;
+	}
+
+	ok($self->poll_query_until('postgres', $check_sql, $expected), $msg);
+	return;
+}
+
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print("### Waiting for standby \"$standby_name\" on \"$master_name\"\n");
+	$master->poll_query_until('postgres', $query);
+	return;
+}
+
+# Initialize master node
+my $node_master = get_new_node('master');
+my @extra = (q[--wal-segsize], q[1]);
+$node_master->init(allows_streaming => 1, extra => \@extra);
+$node_master->start;
+my $backup_name = 'master_backup';
+
+# Setup physical replication slot for streaming replication
+$node_master->safe_psql('postgres',
+	q[SELECT pg_create_physical_replication_slot('phys_slot', true, false);]);
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby linking to master
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, $backup_name,
+								has_streaming => 1);
+$node_standby->append_conf('postgresql.conf',
+						   q[primary_slot_name = 'phys_slot']);
+# Enable debug logging in standby
+$node_standby->append_conf('postgresql.conf',
+						   q[log_min_messages = debug5]);
+# Enable early WAL receiver startup
+$node_standby->append_conf('postgresql.conf',
+						   q[wal_receiver_start_condition = 'consistency']);
+
+start_standby_and_wait($node_master, $node_standby);
+
+# Make standby synchronous
+test_sync_state(
+	$node_master,
+	qq(standby|1|sync),
+	'standby is synchronous',
+	'standby');
+
+# Slow down WAL replay by inducing 10 seconds sleep before replaying
+# a commit WAL record.
+$node_standby->safe_psql('postgres',
+						 'ALTER SYSTEM set recovery_min_apply_delay TO 10000;');
+$node_standby->reload;
+
+# Commit some transactions on master to induce replay lag in standby.
+$node_master->safe_psql('postgres', 'CREATE TABLE replay_lag_test(a int);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (101);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (102);');
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (103);');
+
+# Obtain WAL sender PID and kill it.
+my $walsender_pid = $node_master->safe_psql(
+	'postgres',
+	q[select active_pid from pg_get_replication_slots() where slot_name = 'phys_slot']);
+
+# Kill walsender, so that the replication connection breaks.
+kill 'SIGTERM', $walsender_pid;
+
+# The replication connection should be re-established much earlier than
+# what it takes to finish replay.  Try to commit a transaction with a
+# timeout of recovery_min_apply_delay + 2 seconds.  The timeout should
+# not be hit.
+my $timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (1);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after WAL receiver restart');
+
+my $replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after WAL receiver restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication resumes in spite of replay lag');
+
+# Break the replication connection by restarting standby.
+$node_standby->restart;
+
+# Like in previous test, the replication connection should be
+# re-established before pending WAL replay is finished.  Try to commit
+# a transaction with recovery_min_apply_delay + 2 second timeout.  The
+# timeout should not be hit.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (2);',
+	timeout => 12,
+	timed_out => \$timed_out);
+
+is($timed_out, 0, 'insert after standby restart');
+$replay_lag = $node_master->safe_psql(
+	'postgres',
+	'select flush_lsn - replay_lsn from pg_stat_replication');
+print("replay lag after standby restart: $replay_lag\n");
+ok($replay_lag > 0, 'replication starts in spite of replay lag');
+
+# Reset the delay so that the replay process is no longer slowed down.
+$node_standby->safe_psql('postgres', 'ALTER SYSTEM set recovery_min_apply_delay to 0;');
+$node_standby->reload;
+
+# Switch to a new WAL file and see if things work well.
+$node_master->safe_psql(
+	'postgres',
+	'select pg_switch_wal();');
+
+# Transactions should work fine on master.
+$timed_out = 0;
+$node_master->safe_psql(
+	'postgres',
+	'insert into replay_lag_test values (3);',
+	timeout => 1,
+	timed_out => \$timed_out);
+
+# Wait for standby to replay all WAL.
+$node_master->wait_for_catchup('standby', 'replay',
+							   $node_master->lsn('insert'));
+
+# Standby should also have identical content.
+my $count_sql = q[select count(*) from replay_lag_test;];
+my $expected = q[6];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected), 'standby query');
+
+# Test that promotion followed by query works.
+$node_standby->promote;
+$node_master->stop;
+$node_standby->safe_psql('postgres', 'insert into replay_lag_test values (4);');
+
+$expected = q[7];
+ok($node_standby->poll_query_until('postgres', $count_sql, $expected),
+   'standby query after promotion');
-- 
2.24.2 (Apple Git-127)

#9lchch1990@sina.cn
lchch1990@sina.cn
In reply to: Asim R P (#1)
Re: Unnecessary delay in streaming replication due to replay lag

Hello

I read the code and tested the patch; it runs well on my side, and I have a few questions about
the patch.

1. When calling RequestXLogStreaming() during replay, you pick the timeline directly from the
control file; do you think it should pick the timeline from the timeline history file instead?

2. In archive recovery mode, which will never switch to streaming mode, I think the current code
will still call RequestXLogStreaming(); that could be avoided.

3. I found two 018_xxxxx.pl tests when doing make check; maybe rename the new one?

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

#10Michael Paquier
michael@paquier.xyz
In reply to: lchch1990@sina.cn (#9)
Re: Unnecessary delay in streaming replication due to replay lag

On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1990@sina.cn wrote:

I read the code and test the patch, it run well on my side, and I have several issues on the
patch.

+                   RequestXLogStreaming(ThisTimeLineID,
+                                        startpoint,
+                                        PrimaryConnInfo,
+                                        PrimarySlotName,
+                                        wal_receiver_create_temp_slot);

This patch thinks that it is fine to request streaming even if
PrimaryConnInfo is not set, but that's not fine.

Anyway, I don't quite understand what you are trying to achieve here.
"startpoint" is used to request the beginning of streaming. It is
roughly the consistency LSN + some alpha with some checks on WAL
pages (those WAL page checks are not acceptable as they make
maintenance harder). What about the case where consistency is
reached but there are many segments still ahead that need to be
replayed? Your patch would cause streaming to begin too early, and
a manual copy of segments is not a rare thing as in some environments
a bulk copy of segments can make the catchup of a standby faster than
streaming.

It seems to me that what you are looking for here is some kind of
pre-processing before entering the redo loop to determine the LSN
that could be reused for the fast streaming start, which should match
the end of the WAL present locally. In short, you would need an
XLogReaderState that begins a scan of WAL from the redo point until it
cannot find anything more, and use the last LSN found as a base to
begin requesting streaming. The question of timeline jumps can also
be very tricky, but it could also be possible to not allow this option
if a timeline jump happens while attempting to guess the end of WAL
ahead of time. Another thing: could it be useful to have an extra
mode to begin streaming without waiting for consistency to finish?
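
A minimal sketch of that pre-processing idea (not taken from any posted patch; it assumes an
XLogReaderState has already been allocated with suitable page-read callbacks, and the reader
API signatures vary across releases) might look like this:

static XLogRecPtr
GuessEndOfLocalWAL(XLogReaderState *reader, XLogRecPtr redoPtr)
{
	char	   *errormsg;
	XLogRecPtr	lastLSN = redoPtr;

	/* Walk forward from the redo point until no further record can be read. */
	XLogBeginRead(reader, redoPtr);
	while (XLogReadRecord(reader, &errormsg) != NULL)
		lastLSN = reader->EndRecPtr;	/* end of the last readable record */

	/* Candidate startpoint to hand to RequestXLogStreaming(). */
	return lastLSN;
}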
--
Michael

#11Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Michael Paquier (#10)
Re: Unnecessary delay in streaming replication due to replay lag

On 20.11.2020 11:21, Michael Paquier wrote:

On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1990@sina.cn wrote:

I read the code and test the patch, it run well on my side, and I have several issues on the
patch.

+                   RequestXLogStreaming(ThisTimeLineID,
+                                        startpoint,
+                                        PrimaryConnInfo,
+                                        PrimarySlotName,
+                                        wal_receiver_create_temp_slot);

This patch thinks that it is fine to request streaming even if
PrimaryConnInfo is not set, but that's not fine.

Anyway, I don't quite understand what you are trying to achieve here.
"startpoint" is used to request the beginning of streaming. It is
roughly the consistency LSN + some alpha with some checks on WAL
pages (those WAL page checks are not acceptable as they make
maintenance harder). What about the case where consistency is
reached but there are many segments still ahead that need to be
replayed? Your patch would cause streaming to begin too early, and
a manual copy of segments is not a rare thing as in some environments
a bulk copy of segments can make the catchup of a standby faster than
streaming.

It seems to me that what you are looking for here is some kind of
pre-processing before entering the redo loop to determine the LSN
that could be reused for the fast streaming start, which should match
the end of the WAL present locally. In short, you would need a
XLogReaderState that begins a scan of WAL from the redo point until it
cannot find anything more, and use the last LSN found as a base to
begin requesting streaming. The question of timeline jumps can also
be very tricky, but it could also be possible to not allow this option
if a timeline jump happens while attempting to guess the end of WAL
ahead of time. Another thing: could it be useful to have an extra
mode to begin streaming without waiting for consistency to finish?
--
Michael

Status update for a commitfest entry.

This entry was "Waiting On Author" during this CF, so I've marked it as
returned with feedback. Feel free to resubmit an updated version to a
future commitfest.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Anastasia Lubennikova (#11)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hello,

Ashwin and I recently got a chance to work on this and we addressed all
outstanding feedback and suggestions. PFA a significantly reworked patch.

On 20.11.2020 11:21, Michael Paquier wrote:

This patch thinks that it is fine to request streaming even if
PrimaryConnInfo is not set, but that's not fine.

We introduced a check to ensure that PrimaryConnInfo is set up before we
request the WAL stream eagerly.

Anyway, I don't quite understand what you are trying to achieve here.
"startpoint" is used to request the beginning of streaming. It is
roughly the consistency LSN + some alpha with some checks on WAL
pages (those WAL page checks are not acceptable as they make
maintenance harder). What about the case where consistency is
reached but there are many segments still ahead that need to be
replayed? Your patch would cause streaming to begin too early, and
a manual copy of segments is not a rare thing as in some environments
a bulk copy of segments can make the catchup of a standby faster than
streaming.

It seems to me that what you are looking for here is some kind of
pre-processing before entering the redo loop to determine the LSN
that could be reused for the fast streaming start, which should match
the end of the WAL present locally. In short, you would need a
XLogReaderState that begins a scan of WAL from the redo point until it
cannot find anything more, and use the last LSN found as a base to
begin requesting streaming. The question of timeline jumps can also
be very tricky, but it could also be possible to not allow this option
if a timeline jump happens while attempting to guess the end of WAL
ahead of time. Another thing: could it be useful to have an extra
mode to begin streaming without waiting for consistency to finish?

1. When wal_receiver_start_condition='consistency', we feel that the
stream start point calculation should be done only when we reach
consistency. Imagine the situation where consistency is reached 2 hours
after start, and within that 2 hours a lot of WAL has been manually
copied over into the standby's pg_wal. If we pre-calculated the stream
start location before we entered the main redo apply loop, we would be
starting the stream from a much earlier location (behind the two hours'
worth of WAL that was copied in), leading to wasted work.

2. We have significantly changed the code that calculates the WAL stream
start location. We now traverse pg_wal, find the latest valid WAL
segment, and start the stream from that segment's start. This is much
more performant than reading from the beginning of the locally available
WAL. (A rough sketch of this traversal appears after this list.)

3. To perform the validation check, we no longer have duplicate code -
as we can now rely on the XLogReaderState(), XLogReaderValidatePageHeader()
and friends.

4. We have an extra mode: wal_receiver_start_condition='startup', which
will start the WAL receiver before the startup process reaches
consistency. We don't fully understand the utility of having 'startup' over
'consistency' though.

5. During the traversal of pg_wal, if we find WAL segments on differing
timelines, we bail out and abandon attempting to start the WAL stream
eagerly.

6. To handle the cases where a lot of WAL is copied over after the
WAL receiver has started at consistency:
i) Don't recommend wal_receiver_start_condition='startup|consistency'.

ii) Copy over the WAL files and then start the standby, so that the WAL
stream starts from a fresher point.

iii) Have an LSN/segment# target to start the WAL receiver from?

7. We have significantly changed the test. It is much more simplified
and focused.

8. We did not test wal_receiver_start_condition='startup' in the test.
It's actually hard to assert that the walreceiver has started at
startup. recovery_min_apply_delay only kicks in once we reach
consistency, and thus there is no way I could think of to reliably halt
the startup process and check: "Has the wal receiver started even
though the standby hasn't reached consistency?" The only way we could
think of is to generate a large workload during the course of the backup
so that the standby has significant WAL to replay before it reaches
consistency. But that would make the test flaky, as we would have no
precise wait condition. That said, we felt that checking
for 'consistency' is enough as it covers the majority of the added
code.

9. We added a documentation section describing the GUC.
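
As mentioned in point 2, here is a rough sketch of the pg_wal traversal (hedged:
GuessStreamingStartPoint is a made-up name, and the first-page validation plus the backwards
walk over invalid segments done by the actual patch are elided here):

static XLogRecPtr
GuessStreamingStartPoint(void)
{
	DIR		   *dir = AllocateDir(XLOGDIR);
	struct dirent *de;
	XLogSegNo	latestSegNo = 0;
	XLogRecPtr	startpoint = InvalidXLogRecPtr;

	/* Remember the highest WAL segment number present in pg_wal. */
	while ((de = ReadDir(dir, XLOGDIR)) != NULL)
	{
		TimeLineID	tli;
		XLogSegNo	segno;

		if (!IsXLogFileName(de->d_name))
			continue;
		XLogFromFileName(de->d_name, &tli, &segno, wal_segment_size);
		if (segno > latestSegNo)
			latestSegNo = segno;
	}
	FreeDir(dir);

	/* Stream from the first byte of the latest segment found (if any). */
	if (latestSegNo > 0)
		XLogSegNoOffsetToRecPtr(latestSegNo, 0, wal_segment_size, startpoint);

	return startpoint;
}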

Regards,
Ashwin and Soumyadeep (VMware)

Attachments:

v1-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From b3703fb16a352bd9166ed75de7b68599c735ac63 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Date: Fri, 30 Jul 2021 18:21:55 -0700
Subject: [PATCH v1 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal has been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the server will start after the standby has replayed up
to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                     |  33 +++++
 src/backend/access/transam/xlog.c            | 141 +++++++++++++++++--
 src/backend/replication/walreceiver.c        |   1 +
 src/backend/utils/misc/guc.c                 |  18 +++
 src/include/replication/walreceiver.h        |  10 ++
 src/test/recovery/t/026_walreceiver_start.pl |  96 +++++++++++++
 6 files changed, 291 insertions(+), 8 deletions(-)
 create mode 100644 src/test/recovery/t/026_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2c31c35a6b1..91ab13e54d2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4602,6 +4602,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_condition">
+      <term><varname>wal_receiver_start_condition</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_condition</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_condition</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed)
+         The default setting is<literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_condition</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
       <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03ec..698ccd4072c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -975,6 +975,7 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerly(void);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1545,6 +1546,117 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning.
+ */
+static void
+StartWALReceiverEagerly()
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	Assert(wal_receiver_start_condition <= WAL_RCV_START_AT_CONSISTENCY);
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			int			logSegNo;
+			int			tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != ThisTimeLineID)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("Could not start streaming WAL eagerly"),
+						 errdetail("There are timeline changes in the locally available WAL files."),
+						 errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in our pg_wal, for the
+	 * standby to have started.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState *state;
+		XLogRecPtr	startptr;
+		WALReadError errinfo;
+		char		xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, ThisTimeLineID, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory"),
+					 errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 ThisTimeLineID,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("Requesting stream from beginning of: %s",
+						   xlogfname));
+			RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
+								 PrimarySlotName, wal_receiver_create_temp_slot);
+			XLogReaderFree(state);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
+					   xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here. We should have at least one valid WAL
+	 * segment in our pg_wal, for the standby to have started.
+	 */
+	Assert(false);
+}
+
 /*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
@@ -7053,6 +7165,14 @@ StartupXLOG(void)
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/*
+	 * Start WAL receiver eagerly if requested.
+	 */
+	if (StandbyModeRequested && !WalRcvStreaming() &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
+		StartWALReceiverEagerly();
+
 	/*
 	 * Clear out any old relcache cache files.  This is *necessary* if we do
 	 * any WAL replay, since that would probably result in the cache files
@@ -8403,6 +8523,15 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/*
+	 * Start WAL receiver eagerly if requested and upon reaching a consistent
+	 * state.
+	 */
+	if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+		StartWALReceiverEagerly();
 }
 
 /*
@@ -12630,10 +12759,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12743,12 +12874,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c2..feb98add01b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 467b0fd6fe7..7d55a83bdb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -556,6 +556,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -4970,6 +4977,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777b..a0d14791f13 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,9 +24,19 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
+extern int	wal_receiver_start_condition;
 extern bool hot_standby_feedback;
 
 /*
diff --git a/src/test/recovery/t/026_walreceiver_start.pl b/src/test/recovery/t/026_walreceiver_start.pl
new file mode 100644
index 00000000000..f031ec2122f
--- /dev/null
+++ b/src/test/recovery/t/026_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_condition = 'consistency'
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgresNode->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgresNode->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_condition = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.25.1

#13Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Soumyadeep Chakraborty (#12)
Re: Unnecessary delay in streaming replication due to replay lag

Rebased and added a CF entry for Nov CF:
https://commitfest.postgresql.org/35/3376/.

#14Michael Paquier
michael@paquier.xyz
In reply to: Soumyadeep Chakraborty (#12)
Re: Unnecessary delay in streaming replication due to replay lag

On Tue, Aug 24, 2021 at 09:51:25PM -0700, Soumyadeep Chakraborty wrote:

Ashwin and I recently got a chance to work on this and we addressed all
outstanding feedback and suggestions. PFA a significantly reworked patch.

+static void
+StartWALReceiverEagerly()
+{
The patch fails to apply because of the recent changes from Robert to
eliminate ThisTimeLineID. The correct thing to do would be to add one
TimeLineID argument, passing down the local ThisTimeLineID in
StartupXLOG() and using XLogCtl->lastReplayedTLI in
CheckRecoveryConsistency().

+	/*
+	 * We should never reach here. We should have at least one valid WAL
+	 * segment in our pg_wal, for the standby to have started.
+	 */
+	Assert(false);
The reason behind that is not that we have a standby, but that we read
at least the segment that included the checkpoint record we are
replaying from (it is possible for a standby to start without any
contents in pg_wal/ as long as recovery is configured), and because
StartWALReceiverEagerly() is called just after that.

It would be better to make sure that StartWALReceiverEagerly() gets
only called from the startup process, perhaps?

+	RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
+			     PrimarySlotName, wal_receiver_create_temp_slot);
+	XLogReaderFree(state);
XLogReaderFree() should happen before RequestXLogStreaming().  The
tipping point of the patch is here, where the WAL receiver is started
based on the location of the first valid WAL record found.

wal_receiver_start_condition is missing in postgresql.conf.sample.

+	/*
+	 * Start WAL receiver eagerly if requested.
+	 */
+	if (StandbyModeRequested && !WalRcvStreaming() &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
+		StartWALReceiverEagerly();
[...]
+	if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+		StartWALReceiverEagerly();
This repeats two times the same set of conditions, which does not look
like a good idea to me.  I think that you'd better add an extra
argument to StartWALReceiverEagerly to track the start timing expected
in this code path, that will be matched with the GUC in the routine.
It would be better to document the reasons behind each check done, as
well.
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
[ ... ]
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
[...]
+		XLogReaderFree(state);
+		endsegno--;
+	}
So, this reads the contents of pg_wal/ for any files that exist, then
goes down to the first segment found with a valid beginning.  That's
going to be expensive with a large max_wal_size.  When searching for a
point like that, a dichotomy method would be better to calculate a LSN
you'd like to start from.  Anyway, I think that there is a problem
with the approach: what should we do if there are holes in the
segments present in pg_wal/?  As of HEAD, or
wal_receiver_start_condition = 'exhaust' in this patch, we would
switch across local pg_wal/, archive and stream in a linear way,
thanks to WaitForWALToBecomeAvailable().  For example, imagine that we
have a standby with the following set of valid segments, because of
the buggy way a base backup has been taken:
000000010000000000000001
000000010000000000000003
000000010000000000000005
What the patch would do is start a WAL receiver from segment 5,
which is in contradiction with the existing logic where we should try
to look for the segment once we are waiting for something in segment
2.  This would be dangerous once the startup process waits for some
WAL to become available, because we have a WAL receiver started, but
we cannot fetch the segment we need.  Perhaps a deployment has
archiving, in which case it would be able to grab segment 2 (if no
archiving, recovery would not be able to move on, so that would be
game over).
         /*
          * Move to XLOG_FROM_STREAM state, and set to start a
-         * walreceiver if necessary.
+         * walreceiver if necessary. The WAL receiver may have
+         * already started (if it was configured to start
+         * eagerly).
          */
         currentSource = XLOG_FROM_STREAM;
-        startWalReceiver = true;
+        startWalReceiver = !WalRcvStreaming();
         break;
     case XLOG_FROM_ARCHIVE:
     case XLOG_FROM_PG_WAL:

- /*
- * WAL receiver must not be running when reading WAL from
- * archive or pg_wal.
- */
- Assert(!WalRcvStreaming());

These parts should IMO not be changed. They are strong assumptions we
rely on in the startup process, and this comes down to the fact that
it is not a good idea to have a WAL receiver running while
currentSource could be pointing at a completely different WAL source.
That's going to bring a lot of racy conditions, I am afraid, as we
rely on currentSource a lot during recovery, in combination with the
expectation that the code can retrieve WAL in a linear fashion from

So, I think that deciding if a WAL receiver should be started blindly
outside of the code path deciding if the startup process is waiting
for some WAL is not a good idea, and the position we may begin to
stream from may be something that we may have zero need for at the
end (this is going to be tricky if we detect a TLI jump while
replaying the local WAL, also?). The issue is that I am not sure what
a good design for that should be. We have no idea when the startup
process will need WAL from a different source until replay comes
around, but what we want here is to anticipate this LSN :)

I am wondering if there should be a way to work out something with the
control file, but things can get very fancy with HA
and base backup deployments and the various cases we support thanks to
the current way recovery works, as well. We could also go simpler and
rework the priority order if both archiving and streaming are options
wanted by the user.
--
Michael

#15Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Michael Paquier (#14)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hi Michael,

Thanks for the detailed review! Attached is a rebased patch that addresses
most of the feedback.

On Mon, Nov 8, 2021 at 1:41 AM Michael Paquier <michael@paquier.xyz> wrote:

+static void
+StartWALReceiverEagerly()
+{
The patch fails to apply because of the recent changes from Robert to
eliminate ThisTimeLineID. The correct thing to do would be to add one
TimeLineID argument, passing down the local ThisTimeLineID in
StartupXLOG() and using XLogCtl->lastReplayedTLI in
CheckRecoveryConsistency().

Rebased.

+       /*
+        * We should never reach here. We should have at least one valid WAL
+        * segment in our pg_wal, for the standby to have started.
+        */
+       Assert(false);
The reason behind that is not that we have a standby, but that we read
at least the segment that included the checkpoint record we are
replaying from (it is possible for a standby to start without any
contents in pg_wal/ as long as recovery is configured), and because
StartWALReceiverEagerly() is called just after that.

Fair, comment updated.

It would be better to make sure that StartWALReceiverEagerly() gets
only called from the startup process, perhaps?

Added Assert(AmStartupProcess()) at the beginning of
StartWALReceiverEagerly().

+       RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
+                            PrimarySlotName, wal_receiver_create_temp_slot);
+       XLogReaderFree(state);
XLogReaderFree() should happen before RequestXLogStreaming(). The
tipping point of the patch is here, where the WAL receiver is started
based on the location of the first valid WAL record found.

Done.

wal_receiver_start_condition is missing in postgresql.conf.sample.

Fixed.

+       /*
+        * Start WAL receiver eagerly if requested.
+        */
+       if (StandbyModeRequested && !WalRcvStreaming() &&
+               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+               wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
+               StartWALReceiverEagerly();
[...]
+       if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
+               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+               wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+               StartWALReceiverEagerly();
This repeats two times the same set of conditions, which does not look
like a good idea to me. I think that you'd better add an extra
argument to StartWALReceiverEagerly to track the start timing expected
in this code path, that will be matched with the GUC in the routine.
It would be better to document the reasons behind each check done, as
well.

Done.

So, this reads the contents of pg_wal/ for any files that exist, then
goes down to the first segment found with a valid beginning. That's
going to be expensive with a large max_wal_size. When searching for a
point like that, a dichotomy method would be better to calculate a LSN
you'd like to start from.

Even if there is a large max_wal_size, do we expect that there will be
a lot of invalid high-numbered WAL files? If that is not the case, most
of the time we would be looking at the last 1 or 2 WAL files to
determine the start point, making it efficient?

Anyway, I think that there is a problem
with the approach: what should we do if there are holes in the
segments present in pg_wal/? As of HEAD, or
wal_receiver_start_condition = 'exhaust' in this patch, we would
switch across local pg_wal/, archive and stream in a linear way,
thanks to WaitForWALToBecomeAvailable(). For example, imagine that we
have a standby with the following set of valid segments, because of
the buggy way a base backup has been taken:
000000010000000000000001
000000010000000000000003
000000010000000000000005
What the patch would do is start a WAL receiver from segment 5,
which is in contradiction with the existing logic where we should try
to look for the segment once we are waiting for something in segment
2. This would be dangerous once the startup process waits for some
WAL to become available, because we have a WAL receiver started, but
we cannot fetch the segment we need. Perhaps a deployment has
archiving, in which case it would be able to grab segment 2 (if no
archiving, recovery would not be able to move on, so that would be
game over).

We could easily check for holes while we are doing the ReadDir() and
bail from the early start if there are holes, just like we do if there
is a timeline jump in any of the WAL segments.
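
As a rough sketch (purely illustrative, reusing the variable names from the
attached patch's ReadDir() loop; this is not necessarily what the next
version will look like), the hole check could count the segments seen on the
current timeline and compare that count against the range spanned by the
lowest and highest segment numbers:

    /* hypothetical addition inside StartWALReceiverEagerlyIfPossible() */
    int         nsegs = 0;      /* WAL segments found on the current timeline */

    while ((de = ReadDir(dir, "pg_wal")) != NULL)
    {
        if (IsXLogFileName(de->d_name))
        {
            /* ... existing timeline check and startsegno/endsegno updates ... */
            nsegs++;
        }
    }
    FreeDir(dir);

    /*
     * Seeing fewer segments than the range [startsegno, endsegno] implies
     * means there is a hole in pg_wal: bail out of the eager start and fall
     * back to the usual exhaust-then-stream behaviour.
     */
    if ((XLogSegNo) nsegs != endsegno - startsegno + 1)
    {
        ereport(LOG,
                (errmsg("could not start streaming WAL eagerly"),
                 errdetail("There are gaps in the locally available WAL files.")));
        return;
    }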

/*
* Move to XLOG_FROM_STREAM state, and set to start a
-         * walreceiver if necessary.
+         * walreceiver if necessary. The WAL receiver may have
+         * already started (if it was configured to start
+         * eagerly).
*/
currentSource = XLOG_FROM_STREAM;
-        startWalReceiver = true;
+        startWalReceiver = !WalRcvStreaming();
break;
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:

- /*
- * WAL receiver must not be running when reading WAL from
- * archive or pg_wal.
- */
- Assert(!WalRcvStreaming());

These parts should IMO not be changed. They are strong assumptions we
rely on in the startup process, and this comes down to the fact that
it is not a good idea to have a WAL receiver running while
currentSource could be pointing at a completely different WAL source.
That's going to bring a lot of racy conditions, I am afraid, as we
rely on currentSource a lot during recovery, in combination with the
expectation that the code can retrieve WAL in a linear fashion from
the LSN position that recovery is looking for.

So, I think that deciding if a WAL receiver should be started blindly
outside of the code path deciding if the startup process is waiting
for some WAL is not a good idea, and the position we may begin to
stream from may be something that we may have zero need for at the
end (this is going to be tricky if we detect a TLI jump while
replaying the local WAL, also?). The issue is that I am not sure what
a good design for that should be. We have no idea when the startup
process will need WAL from a different source until replay comes
around, but what we want here is to anticipate this LSN :)

Can you elaborate on the race conditions that you are thinking about?
Do the race conditions manifest only when we mix archiving and streaming?
If yes, how do you feel about making the GUC a no-op with a WARNING
while we are in WAL archiving mode?

I am wondering if there should be a way to work out something with the
control file, but things can get very fancy with HA
and base backup deployments and the various cases we support thanks to
the current way recovery works, as well. We could also go simpler and
rework the priority order if both archiving and streaming are options
wanted by the user.

Agreed, it would be much better to depend on the state in pg_wal,
namely the files that are available there.

Reworking the priority order seems like an appealing fix - if we can say
streaming > archiving in terms of priority, then the race that you are
referring to will not happen?

Also, what are some use cases where one would give priority to streaming
replication over archive recovery, if both sources have the same WAL
segments?

Regards,
Ashwin & Soumyadeep

Attachments:

v3-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From 7e301866e0468ed8def5b34df2d4570d6678baf3 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Date: Fri, 30 Jul 2021 18:21:55 -0700
Subject: [PATCH v3 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start it's WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the server will start after the standby has replayed up
to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlog.c             | 170 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc.c                  |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +-
 src/include/replication/walreceiver.h         |  10 ++
 src/test/recovery/t/027_walreceiver_start.pl  |  96 ++++++++++
 7 files changed, 321 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/027_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5d..b911615bf04 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4604,6 +4604,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_condition">
+      <term><varname>wal_receiver_start_condition</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_condition</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_condition</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed)
+         The default setting is<literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_condition</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
       <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f8..bcd5e84fb65 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -989,6 +989,8 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+											  TimeLineID currentTLI);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1544,6 +1546,152 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_condition == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_condition)
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+		!reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			int			logSegNo;
+			int			tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("Could not start streaming WAL eagerly"),
+						 errdetail("There are timeline changes in the locally available WAL files."),
+						 errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState *state;
+		XLogRecPtr	startptr;
+		WALReadError errinfo;
+		char		xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory"),
+					 errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("Requesting stream from beginning of: %s",
+						   xlogfname));
+			XLogReaderFree(state);
+			RequestXLogStreaming(currentTLI, startptr, PrimaryConnInfo,
+								 PrimarySlotName, wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
+					   xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
@@ -7061,6 +7209,9 @@ StartupXLOG(void)
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, ThisTimeLineID);
+
 	/*
 	 * Clear out any old relcache cache files.  This is *necessary* if we do
 	 * any WAL replay, since that would probably result in the cache files
@@ -8246,7 +8397,8 @@ StartupXLOG(void)
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
- * that it can start accepting read-only connections.
+ * that it can start accepting read-only connections. Also, attempt to start
+ * the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -8337,6 +8489,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  XLogCtl->lastReplayedTLI);
 }
 
 /*
@@ -12731,10 +12887,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12844,12 +13002,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e7..ad623a47f43 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfda..598f89c841b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -557,6 +557,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -5007,6 +5014,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1cbc9feeb6f..faca3156afb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -343,7 +343,8 @@
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
-
+#wal_receiver_start_condition = 'exhaust'#	'exhaust', 'consistency', or 'startup'
+					# (change requires restart)
 # - Subscribers -
 
 # These settings are ignored on a publisher.
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777b..a0d14791f13 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,9 +24,19 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
+extern int	wal_receiver_start_condition;
 extern bool hot_standby_feedback;
 
 /*
diff --git a/src/test/recovery/t/027_walreceiver_start.pl b/src/test/recovery/t/027_walreceiver_start.pl
new file mode 100644
index 00000000000..991d9bb6658
--- /dev/null
+++ b/src/test/recovery/t/027_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_condition = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_condition = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.25.1

#16Daniel Gustafsson
daniel@yesql.se
In reply to: Soumyadeep Chakraborty (#15)
Re: Unnecessary delay in streaming replication due to replay lag

On 10 Nov 2021, at 00:41, Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:

Thanks for the detailed review! Attached is a rebased patch that addresses
most of the feedback.

This patch no longer applies after e997a0c64 and associated follow-up commits;
please submit a rebased version.

--
Daniel Gustafsson https://vmware.com/

#17Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Daniel Gustafsson (#16)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hi Daniel,

Thanks for checking in on this patch.
Attached rebased version.

Regards,
Soumyadeep (VMware)

Attachments:

v4-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From 51149e4f877dc2f8bf47a1356fc8b0ec2512ac6d Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Date: Fri, 19 Nov 2021 00:33:17 -0800
Subject: [PATCH v4 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start it's WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the server will start after the standby has replayed up
to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlog.c             | 170 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc.c                  |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +-
 src/include/replication/walreceiver.h         |  10 ++
 src/test/recovery/t/027_walreceiver_start.pl  |  96 ++++++++++
 7 files changed, 321 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/027_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5d..b911615bf04 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4604,6 +4604,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_condition">
+      <term><varname>wal_receiver_start_condition</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_condition</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_condition</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed)
+         The default setting is<literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_condition</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
       <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 16164483688..9c1243a2f85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -983,6 +983,8 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+											  TimeLineID currentTLI);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1538,6 +1540,152 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_condition == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_condition)
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY &&
+		!reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			int			logSegNo;
+			int			tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("Could not start streaming WAL eagerly"),
+						 errdetail("There are timeline changes in the locally available WAL files."),
+						 errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState *state;
+		XLogRecPtr	startptr;
+		WALReadError errinfo;
+		char		xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory"),
+					 errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("Requesting stream from beginning of: %s",
+						   xlogfname));
+			XLogReaderFree(state);
+			RequestXLogStreaming(currentTLI, startptr, PrimaryConnInfo,
+								 PrimarySlotName, wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
+					   xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at least the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
@@ -7056,6 +7204,9 @@ StartupXLOG(void)
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, replayTLI);
+
 	/*
 	 * Clear out any old relcache cache files.  This is *necessary* if we do
 	 * any WAL replay, since that would probably result in the cache files
@@ -8241,7 +8392,8 @@ StartupXLOG(void)
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
- * that it can start accepting read-only connections.
+ * that it can start accepting read-only connections. Also, attempt to start
+ * the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -8332,6 +8484,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  XLogCtl->lastReplayedTLI);
 }
 
 /*
@@ -12716,10 +12872,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12829,12 +12987,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e7..ad623a47f43 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_condition;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfda..598f89c841b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -557,6 +557,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -5007,6 +5014,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1cbc9feeb6f..faca3156afb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -343,7 +343,8 @@
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
-
+#wal_receiver_start_condition = 'exhaust'	# 'exhaust', 'consistency', or 'startup'
+					# (change requires restart)
 # - Subscribers -
 
 # These settings are ignored on a publisher.
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777b..a0d14791f13 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,9 +24,19 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
+extern int	wal_receiver_start_condition;
 extern bool hot_standby_feedback;
 
 /*
diff --git a/src/test/recovery/t/027_walreceiver_start.pl b/src/test/recovery/t/027_walreceiver_start.pl
new file mode 100644
index 00000000000..991d9bb6658
--- /dev/null
+++ b/src/test/recovery/t/027_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_condition = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_condition = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.25.1

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Soumyadeep Chakraborty (#17)
Re: Unnecessary delay in streaming replication due to replay lag

On Fri, Nov 19, 2021 at 2:05 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

Hi Daniel,

Thanks for checking in on this patch.
Attached rebased version.

Hi, I've not gone through the patch or this thread entirely yet. Can
you please confirm if there's any relation between this thread and
another one at [1]?

[1]: /messages/by-id/CAFiTN-vzbcSM_qZ+-mhS3OWecxupDCR5DkhQUTy+TKfrCMQLKQ@mail.gmail.com

#19Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Bharath Rupireddy (#18)
Re: Unnecessary delay in streaming replication due to replay lag

Hi Bharath,

Yes, that thread has been discussed here. Asim had x-posted the patch to
[1]. This thread was more recent when Ashwin and I picked up the patch in
Aug 2021, so we continued here.
The patch has been significantly updated by us, addressing Michael's long
outstanding feedback.

Regards,
Soumyadeep (VMware)

[1]: /messages/by-id/CANXE4TeinQdw+M2Or0kTR24eRgWCOg479N8=gRvj9Ouki-tZFg@mail.gmail.com

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Soumyadeep Chakraborty (#19)
Re: Unnecessary delay in streaming replication due to replay lag

On Tue, Nov 23, 2021 at 1:39 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:

Hi Bharath,

Yes, that thread has been discussed here. Asim had x-posted the patch to [1]. This thread
was more recent when Ashwin and I picked up the patch in Aug 2021, so we continued here.
The patch has been significantly updated by us, addressing Michael's long outstanding feedback.

Thanks for the patch. I reviewed it a bit; here are some comments:

1) A memory leak: add FreeDir(dir); before returning.
+ ereport(LOG,
+ (errmsg("Could not start streaming WAL eagerly"),
+ errdetail("There are timeline changes in the locally available WAL files."),
+ errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+ return;
+ }
2) Is there a guarantee that, while we traverse the pg_wal directory to
find startsegno and endsegno, no new WAL files arrive from the
primary or archive location and no old WAL files get removed/recycled by
the standby? Especially when wal_receiver_start_condition=consistency?
+ startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+ endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+ }
3) I think the errmsg text format isn't correct. Note that the errmsg
text starts with lowercase and doesn't end with "." whereas errdetail
or errhint starts with uppercase and ends with ".". Please check other
messages for reference.
The following should be changed.
+ errmsg("Requesting stream from beginning of: %s",
+ errmsg("Invalid WAL segment found while calculating stream start:
%s. Skipping.",
+ (errmsg("Could not start streaming WAL eagerly"),

4) I think you also need to have wal files names in double quotes,
something like below:
errmsg("could not close file \"%s\": %m", xlogfname)));

5) It should be ".....stream start: \"%s\", skipping..",
+ errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",

4) I think the patch can make the startup process significantly slower,
especially when there are lots of WAL files in the standby's pg_wal
directory. This is because of the overhead StartWALReceiverEagerlyIfPossible
adds, i.e. doing two while loops to figure out the start position of the
streaming in advance. This might end up with the startup process looping
over the directory rather than doing the important work of crash recovery
or standby recovery.

5) What happens if this new GUC is enabled in case of a synchronous standby?
What happens if this new GUC is enabled in case of a crash recovery?
What happens if this new GUC is enabled in case a restore command is
set i.e. standby performing archive recovery?

6) How about the bgwriter/checkpointer, which get started even before the
startup process (or a new bg worker? of course it's going to be an
overkill), finding out the new start pos for the startup process, so that
we could get rid of the <literal>startup</literal> behaviour of the
patch? This avoids an extra burden on the startup process. Many times,
users will complain that recovery is taking more time now, after setting
the GUC wal_receiver_start_condition=startup.

7) I think we can just have 'consistency' and 'exhaust' behaviours and
let the bgwriter or checkpointer find out the start position for the
startup process. The startup process, whenever it reaches a consistent
point, would then check whether the other process has calculated the
start pos for it; if yes, it starts the wal receiver, otherwise it goes
with its usual recovery. I'm not sure if this will be a good idea.

8) Can we have a better GUC name than wal_receiver_start_condition?
Something like wal_receiver_start_at or wal_receiver_start or some
other?

Regards,
Bharath Rupireddy.

#21Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Bharath Rupireddy (#20)
1 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hi Bharath,

Thanks for the review!

On Sat, Nov 27, 2021 at 6:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

1) A memory leak: add FreeDir(dir); before returning.
+ ereport(LOG,
+ (errmsg("Could not start streaming WAL eagerly"),
+ errdetail("There are timeline changes in the locally available WAL files."),
+ errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+ return;
+ }

Thanks for catching that. Fixed.

2) Is there a guarantee that, while we traverse the pg_wal directory to
find startsegno and endsegno, no new WAL files arrive from the
primary or archive location and no old WAL files get removed/recycled by
the standby? Especially when wal_receiver_start_condition=consistency?
+ startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+ endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+ }

Even if newer wal files arrive after the snapshot of the dir listing
taken by AllocateDir()/ReadDir(), we will in effect start from a
slightly older location, which should be fine. It shouldn't matter if
an older file is recycled. If the last valid WAL segment is recycled,
we will ERROR out in StartWALReceiverEagerlyIfPossible() and the eager
start can be retried by the startup process when
CheckRecoveryConsistency() is called again.
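
To make the snapshot semantics concrete, here is a minimal standalone
sketch (my own illustration, not code from the patch) of a single-pass
scan like the one the patch performs; it deliberately ignores the
timeline check the patch also does:

/*
 * Standalone sketch (illustration only): one readdir() pass over pg_wal,
 * tracking the lowest and highest segment file names.  A file that shows
 * up after this snapshot only makes the chosen start point a bit older;
 * a file that disappears is caught later, when the candidate segment is
 * actually read and validated.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int
looks_like_wal_segment(const char *name)
{
	if (strlen(name) != 24)		/* e.g. 000000010000000000000007 */
		return 0;
	for (int i = 0; i < 24; i++)
		if (!isxdigit((unsigned char) name[i]))
			return 0;
	return 1;
}

int
main(void)
{
	DIR		   *dir = opendir("pg_wal");
	struct dirent *de;
	char		lowest[25] = "";
	char		highest[25] = "";

	if (dir == NULL)
		return 1;
	while ((de = readdir(dir)) != NULL)
	{
		if (!looks_like_wal_segment(de->d_name))
			continue;
		/* equal-length uppercase hex names sort numerically with strcmp() */
		if (lowest[0] == '\0' || strcmp(de->d_name, lowest) < 0)
			snprintf(lowest, sizeof(lowest), "%s", de->d_name);
		if (highest[0] == '\0' || strcmp(de->d_name, highest) > 0)
			snprintf(highest, sizeof(highest), "%s", de->d_name);
	}
	closedir(dir);
	printf("oldest: %s  newest: %s\n", lowest, highest);
	return 0;
}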

3) I think the errmsg text format isn't correct. Note that the errmsg
text starts with lowercase and doesn't end with "." whereas errdetail
or errhint starts with uppercase and ends with ".". Please check other
messages for reference.
The following should be changed.
+ errmsg("Requesting stream from beginning of: %s",
+ errmsg("Invalid WAL segment found while calculating stream start:
%s. Skipping.",
+ (errmsg("Could not start streaming WAL eagerly"),

Fixed.

4) I think you also need to have wal files names in double quotes,
something like below:
errmsg("could not close file \"%s\": %m", xlogfname)));

Fixed.

5) It should be ".....stream start: \"%s\", skipping..",
+ errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",

Fixed.

4) I think the patch can make the startup process significantly slower,
especially when there are lots of WAL files in the standby's pg_wal
directory. This is because of the overhead StartWALReceiverEagerlyIfPossible
adds, i.e. doing two while loops to figure out the start position of the
streaming in advance. This might end up with the startup process looping
over the directory rather than doing the important work of crash recovery
or standby recovery.

Well, 99% of the time we can expect that the second loop finishes after
1 or 2 iterations, as the last valid WAL segment would most likely be
the highest numbered WAL file or thereabouts. I don't think that the
overhead will be significant: the first loop only reads a directory
listing, and the second typically reads just one or two page headers.
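
As an aside, the "request streaming from the start of that segment" step
is only arithmetic on the chosen file name. A minimal standalone sketch,
assuming the default 16MB wal_segment_size (my own illustration, not code
from the patch):

/*
 * Standalone sketch (illustration only): turn a WAL segment file name into
 * the LSN of the segment's first byte, assuming the default 16MB
 * wal_segment_size.  For "000000010000000000000007" this prints
 * "timeline 1, segment 7 starts at 0/7000000", which is the same
 * computation the patch does with XLogSegNoOffsetToRecPtr at offset 0.
 */
#include <inttypes.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE	(16 * 1024 * 1024)
#define SEGS_PER_XLOGID		(UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)

int
main(void)
{
	const char *fname = "000000010000000000000007";
	unsigned int tli, logid, seg;
	uint64_t	segno, startlsn;

	if (sscanf(fname, "%8X%8X%8X", &tli, &logid, &seg) != 3)
		return 1;
	segno = (uint64_t) logid * SEGS_PER_XLOGID + seg;
	startlsn = segno * WAL_SEGMENT_SIZE;
	printf("timeline %u, segment %" PRIu64 " starts at %X/%X\n",
		   tli, segno, (unsigned) (startlsn >> 32), (unsigned) startlsn);
	return 0;
}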

5) What happens if this new GUC is enabled in case of a synchronous standby?

What happens if this new GUC is enabled in case of a crash recovery?
What happens if this new GUC is enabled in case a restore command is
set i.e. standby performing archive recovery?

The GUC would behave the same way for all of these cases. If we have
chosen 'startup'/'consistency', we would be starting the WAL receiver
eagerly. There might be certain race conditions when one combines this
GUC with archive recovery, which was discussed upthread [1].

6) How about the bgwriter/checkpointer, which get started even before the
startup process (or a new bg worker? of course it's going to be an
overkill), finding out the new start pos for the startup process, so that
we could get rid of the <literal>startup</literal> behaviour of the
patch? This avoids an extra burden on the startup process. Many times,
users will complain that recovery is taking more time now, after setting
the GUC wal_receiver_start_condition=startup.

Hmm, then we would be needing additional synchronization. There will
also be an added dependency on checkpoint_timeout. I don't think that
the performance hit is significant enough to warrant this change.

8) Can we have a better GUC name than wal_receiver_start_condition?
Something like wal_receiver_start_at or wal_receiver_start or some
other?

Sure, that makes more sense. Fixed.

Regards,
Soumyadeep (VMware)

[1]: /messages/by-id/CAE-ML+-8KnuJqXKHz0mrC7-qFMQJ3ArDC78X3-AjGKos7Ceocw@mail.gmail.com

Attachments:

v5-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From e6ffb5400fd841e4285939133739692c9cb4ba17 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Date: Fri, 19 Nov 2021 00:33:17 -0800
Subject: [PATCH v5 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_at which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the WAL receiver will start after the standby has replayed
up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlog.c             | 171 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc.c                  |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +-
 src/include/replication/walreceiver.h         |  10 +
 src/test/recovery/t/027_walreceiver_start.pl  |  96 ++++++++++
 7 files changed, 322 insertions(+), 10 deletions(-)
 create mode 100644 src/test/recovery/t/027_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e30..244bb98424c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4607,6 +4607,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
       <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 72aeb42961f..5107e997087 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -993,6 +993,8 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+											  TimeLineID currentTLI);
 
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1548,6 +1550,153 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from its beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at)
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY &&
+		!reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo	logSegNo;
+			TimeLineID	tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+						 errdetail("There are timeline changes in the locally available WAL files."),
+						 errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at least the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState *state;
+		XLogRecPtr	startptr;
+		WALReadError errinfo;
+		char		xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory"),
+					 errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"",
+						   xlogfname));
+			XLogReaderFree(state);
+			RequestXLogStreaming(currentTLI, startptr, PrimaryConnInfo,
+								 PrimarySlotName, wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..",
+					   xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at least the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
  * area in the WAL.
@@ -7065,6 +7214,9 @@ StartupXLOG(void)
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, replayTLI);
+
 	/*
 	 * Clear out any old relcache cache files.  This is *necessary* if we do
 	 * any WAL replay, since that would probably result in the cache files
@@ -8250,7 +8402,8 @@ StartupXLOG(void)
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
- * that it can start accepting read-only connections.
+ * that it can start accepting read-only connections. Also, attempt to start
+ * the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -8341,6 +8494,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  XLogCtl->lastReplayedTLI);
 }
 
 /*
@@ -12685,10 +12842,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12798,12 +12957,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e7..22e8d12354b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f736e8d8725..275bbbf9933 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -559,6 +559,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -5015,6 +5022,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_receiver_start_at", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_at,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a1acd46b611..fd986af8c09 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -343,7 +343,8 @@
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
-
+#wal_receiver_start_at = 'exhaust'	# 'exhaust', 'consistency', or 'startup'
+					# (change requires restart)
 # - Subscribers -
 
 # These settings are ignored on a publisher.
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777b..bde3687f931 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,9 +24,19 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
+extern int	wal_receiver_start_at;
 extern bool hot_standby_feedback;
 
 /*
diff --git a/src/test/recovery/t/027_walreceiver_start.pl b/src/test/recovery/t/027_walreceiver_start.pl
new file mode 100644
index 00000000000..da31c470867
--- /dev/null
+++ b/src/test/recovery/t/027_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.25.1

#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Soumyadeep Chakraborty (#21)
Re: Unnecessary delay in streaming replication due to replay lag

At Wed, 15 Dec 2021 17:01:24 -0800, Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote in

Sure, that makes more sense. Fixed.

I played with this briefly. I started a standby from a backup that
has access to an archive. I kept getting the following log lines.

[139535:postmaster] LOG: database system is ready to accept read-only connections
[139542:walreceiver] LOG: started streaming WAL from primary at 0/2000000 on timeline 1
cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003': No such file or directory
[139542:walreceiver] FATAL: could not open file "pg_wal/000000010000000000000003": No such file or directory
cp: cannot stat '/home/horiguti/data/arc_work/00000002.history': No such file or directory
cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003': No such file or directory
[139548:walreceiver] LOG: started streaming WAL from primary at 0/3000000 on timeline 1

The "FATAL: could not open file" message from walreceiver means that
the walreceiver was operationally prohibited to install a new wal
segment at the time. Thus the walreceiver ended as soon as started.
In short, the eager replication is not working at all.

I have a comment on the behavior and objective of this feature.

In the case where archive recovery is started from a backup, this
feature lets walreceiver start while the archive recovery is ongoing.
If walreceiver (or the eager replication) worked as expected, it would
write wal files while archive recovery writes the same set of WAL
segments to the same directory. I don't think that is a sane behavior.
Or, to put it more modestly, an unintended behavior.

In common cases, I believe archive recovery is faster than
replication. If a segment is available from archive, we don't need to
prefetch it via stream.

If this feature is intended to be used only for crash recovery of a
standby, it should fire only when it is needed.

If not, that is, if it is intended to work also for archive recovery,
I think the eager replication should start from the next segment of
the last WAL in archive but that would invite more complex problems.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#23sunil s
sunilfeb26@gmail.com
In reply to: Kyotaro Horiguchi (#22)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hello Hackers,

I recently had the opportunity to continue the effort originally led by a
valued contributor.
I’ve addressed most of the previously reported feedback and issues, and
would like to share the updated patch with the community.

IMHO starting the WAL receiver eagerly offers significant advantages for the
following reasons:

1. If recovery_min_apply_delay is set high (for various operational
   reasons) and the primary crashes, the mirror can recover quickly, thereby
   improving overall High Availability.

2. For setups without archive-based recovery, restore and recovery
   operations complete faster.

3. When synchronous_commit is enabled, faster mirror recovery reduces
   offline time and helps avoid prolonged commit/query wait times during
   failover/recovery.

4. This approach also improves resilience by limiting the impact of network
   interruptions on replication.

In common cases, I believe archive recovery is faster than
replication. If a segment is available from archive, we don't need to
prefetch it via stream.

I completely agree — restoring from the archive is significantly faster
than streaming.
Attempting to stream from the last available WAL in the archive would
introduce complexity and risk.
Therefore, we can limit this feature to crash recovery scenarios and skip
it when archiving is enabled.
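
Concretely, the attached v6 patch bails out of the eager start when
archive recovery with a restore_command is requested. A small standalone
sketch of that decision (hypothetical helper name; the patch performs the
equivalent test inline in its sanity checks):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: mirrors the extra condition the v6 patch adds.  The eager
 * walreceiver start is allowed only for a standby with a primary_conninfo,
 * and is suppressed when archive recovery is requested with a
 * restore_command, leaving that case to the usual 'exhaust' behaviour.
 */
static bool
eager_walreceiver_start_allowed(bool standby_requested,
								bool archive_recovery_requested,
								const char *restore_command,
								const char *primary_conninfo)
{
	if (!standby_requested)
		return false;
	if (primary_conninfo == NULL || primary_conninfo[0] == '\0')
		return false;
	if (archive_recovery_requested &&
		restore_command != NULL && restore_command[0] != '\0')
		return false;
	return true;
}

int
main(void)
{
	/* crash/standby recovery without restore_command: eager start allowed */
	return eager_walreceiver_start_allowed(true, false, NULL,
										   "host=primary") ? 0 : 1;
}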

The "FATAL: could not open file" message from walreceiver means that

the walreceiver was operationally prohibited to install a new wal
segment at the time.
This was caused by an additional fix added in upstream to address a race
condition between the archiver and checkpointer.
It has been resolved in the latest patch, which also includes a TAP test to
verify the fix. Thanks for testing and bringing this to our attention.
For now we will skip the early WAL receiver start during archive recovery,
since enabling write access for the WAL receiver would reintroduce the bug
that commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
<https://github.com/postgres/postgres/commit/cc2c7d65fc27e877c9f407587b0b92d46cd6dd16>
fixed previously.

I've attached the rebased patch with the necessary fix.

Thanks & Regards,
Sunil S (Broadcom)


Attachments:

v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchapplication/octet-stream; name=v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From 38880b9e62bc5d87b146acc9cbacbe07e9ad8dd0 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:27:58 +0530
Subject: [PATCH v6 1/3] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_at which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'.
The value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the WAL receiver will start after the standby has replayed
up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Archive recovery via the restore command does not hold the control lock, and
enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL receiver start
would reintroduce the race with the checkpointer fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
Hence we skip the early start of the WAL receiver in case of archive recovery.

Co-authors: Sunil S <sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlogrecovery.c     | 176 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc_tables.c           |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/replication/walreceiver.h         |  10 +
 6 files changed, 230 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 59a0874528a..403a7e70395 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5067,6 +5067,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         the new setting.
        </para>
       </listitem>
+
+    </varlistentry>
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
      </varlistentry>
 
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 23878b2dd91..b197247da70 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -494,6 +494,161 @@ EnableStandbyMode(void)
 	disable_startup_progress_timeout();
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from its beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR           *dir;
+	struct dirent *de;
+	XLogSegNo     startsegno = -1;
+	XLogSegNo     endsegno   = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 *
+	 * Archive recovery via the restore command does not hold the control lock, and
+	 * enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL receiver start
+	 * would reintroduce the race with the checkpointer fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
+	 * Hence we skip the early start of the WAL receiver in case of archive recovery.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at ||
+		(ArchiveRecoveryRequested &&
+			recoveryRestoreCommand != NULL && strcmp(recoveryRestoreCommand, "") != 0))
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir        = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo logSegNo;
+			TimeLineID tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+							errdetail("There are timeline changes in the locally available WAL files."),
+							errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at least the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState * state;
+		XLogRecPtr   startptr;
+		WALReadError errinfo;
+		char         xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+						errmsg("out of memory"),
+						errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"", xlogfname));
+			XLogReaderFree(state);
+			SetInstallXLogFileSegmentActive();
+			RequestXLogStreaming(currentTLI,
+								 startptr,
+								 PrimaryConnInfo,
+								 PrimarySlotName,
+								 wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..", xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at least the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -805,6 +960,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, recoveryTargetTLI);
+
 	if (ArchiveRecoveryRequested)
 	{
 		if (StandbyModeRequested)
@@ -2180,6 +2338,7 @@ CheckTablespaceDirectory(void)
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
  * that it can start accepting read-only connections.
+ * Also, attempt to start the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -2277,6 +2436,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  lastReplayedTLI);
 }
 
 /*
@@ -3652,10 +3815,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -3769,13 +3934,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b6281101711..b6bc6a15370 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at = WAL_RCV_START_AT_EXHAUST;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 511dc32d519..86332e5f771 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -483,6 +483,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry file_copy_method_options[] = {
 	{"copy", FILE_COPY_METHOD_COPY, false},
 #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
@@ -5418,6 +5425,17 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_io_method, NULL
 	},
 
+	{
+		{"wal_receiver_start_at", PGC_POSTMASTER, REPLICATION_STANDBY,
+		 	gettext_noop("When to start WAL receiver."),
+		 	NULL,
+		 },
+		 &wal_receiver_start_at,
+		 WAL_RCV_START_AT_EXHAUST,
+		 wal_rcv_start_options,
+		 NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 341f88adc87..b49b098d5b1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -385,6 +385,7 @@
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
 #sync_replication_slots = off		# enables slot synchronization on the physical standby from the primary
+#wal_receiver_start_at = 'exhaust'#	'exhaust', 'consistency', or 'startup'			# (change requires restart)
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 89f63f908f8..0bf1f96c5ca 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -28,6 +28,7 @@
 extern PGDLLIMPORT int wal_receiver_status_interval;
 extern PGDLLIMPORT int wal_receiver_timeout;
 extern PGDLLIMPORT bool hot_standby_feedback;
+extern PGDLLIMPORT int	wal_receiver_start_at;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -53,6 +54,15 @@ typedef enum
 	WALRCV_STOPPING,			/* requested to stop, but still running */
 } WalRcvState;
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* Shared memory area for management of walreceiver process */
 typedef struct
 {
-- 
2.49.0

v6-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchapplication/octet-stream; name=v6-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchDownload
From e34b2d3a447a2b72aa1b51ed6778b5d03294ace0 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:29:02 +0530
Subject: [PATCH v6 2/3] Test WAL receiver early start upon reaching
 consistency

This test ensures that when a standby reaches consistency, the WAL receiver
starts immediately and begins streaming from the latest valid WAL segment
already available on disk.
This behavior minimizes delay, avoids waiting for new WAL files once all the
locally available WAL has been restored, and helps provide high availability
in case of a primary crash or failure.

It also allows quicker recovery when `recovery_min_apply_delay` is large and
saves the primary from running out of space.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 src/test/recovery/t/046_walreciver_start.pl | 97 +++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 src/test/recovery/t/046_walreciver_start.pl

diff --git a/src/test/recovery/t/046_walreciver_start.pl b/src/test/recovery/t/046_walreciver_start.pl
new file mode 100644
index 00000000000..12037281f0e
--- /dev/null
+++ b/src/test/recovery/t/046_walreciver_start.pl
@@ -0,0 +1,97 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
+
-- 
2.49.0

v6-0003-Test-archive-recovery-takes-precedence-over-strea.patchapplication/octet-stream; name=v6-0003-Test-archive-recovery-takes-precedence-over-strea.patchDownload
From a53547b859eb5c48d919c7e0d794bc2fd12f12b9 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:36:56 +0530
Subject: [PATCH v6 3/3] Test archive recovery takes precedence over streaming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since archive recovery is typically faster and more efficient than streaming,
initiating the WAL receiver early—before attempting recovery from archived WAL
files—is not ideal.
Furthermore, determining the correct starting point for streaming by examining
the last valid WAL segment restored from the archive adds complexity and potential risk.

Therefore, even when the configuration parameter wal_receiver_start_at is set to consistency,
archive recovery should take precedence, and the WAL receiver should only be
started after archive recovery is exhausted or deemed unavailable.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 .../recovery/t/049_archive_enabled_standby.pl | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 src/test/recovery/t/049_archive_enabled_standby.pl

diff --git a/src/test/recovery/t/049_archive_enabled_standby.pl b/src/test/recovery/t/049_archive_enabled_standby.pl
new file mode 100644
index 00000000000..5dddcb0b00c
--- /dev/null
+++ b/src/test/recovery/t/049_archive_enabled_standby.pl
@@ -0,0 +1,77 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test that a standby with archive recovery enabled does not start WAL receiver early:
+# - Verifies that wal_receiver_start_at = 'consistency'
+# - Ensures WAL receiver is not started when restore_command is specified
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 1;
+
+# Initialize primary node with archiving enabled and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup with restore enabled from archived WAL
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1,  has_restoring => 1);
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should not have started
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;", 'f')
+	or die "Timed out while confirming that streaming did not start";
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+	'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
\ No newline at end of file
-- 
2.49.0

#24sunil s
sunilfeb26@gmail.com
In reply to: sunil s (#23)
Re: Unnecessary delay in streaming replication due to replay lag

Added patch to upcoming commitfest
https://commitfest.postgresql.org/patch/5908/

Thanks & Regards,
Sunil S

On Wed, Jul 9, 2025 at 12:01 AM sunil s <sunilfeb26@gmail.com> wrote:


Hello Hackers,

I recently had the opportunity to continue the effort originally led by a
valued contributor.
I’ve addressed most of the previously reported feedback and issues, and
would like to share the updated patch with the community.

IMHO, starting the WAL receiver eagerly offers significant advantages for
the following reasons:

1. If recovery_min_apply_delay is set high (for various operational
   reasons) and the primary crashes, the mirror can recover quickly,
   thereby improving overall High Availability.
2. For setups without archive-based recovery, restore and recovery
   operations complete faster.
3. When synchronous_commit is enabled, faster mirror recovery reduces
   offline time and helps avoid prolonged commit/query wait times during
   failover/recovery.
4. This approach also improves resilience by limiting the impact of
   network interruptions on replication.

In common cases, I believe archive recovery is faster than
replication. If a segment is available from archive, we don't need to
prefetch it via stream.

I completely agree — restoring from the archive is significantly faster
than streaming.
Attempting to stream from the last available WAL in the archive would
introduce complexity and risk.
Therefore, we can limit this feature to crash recovery scenarios and skip
it when archiving is enabled.
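
For anyone who wants to sanity-check this behaviour by hand, the queries
below are only a rough sketch of what the attached TAP tests automate; the
LSN passed to pg_walfile_name() is an illustrative value, not one taken
from a real run:

-- On the standby, with wal_receiver_start_at = 'consistency' and no
-- restore_command, a walreceiver row should appear soon after consistency
-- is reached, even while replay is held back by recovery_min_apply_delay:
SELECT status, receive_start_lsn, receive_start_tli
FROM pg_stat_wal_receiver;

-- pg_walfile_name() cannot be run during recovery, so map the reported
-- receive_start_lsn back to a segment name on the primary; it should be
-- the latest valid segment already present in the standby's pg_wal
-- (illustrative LSN shown):
SELECT pg_walfile_name('0/3000000');

-- With restore_command configured, the first query should keep returning
-- zero rows until WAL from the archive and pg_wal is exhausted, since the
-- eager start is skipped during archive recovery.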

The "FATAL: could not open file" message from walreceiver means that

the walreceiver was operationally prohibited to install a new wal
segment at the time.
This was caused by an additional fix added upstream to address a race
condition between the archiver and the checkpointer.
It has been resolved in the latest patch, which also includes a TAP test
to verify the fix. Thanks for testing and bringing this to our attention.
For now, we skip the early WAL receiver start when archive recovery is in
use, since giving the WAL receiver write access to install new segments in
that case would reintroduce the bug that commit
cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
<https://github.com/postgres/postgres/commit/cc2c7d65fc27e877c9f407587b0b92d46cd6dd16> fixed
previously.

I've attached the rebased patch with the necessary fix.

Thanks & Regards,
Sunil S (Broadcom)

On Tue, Jul 8, 2025 at 11:01 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Wed, 15 Dec 2021 17:01:24 -0800, Soumyadeep Chakraborty <
soumyadeep2007@gmail.com> wrote in

Sure, that makes more sense. Fixed.

I played with this briefly: I started a standby from a backup that
has access to an archive, and I kept getting the following log lines.

[139535:postmaster] LOG: database system is ready to accept read-only
connections
[139542:walreceiver] LOG: started streaming WAL from primary at
0/2000000 on timeline 1
cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003':
No such file or directory
[139542:walreceiver] FATAL: could not open file
"pg_wal/000000010000000000000003": No such file or directory
cp: cannot stat '/home/horiguti/data/arc_work/00000002.history': No such
file or directory
cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003':
No such file or directory
[139548:walreceiver] LOG: started streaming WAL from primary at
0/3000000 on timeline 1

The "FATAL: could not open file" message from walreceiver means that
the walreceiver was operationally prohibited to install a new wal
segment at the time. Thus the walreceiver ended as soon as started.
In short, the eager replication is not working at all.

I have a comment on the behavior and objective of this feature.

In the case where archive recovery is started from a backup, this
feature lets walreceiver start while the archive recovery is ongoing.
If walreceiver (or the eager replication) worked as expected, it would
write wal files while archive recovery writes the same set of WAL
segments to the same directory. I don't think that is a sane behavior.
Or, to put it more modestly, an unintended behavior.

In common cases, I believe archive recovery is faster than
replication. If a segment is available from archive, we don't need to
prefetch it via stream.

If this feature is intended to be used only for crash recovery of a
standby, it should fire only when it is needed.

If not, that is, if it is intended to also work for archive recovery,
I think the eager replication should start from the segment after the
last WAL segment in the archive, but that would invite more complex problems.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#25Huansong Fu
huansong.fu.info@gmail.com
In reply to: sunil s (#24)
Re: Unnecessary delay in streaming replication due to replay lag

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: tested, failed
Spec compliant: not tested
Documentation: not tested

Hi,

I've been playing with the patch. It worked as intended. I have a few minor review comments on the code and test:

1. There was an indentation issue when applying the v6-0001 patch:
v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch:342: space before tab in indent.
gettext_noop("When to start WAL receiver."),
v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch:343: space before tab in indent.
NULL,
warning: 2 lines add whitespace errors.

2. There was a whitespace issue when applying the v6-0002 test patch:
v6-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patch:126: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

3. Test number for "046_walreciver_start.pl" collided with a recently added test "046_checkpoint_logical_slot.pl" so needs another number.

4. Some comment text needs to be wrapped:
         * Archiving from the restore command does not holds the control lock
-        * and enabling XLogCtl->InstallXLogFileSegmentActive for wal reciever early start
-        * will create a race condition with the checkpointer process as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
-        * Hence skipping early start of the wal receiver in case of archive recovery.
+        * and enabling XLogCtl->InstallXLogFileSegmentActive for wal reciever
+        * early start will create a race condition with the checkpointer process
+        * as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence skipping
+        * early start of the wal receiver in case of archive recovery.
         */
5. Extra ";"
@@ -3820,7 +3821,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                         * eagerly).
                                         */
                                        currentSource = XLOG_FROM_STREAM;
-                                       startWalReceiver = !WalRcvStreaming();;
+                                       startWalReceiver = !WalRcvStreaming();

Thanks,
Huansong
Broadcom Inc.

The new status of this patch is: Waiting on Author

#26sunil s
sunilfeb26@gmail.com
In reply to: Huansong Fu (#25)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Thanks, Huansong, for reviewing the patch. I have addressed all the
points mentioned above.

PFA the rebased patch set.

Thanks & Regards,
Sunil S

Attachments:

v7-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchapplication/octet-stream; name=v7-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From a6692a0a3056c02ecbcf0b5db2c34624b99e228e Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:27:58 +0530
Subject: [PATCH v7 1/3] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC, wal_receiver_start_at, which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC defaults to starting the WAL receiver after WAL from archives
and pg_wal has been exhausted, designated by the value 'exhaust'.
The value 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating back towards the earliest segment. The
iteration is terminated as soon as a valid WAL segment is found.
Streaming can then commence from the start of that segment.

Restoring WAL via the restore command does not hold the control lock,
and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
receiver start would create a race condition with the checkpointer
process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
Hence the early start of the WAL receiver is skipped in case of
archive recovery.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlogrecovery.c     | 177 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc_tables.c           |  18 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/replication/walreceiver.h         |  10 +
 6 files changed, 231 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 20ccb2d6b54..bce23c9dc14 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5071,6 +5071,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         the new setting.
        </para>
       </listitem>
+
+    </varlistentry>
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
      </varlistentry>
 
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index e8f3ba00caa..d04ec2e0187 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -494,6 +494,162 @@ EnableStandbyMode(void)
 	disable_startup_progress_timeout();
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR           *dir;
+	struct dirent *de;
+	XLogSegNo     startsegno = -1;
+	XLogSegNo     endsegno   = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 *
+	 * Restoring WAL via the restore command does not hold the control lock,
+	 * and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
+	 * receiver start would create a race condition with the checkpointer
+	 * process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence we
+	 * skip the early start of the WAL receiver in case of archive recovery.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at ||
+		(ArchiveRecoveryRequested &&
+			recoveryRestoreCommand != NULL && strcmp(recoveryRestoreCommand, "") != 0))
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir        = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo logSegNo;
+			TimeLineID tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+							errdetail("There are timeline changes in the locally available WAL files."),
+							errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState * state;
+		XLogRecPtr   startptr;
+		WALReadError errinfo;
+		char         xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+						errmsg("out of memory"),
+						errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"", xlogfname));
+			XLogReaderFree(state);
+			SetInstallXLogFileSegmentActive();
+			RequestXLogStreaming(currentTLI,
+								 startptr,
+								 PrimaryConnInfo,
+								 PrimarySlotName,
+								 wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..", xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -805,6 +961,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, recoveryTargetTLI);
+
 	if (ArchiveRecoveryRequested)
 	{
 		if (StandbyModeRequested)
@@ -2180,6 +2339,7 @@ CheckTablespaceDirectory(void)
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
  * that it can start accepting read-only connections.
+ * Also, attempt to start the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -2277,6 +2437,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  lastReplayedTLI);
 }
 
 /*
@@ -3652,10 +3816,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -3769,13 +3935,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b6281101711..b6bc6a15370 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at = WAL_RCV_START_AT_EXHAUST;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..a78ce36ac9e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -483,6 +483,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry file_copy_method_options[] = {
 	{"copy", FILE_COPY_METHOD_COPY, false},
 #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
@@ -5418,6 +5425,17 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_io_method, NULL
 	},
 
+	{
+		{"wal_receiver_start_at", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_at,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a9d8293474a..1568d3fe857 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -385,6 +385,7 @@
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
 #sync_replication_slots = off		# enables slot synchronization on the physical standby from the primary
+#wal_receiver_start_at = 'exhaust'#	'exhaust', 'consistency', or 'startup'			# (change requires restart)
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 89f63f908f8..0bf1f96c5ca 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -28,6 +28,7 @@
 extern PGDLLIMPORT int wal_receiver_status_interval;
 extern PGDLLIMPORT int wal_receiver_timeout;
 extern PGDLLIMPORT bool hot_standby_feedback;
+extern PGDLLIMPORT int	wal_receiver_start_at;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -53,6 +54,15 @@ typedef enum
 	WALRCV_STOPPING,			/* requested to stop, but still running */
 } WalRcvState;
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* Shared memory area for management of walreceiver process */
 typedef struct
 {
-- 
2.49.0

v7-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchapplication/octet-stream; name=v7-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchDownload
From 1500f0ebff1f1717253dd77c61d00a96db8d6554 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:29:02 +0530
Subject: [PATCH v7 2/3] Test WAL receiver early start upon reaching
 consistency

This test ensures that when a standby reaches consistency, the WAL receiver
starts immediately and begins streaming from the latest valid WAL segment
already available on disk.
This behavior minimizes delay, avoids waiting for new WAL files once all the
locally available WAL has been restored, and helps provide high availability
in case of a primary crash or failure.

It also allows quicker recovery when `recovery_min_apply_delay` is large and
saves the primary from running out of space.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 src/test/recovery/t/049_walreciver_start.pl | 96 +++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 src/test/recovery/t/049_walreciver_start.pl

diff --git a/src/test/recovery/t/049_walreciver_start.pl b/src/test/recovery/t/049_walreciver_start.pl
new file mode 100644
index 00000000000..da31c470867
--- /dev/null
+++ b/src/test/recovery/t/049_walreciver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.49.0

v7-0003-Test-archive-recovery-takes-precedence-over-strea.patchapplication/octet-stream; name=v7-0003-Test-archive-recovery-takes-precedence-over-strea.patchDownload
From f2dc4abe12e95d10c50c8fc8776fad62d4354731 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:36:56 +0530
Subject: [PATCH v7 3/3] Test archive recovery takes precedence over streaming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since archive recovery is typically faster and more efficient than streaming,
initiating the WAL receiver early—before attempting recovery from archived WAL
files—is not ideal.
Furthermore, determining the correct starting point for streaming by examining
the last valid WAL segment restored from the archive adds complexity and potential risk.

Therefore, even when the configuration parameter wal_receiver_start_at is set to consistency,
archive recovery should take precedence, and the WAL receiver should only be
started after archive recovery is exhausted or deemed unavailable.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 .../recovery/t/050_archive_enabled_standby.pl | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 src/test/recovery/t/050_archive_enabled_standby.pl

diff --git a/src/test/recovery/t/050_archive_enabled_standby.pl b/src/test/recovery/t/050_archive_enabled_standby.pl
new file mode 100644
index 00000000000..5dddcb0b00c
--- /dev/null
+++ b/src/test/recovery/t/050_archive_enabled_standby.pl
@@ -0,0 +1,77 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test that a standby with archive recovery enabled does not start WAL receiver early:
+# - Verifies that wal_receiver_start_at = 'consistency'
+# - Ensures WAL receiver is not started when restore_command is specified
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 1;
+
+# Initialize primary node with archiving enabled and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup with restore enabled from archived WAL
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1,  has_restoring => 1);
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should not have started
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;", 'f')
+	or die "Timed out while confirming that streaming did not start";
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+	'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
\ No newline at end of file
-- 
2.49.0

#27sunil s
sunilfeb26@gmail.com
In reply to: sunil s (#26)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hello Hackers,

PFA the rebased patch set, updated for the code changes in upstream commit
63599896545c7869f7dd28cd593e8b548983d613
<https://github.com/postgres/postgres/commit/63599896545c7869f7dd28cd593e8b548983d613>.

The patch registered in the commitfest
<https://commitfest.postgresql.org/patch/5908/> is currently marked
"*Ready for Committer*".

Thanks & Regards,
Sunil S
Broadcom Inc


Attachments:

v8-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchapplication/octet-stream; name=v8-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From bb41529e25c5f284438a86c097a78aca3a1998e1 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:27:58 +0530
Subject: [PATCH v8 1/3] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC, wal_receiver_start_at, which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC defaults to starting the WAL receiver after WAL from archives
and pg_wal has been exhausted, designated by the value 'exhaust'.
The value 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating back towards the earliest segment. The
iteration is terminated as soon as a valid WAL segment is found.
Streaming can then commence from the start of that segment.

Restoring WAL via the restore command does not hold the control lock,
and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
receiver start would create a race condition with the checkpointer
process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
Hence the early start of the WAL receiver is skipped in case of
archive recovery.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlogrecovery.c     | 177 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/guc_tables.c           |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/replication/walreceiver.h         |  10 +
 7 files changed, 227 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2a3685f474a..a91aafeb821 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5072,6 +5072,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         the new setting.
        </para>
       </listitem>
+
+    </varlistentry>
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
      </varlistentry>
 
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..a3fd74b09ba 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -494,6 +494,162 @@ EnableStandbyMode(void)
 	disable_startup_progress_timeout();
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from it's beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR           *dir;
+	struct dirent *de;
+	XLogSegNo     startsegno = -1;
+	XLogSegNo     endsegno   = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 *
+	 * Restoring WAL via the restore command does not hold the control lock,
+	 * and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
+	 * receiver start would create a race condition with the checkpointer
+	 * process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence we
+	 * skip the early start of the WAL receiver in case of archive recovery.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at ||
+		(ArchiveRecoveryRequested &&
+			recoveryRestoreCommand != NULL && strcmp(recoveryRestoreCommand, "") != 0))
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir        = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo logSegNo;
+			TimeLineID tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+							errdetail("There are timeline changes in the locally available WAL files."),
+							errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at the segment that included the checkpoint record we
+	 * started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState * state;
+		XLogRecPtr   startptr;
+		WALReadError errinfo;
+		char         xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+						errmsg("out of memory"),
+						errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"", xlogfname));
+			XLogReaderFree(state);
+			SetInstallXLogFileSegmentActive();
+			RequestXLogStreaming(currentTLI,
+								 startptr,
+								 PrimaryConnInfo,
+								 PrimarySlotName,
+								 wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..", xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at the segment that
+	 * included the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -805,6 +961,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, recoveryTargetTLI);
+
 	if (ArchiveRecoveryRequested)
 	{
 		if (StandbyModeRequested)
@@ -2181,6 +2340,7 @@ CheckTablespaceDirectory(void)
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
  * that it can start accepting read-only connections.
+ * Also, attempt to start the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -2278,6 +2438,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  lastReplayedTLI);
 }
 
 /*
@@ -3653,10 +3817,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -3770,13 +3936,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7361ffc9dcf..e1c8e7ef6e1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at = WAL_RCV_START_AT_EXHAUST;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0da01627cfe..a2ba701e0f8 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3475,4 +3475,11 @@
   assign_hook => 'assign_io_method',
 },
 
+{ name => 'wal_receiver_start_at', type => 'enum', context => 'PGC_POSTMASTER', group => 'REPLICATION_STANDBY',
+  short_desc => 'When to start WAL receiver',
+  variable => 'wal_receiver_start_at',
+  boot_val => 'WAL_RCV_START_AT_EXHAUST',
+  options => 'wal_rcv_start_options',
+},
+
 ]
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 00c8376cf4d..255f081c977 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -483,6 +483,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry file_copy_method_options[] = {
 	{"copy", FILE_COPY_METHOD_COPY, false},
 #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 26c08693564..6360800b26b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -385,6 +385,7 @@
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
 #sync_replication_slots = off		# enables slot synchronization on the physical standby from the primary
+#wal_receiver_start_at = 'exhaust'	# 'exhaust', 'consistency', or 'startup' (change requires restart)
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 89f63f908f8..0bf1f96c5ca 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -28,6 +28,7 @@
 extern PGDLLIMPORT int wal_receiver_status_interval;
 extern PGDLLIMPORT int wal_receiver_timeout;
 extern PGDLLIMPORT bool hot_standby_feedback;
+extern PGDLLIMPORT int	wal_receiver_start_at;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -53,6 +54,15 @@ typedef enum
 	WALRCV_STOPPING,			/* requested to stop, but still running */
 } WalRcvState;
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* Shared memory area for management of walreceiver process */
 typedef struct
 {
-- 
2.50.1

v8-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchapplication/octet-stream; name=v8-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchDownload
From e3b7c5fd36c9bb069bf958b35ebcdc66e1d61d30 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:29:02 +0530
Subject: [PATCH v8 2/3] Test WAL receiver early start upon reaching
 consistency

This test ensures that when a standby reaches consistency,
the WAL receiver starts immediately and begins streaming from the latest
valid WAL segment already available on disk.
This behavior minimizes delay, avoids waiting once all locally available
WAL has been replayed, and helps provide high availability in case of a
primary crash or failure.

It also helps recovery finish sooner when `recovery_min_apply_delay` is
large, and saves the primary from running out of disk space.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 src/test/recovery/t/049_walreciver_start.pl | 96 +++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 src/test/recovery/t/049_walreciver_start.pl

diff --git a/src/test/recovery/t/049_walreciver_start.pl b/src/test/recovery/t/049_walreciver_start.pl
new file mode 100644
index 00000000000..da31c470867
--- /dev/null
+++ b/src/test/recovery/t/049_walreciver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e. from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.50.1

v8-0003-Test-archive-recovery-takes-precedence-over-strea.patchapplication/octet-stream; name=v8-0003-Test-archive-recovery-takes-precedence-over-strea.patchDownload
From 582b0d0a0ae6d5423972182e924f4dd115ccdfc0 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:36:56 +0530
Subject: [PATCH v8 3/3] Test archive recovery takes precedence over streaming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since archive recovery is typically faster and more efficient than streaming,
initiating the WAL receiver early—before attempting recovery from archived WAL
files—is not ideal.
Furthermore, determining the correct starting point for streaming by examining
the last valid WAL segment restored from the archive adds complexity and potential risk.

Therefore, even when the configuration parameter wal_receiver_start_at is set to consistency,
archive recovery should take precedence, and the WAL receiver should only be
started after archive recovery is exhausted or deemed unavailable.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 .../recovery/t/050_archive_enabled_standby.pl | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 src/test/recovery/t/050_archive_enabled_standby.pl

diff --git a/src/test/recovery/t/050_archive_enabled_standby.pl b/src/test/recovery/t/050_archive_enabled_standby.pl
new file mode 100644
index 00000000000..5dddcb0b00c
--- /dev/null
+++ b/src/test/recovery/t/050_archive_enabled_standby.pl
@@ -0,0 +1,77 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test that a standby with archive recovery enabled does not start WAL receiver early:
+# - Sets wal_receiver_start_at = 'consistency'
+# - Ensures WAL receiver is not started when restore_command is specified
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 1;
+
+# Initialize primary node with archiving enabled and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup with restore enabled from archived WAL
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1,  has_restoring => 1);
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should not have started
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;", 'f')
+	or die "Timed out while confirming WAL receiver has not started";
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+	'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
\ No newline at end of file
-- 
2.50.1

#28Fujii Masao
masao.fujii@gmail.com
In reply to: sunil s (#27)
Re: Unnecessary delay in streaming replication due to replay lag

On Thu, Sep 11, 2025 at 5:51 PM sunil s <sunilfeb26@gmail.com> wrote:

Hello Hackers,

PFA rebased patch due to the code changes done in upstream commit 63599896545c7869f7dd28cd593e8b548983d613.

The current status of the patch registered in Commit Fest is "Ready for Committer".

+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL.

When this parameter is set to 'startup' or 'consistency', what happens
if replication begins early and the startup process fails to replay
a WAL record—say, due to corruption—before reaching the replication
start point? In that case, the standby might fail to recover correctly
because of missing WAL records, while a transaction waiting for
synchronous replication may have already been acknowledged as committed.
Wouldn't that lead to a serious problem?

Regards,

--
Fujii Masao

#29Josef Šimánek
josef.simanek@gmail.com
In reply to: sunil s (#27)
Re: Unnecessary delay in streaming replication due to replay lag

On Sun, Nov 2, 2025 at 18:33, sunil s <sunilfeb26@gmail.com> wrote:

Hello Hackers,

PFA rebased patch due to the code changes done in upstream commit 63599896545c7869f7dd28cd593e8b548983d613.

src/test/recovery/t/050_archive_enabled_standby.pl is missing the
ending newline. Is that intentional?

could be seen at
https://github.com/postgresql-cfbot/postgresql/commit/041e477fea9677fa6dee0736ffe4825f704c066e

The current status of the patch registered in Commit Fest is "Ready for Committer".

Thanks & Regards,
Sunil S
Broadcom Inc

#30sunil s
sunilfeb26@gmail.com
In reply to: Fujii Masao (#28)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

When this parameter is set to 'startup' or 'consistency', what happens
if replication begins early and the startup process fails to replay
a WAL record—say, due to corruption—before reaching the replication
start point? In that case, the standby might fail to recover correctly
because of missing WAL records,

Let's compare the behavior with and without this patch:

Without the patch:

*Scenario 1:* With a large recovery_min_apply_delay (e.g., 2 hours)
Even in this case, the flush acknowledgment for streamed WAL is sent, and
the primary may already have recycled those WAL files.
If a corrupted record is encountered later during replay, streaming those
records again is not possible.

*Scenario 2:* With recovery_min_apply_delay = 0 or in normal standby
operation

In this case the restart_lsn is advanced based on the flush pointer
(flushPtr), allowing the primary to recycle the corresponding WAL files.

If a corrupt record is encountered while replaying the local WAL records,
streaming would also fail here, right?
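
For what it's worth, this relationship is easy to observe (a rough sketch;
exact columns depend on the server version, and standby1 is just an example
slot name):

    -- on the primary: restart_lsn follows the standby's reported flush position
    SELECT slot_name, restart_lsn, wal_status
    FROM pg_replication_slots WHERE slot_name = 'standby1';

    SELECT application_name, sent_lsn, flush_lsn, replay_lsn
    FROM pg_stat_replication;

    -- on the standby: what the walreceiver has actually flushed
    SELECT flushed_lsn, latest_end_lsn FROM pg_stat_wal_receiver;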

With this patch:

Starting the WAL receiver early (say, at the consistency point) allows us
to prefetch the records earlier in the redo loop instead of waiting until
we exhaust the locally available WAL.

Even if the WAL receiver had not started early, those WAL segments would
have been recycled anyway, since the restart_lsn would have advanced.
Therefore, the record corruption behaviour is unchanged; the benefit of
this patch is reduced replay lag.
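
To put numbers on that lag while the standby catches up, something like the
following can be used (a rough sketch, not part of the patch):

    -- on the primary
    SELECT application_name, flush_lag, replay_lag FROM pg_stat_replication;

    -- on the standby: bytes received but not yet replayed
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_lag_bytes;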

- Reduces replay lag when recovery_min_apply_delay is large, as reported
  in [2].
- Mitigates delay for standbys lagging due to limited network bandwidth,
  high latency, or slow disk writes (HDD).
- Faster recovery.
- Currently, until the WAL receiver is started, no commit acknowledgement
  is sent to a waiting transaction, since the WAL receiver is not running.
  With this change the waiting transaction gets unblocked as soon as the
  record is applied (see the sketch below this list).
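
To illustrate that last point: with synchronous replication configured and
no walreceiver running yet, a commit on the primary simply sits in the
SyncRep wait (sketch only; t is a throwaway example table that already
exists):

    -- primary, session 1: blocks at commit until a standby acknowledges
    INSERT INTO t VALUES (1);

    -- primary, session 2: the waiting backend is visible here
    SELECT pid, state, wait_event_type, wait_event
    FROM pg_stat_activity WHERE wait_event = 'SyncRep';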

Under normal conditions too, the slot is advanced based on the flush
pointer, even if the mode is remote_apply. We already fixed a corruption
scenario for a continuation record at the end of the last locally available
segment (see [1]); previously streaming was requested starting at that
last, possibly corrupt, record, whereas now it starts much earlier.

If the WAL records are still retained on the primary, then in case of a
corrupt record we can restart the WAL receiver from an older LSN than the
one we start from as part of early streaming.
The same mechanism is used on the standby when it switches between WAL
sources. I don't see any scenario where the new workflow would break
existing behavior.
Could you point out the specific case you're concerned about? Understanding
that will help us refine the implementation.

while a transaction waiting for synchronous replication may have already
been acknowledged as committed.
Wouldn't that lead to a serious problem?

Without the patch:

If the synchronous replication mode is flush (on), then even with a large
recovery_min_apply_delay (e.g., 2 hours), the transaction is acknowledged
as committed before the record is actually applied on the standby.

If the mode is remote_apply, the primary waits until the record is applied
on the standby, which includes waiting out the configured recovery delay.

With the patch:

The behavior remains the same with respect to synchronous_commit; it still
depends on whether the mode is flush or remote_apply.

So we can already see a similar situation today with a large
recovery_min_apply_delay (e.g., 2 hours), or with a slow-apply standby
where all the WAL files are streamed but not yet replayed.
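
Roughly, the difference between the two modes on the primary looks like
this (illustrative only):

    -- acknowledged once the standby has flushed the WAL, even if it will
    -- not be applied for another two hours
    ALTER SYSTEM SET synchronous_commit = 'on';

    -- acknowledged only after the standby has applied the record, so the
    -- commit also waits out recovery_min_apply_delay
    ALTER SYSTEM SET synchronous_commit = 'remote_apply';
    SELECT pg_reload_conf();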

AFAIU this patch doesn't introduce any new behavior. In a normal situation
where the WAL receiver is continuously streaming, we would have received
those WAL segments anyway, without waiting for replay to finish, right?

The only difference is that we initiate the walreceiver earlier in the
recovery loop, which benefits us in many ways. On systems where replay is
slow due to low-powered hardware, limited system resources, low network
bandwidth, or slow disk writes (HDD), the standby falls behind the primary.

Prefetching the WAL records early avoids WAL building up on the primary,
which prevents it from running out of disk space and also gives us faster
standby recovery. Faster recovery means faster application availability
and lower downtime when synchronous commit is enabled.
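
With the patch applied, enabling the eager start on the standby is just
the following (the GUC is PGC_POSTMASTER, so a restart is needed, and
primary_conninfo must already be configured):

    ALTER SYSTEM SET wal_receiver_start_at = 'consistency';  -- or 'startup'; default is 'exhaust'
    -- then restart the standby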

src/test/recovery/t/050_archive_enabled_standby.pl is missing the
ending newline. Is that intentional?

Thanks for reporting. Fixed in the new rebased patch.

References:
[1]: https://github.com/postgres/postgres/commit/0668719801838aa6a8bda330ff9b3d20097ea844
[2]: /messages/by-id/201901301432.p3utg64hum27@alvherre.pgsql

Thanks & Regards,
Sunil S

Attachments:

v9-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchapplication/octet-stream; name=v9-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From 9611a671f595c37334b14e1288141a42d3775a9b Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:27:58 +0530
Subject: [PATCH v9 1/3] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC, wal_receiver_start_at, which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC defaults to starting the WAL receiver after WAL from archives
and pg_wal has been exhausted, designated by the value 'exhaust'.
The value 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Archiving from the restore command does not hold the control lock,
and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL receiver start
will create a race condition with the checkpointer process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
Hence the early start of the WAL receiver is skipped in case of archive recovery.

Co-authors: Sunil Seetharama<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlogrecovery.c     | 182 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/guc_tables.c           |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/replication/walreceiver.h         |  10 +
 7 files changed, 232 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 06d1e4403b5..66740dcb22c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5072,6 +5072,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         the new setting.
        </para>
       </listitem>
+
+    </varlistentry>
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
      </varlistentry>
 
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 550de6e4a59..1285fa049ae 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -496,6 +496,167 @@ EnableStandbyMode(void)
 	disable_startup_progress_timeout();
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from its beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR           *dir;
+	struct dirent *de;
+	XLogSegNo     startsegno = -1;
+	XLogSegNo     endsegno   = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 *
+	 * Archiving from the restore command does not hold the control lock,
+	 * and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
+	 * receiver start will create a race condition with the checkpointer
+	 * process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence
+	 * skip the early start of the WAL receiver in case of archive recovery.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at ||
+		(ArchiveRecoveryRequested &&
+			recoveryRestoreCommand != NULL && strcmp(recoveryRestoreCommand, "") != 0))
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir        = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo logSegNo;
+			TimeLineID tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+							errdetail("There are timeline changes in the locally available WAL files."),
+							errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at least up to the segment that contained the
+	 * checkpoint record we started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState * state;
+		XLogRecPtr   startptr;
+		WALReadError errinfo;
+		char         xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+						errmsg("out of memory"),
+						errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI, &errinfo))
+		{
+			/*
+			 * FIXME: If a segment file with zero bytes is found in pg_wal, skip
+			 * that file and try the previous segment instead of erroring out here.
+			 */
+			WALReadRaiseError(&errinfo);
+		}
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"", xlogfname));
+			XLogReaderFree(state);
+			SetInstallXLogFileSegmentActive();
+			RequestXLogStreaming(currentTLI,
+								 startptr,
+								 PrimaryConnInfo,
+								 PrimarySlotName,
+								 wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..", xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at least up to the
+	 * segment that contained the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -807,6 +968,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, recoveryTargetTLI);
+
 	if (ArchiveRecoveryRequested)
 	{
 		if (StandbyModeRequested)
@@ -2193,6 +2357,7 @@ CheckTablespaceDirectory(void)
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
  * that it can start accepting read-only connections.
+ * Also, attempt to start the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -2290,6 +2455,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  lastReplayedTLI);
 }
 
 /*
@@ -3669,10 +3838,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -3805,13 +3976,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7361ffc9dcf..e1c8e7ef6e1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at = WAL_RCV_START_AT_EXHAUST;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 1128167c025..305d55f6008 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3371,6 +3371,13 @@
   boot_val => 'false',
 },
 
+{ name => 'wal_receiver_start_at', type => 'enum', context => 'PGC_POSTMASTER', group => 'REPLICATION_STANDBY',
+  short_desc => 'When to start WAL receiver',
+  variable => 'wal_receiver_start_at',
+  boot_val => 'WAL_RCV_START_AT_EXHAUST',
+  options => 'wal_rcv_start_options',
+},
+
 { name => 'wal_receiver_status_interval', type => 'int', context => 'PGC_SIGHUP', group => 'REPLICATION_STANDBY',
   short_desc => 'Sets the maximum interval between WAL receiver status reports to the sending server.',
   flags => 'GUC_UNIT_S',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 00c8376cf4d..255f081c977 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -483,6 +483,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry file_copy_method_options[] = {
 	{"copy", FILE_COPY_METHOD_COPY, false},
 #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f62b61967ef..67e749ae8b6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -385,6 +385,7 @@
 					# retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0		# minimum delay for applying changes during recovery
 #sync_replication_slots = off		# enables slot synchronization on the physical standby from the primary
+#wal_receiver_start_at = 'exhaust'	# 'exhaust', 'consistency', or 'startup' (change requires restart)
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e5557d21fa8..bafd7355598 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -28,6 +28,7 @@
 extern PGDLLIMPORT int wal_receiver_status_interval;
 extern PGDLLIMPORT int wal_receiver_timeout;
 extern PGDLLIMPORT bool hot_standby_feedback;
+extern PGDLLIMPORT int	wal_receiver_start_at;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -53,6 +54,15 @@ typedef enum
 	WALRCV_STOPPING,			/* requested to stop, but still running */
 } WalRcvState;
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* Shared memory area for management of walreceiver process */
 typedef struct
 {
-- 
2.50.1

v9-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchapplication/octet-stream; name=v9-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patchDownload
From b941825ede39ab43dcb57f0014f4fad1375ca18c Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:29:02 +0530
Subject: [PATCH v9 2/3] Test WAL receiver early start upon reaching
 consistency

This test ensures that when a standby reaches consistency,
the WAL receiver starts immediately and begins streaming from the latest
valid WAL segment already available on disk.
This behavior minimizes delay, avoids waiting once all locally available
WAL has been replayed, and helps provide high availability in case of a
primary crash or failure.

It also helps recovery finish sooner when `recovery_min_apply_delay` is
large, and saves the primary from running out of disk space.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 src/test/recovery/t/049_walreciver_start.pl | 96 +++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 src/test/recovery/t/049_walreciver_start.pl

diff --git a/src/test/recovery/t/049_walreciver_start.pl b/src/test/recovery/t/049_walreciver_start.pl
new file mode 100644
index 00000000000..da31c470867
--- /dev/null
+++ b/src/test/recovery/t/049_walreciver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e. from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.50.1

v9-0003-Test-archive-recovery-takes-precedence-over-strea.patchapplication/octet-stream; name=v9-0003-Test-archive-recovery-takes-precedence-over-strea.patchDownload
From d34eb6ebb748c6bc3200cda08e3a516757edb569 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:36:56 +0530
Subject: [PATCH v9 3/3] Test archive recovery takes precedence over streaming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since archive recovery is typically faster and more efficient than streaming,
initiating the WAL receiver early—before attempting recovery from archived WAL
files—is not ideal.
Furthermore, determining the correct starting point for streaming by examining
the last valid WAL segment restored from the archive adds complexity and potential risk.

Therefore, even when the configuration parameter wal_receiver_start_at is set to consistency,
archive recovery should take precedence, and the WAL receiver should only be
started after archive recovery is exhausted or deemed unavailable.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 .../recovery/t/050_archive_enabled_standby.pl | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 src/test/recovery/t/050_archive_enabled_standby.pl

diff --git a/src/test/recovery/t/050_archive_enabled_standby.pl b/src/test/recovery/t/050_archive_enabled_standby.pl
new file mode 100644
index 00000000000..5e8cbada842
--- /dev/null
+++ b/src/test/recovery/t/050_archive_enabled_standby.pl
@@ -0,0 +1,77 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test that a standby with archive recovery enabled does not start WAL receiver early:
+# - Sets wal_receiver_start_at = 'consistency'
+# - Ensures WAL receiver is not started when restore_command is specified
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 1;
+
+# Initialize primary node with archiving enabled and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup with restore enabled from archived WAL
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1,  has_restoring => 1);
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should not have started
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;", 'f')
+	or die "Timed out while confirming WAL receiver has not started";
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+	'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.50.1

#31sunil s
sunilfeb26@gmail.com
In reply to: sunil s (#30)
3 attachment(s)
Re: Unnecessary delay in streaming replication due to replay lag

Hi,

Attaching the rebased patch after resolving some recent conflicts.

Thanks & Regards,
Sunil Seetharama

Attachments:

v10-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchapplication/octet-stream; name=v10-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patchDownload
From 71139278d2d7a8897257cbcc5bdc30c68b82758a Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:27:58 +0530
Subject: [PATCH v10 1/3] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC, wal_receiver_start_at, which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC defaults to starting the WAL receiver after WAL from archives
and pg_wal has been exhausted, designated by the value 'exhaust'.
The value 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value 'consistency'
indicates that the WAL receiver will start after the standby has
replayed up to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Archiving from the restore command does not hold the control lock,
and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL receiver start
will create a race condition with the checkpointer process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
Hence the early start of the WAL receiver is skipped in case of archive recovery.

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion:https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  33 ++++
 src/backend/access/transam/xlogrecovery.c     | 182 +++++++++++++++++-
 src/backend/replication/walreceiver.c         |   1 +
 src/backend/utils/misc/guc_parameters.dat     |   7 +
 src/backend/utils/misc/guc_tables.c           |   7 +
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/replication/walreceiver.h         |  10 +
 7 files changed, 233 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 023b3f03ba9..16b5de61477 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5079,6 +5079,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         the new setting.
        </para>
       </listitem>
+
+    </varlistentry>
+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_at">
+      <term><varname>wal_receiver_start_at</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_at</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies when the WAL receiver process will be started for a standby
+        server.
+        The allowed values of <varname>wal_receiver_start_at</varname>
+        are <literal>startup</literal> (start immediately when the standby starts),
+        <literal>consistency</literal> (start only after reaching consistency), and
+        <literal>exhaust</literal> (start only after all WAL from the archive and
+        pg_wal has been replayed).
+        The default setting is <literal>exhaust</literal>.
+       </para>
+
+       <para>
+        Traditionally, the WAL receiver process is started only after the
+        standby server has exhausted all WAL from the WAL archive and the local
+        pg_wal directory. In some environments there can be a significant volume
+        of local WAL left to replay, along with a large volume of yet to be
+        streamed WAL. Such environments can benefit from setting
+        <varname>wal_receiver_start_at</varname> to
+        <literal>startup</literal> or <literal>consistency</literal>. These
+        values will lead to the WAL receiver starting much earlier, and from
+        the end of locally available WAL. The network will be utilized to stream
+        WAL concurrently with replay, improving performance significantly.
+       </para>
+      </listitem>
      </varlistentry>
 
      <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..f28dcdf92ab 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -496,6 +496,167 @@ EnableStandbyMode(void)
 	disable_startup_progress_timeout();
 }
 
+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then request
+ * streaming to commence from its beginning. startPoint signifies whether we
+ * are trying the eager start right at startup or once we have reached
+ * consistency.
+ */
+static void
+StartWALReceiverEagerlyIfPossible(WalRcvStartCondition startPoint,
+								  TimeLineID currentTLI)
+{
+	DIR           *dir;
+	struct dirent *de;
+	XLogSegNo     startsegno = -1;
+	XLogSegNo     endsegno   = -1;
+
+	/*
+	 * We should not be starting the walreceiver during bootstrap/init
+	 * processing.
+	 */
+	if (!IsNormalProcessingMode())
+		return;
+
+	/* Only the startup process can request an eager walreceiver start. */
+	Assert(AmStartupProcess());
+
+	/* Return if we are not set up to start the WAL receiver eagerly. */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_EXHAUST)
+		return;
+
+	/*
+	 * Sanity checks: We must be in standby mode with primary_conninfo set up
+	 * for streaming replication, the WAL receiver should not already have
+	 * started and the intended startPoint must match the start condition GUC.
+	 *
+	 * Archiving from the restore command does not hold the control lock,
+	 * and enabling XLogCtl->InstallXLogFileSegmentActive for an early WAL
+	 * receiver start will create a race condition with the checkpointer
+	 * process, as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence
+	 * skip the early start of the WAL receiver in case of archive recovery.
+	 */
+	if (!StandbyModeRequested || WalRcvStreaming() ||
+		!PrimaryConnInfo || strcmp(PrimaryConnInfo, "") == 0 ||
+		startPoint != wal_receiver_start_at ||
+		(ArchiveRecoveryRequested &&
+			recoveryRestoreCommand != NULL && strcmp(recoveryRestoreCommand, "") != 0))
+		return;
+
+	/*
+	 * We must have reached consistency if we wanted to start the walreceiver
+	 * at the consistency point.
+	 */
+	if (wal_receiver_start_at == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
+		return;
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir        = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			XLogSegNo logSegNo;
+			TimeLineID tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != currentTLI)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("could not start streaming WAL eagerly"),
+							errdetail("There are timeline changes in the locally available WAL files."),
+							errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				FreeDir(dir);
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in pg_wal. By this point,
+	 * we must have read at least up to the segment that contained the
+	 * checkpoint record we started replaying from.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState * state;
+		XLogRecPtr   startptr;
+		WALReadError errinfo;
+		char         xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, currentTLI, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+						errmsg("out of memory"),
+						errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 currentTLI, &errinfo))
+		{
+			/*
+			 * FIXME: If a segment file with zero bytes is found in pg_wal, skip
+			 * that file and try the previous segment instead of erroring out here.
+			 */
+			WALReadRaiseError(&errinfo);
+		}
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("requesting stream from beginning of: \"%s\"", xlogfname));
+			XLogReaderFree(state);
+			SetInstallXLogFileSegmentActive();
+			RequestXLogStreaming(currentTLI,
+								 startptr,
+								 PrimaryConnInfo,
+								 PrimarySlotName,
+								 wal_receiver_create_temp_slot);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("invalid WAL segment found while calculating stream start: \"%s\". skipping..", xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here as we should have at least one valid WAL
+	 * segment in pg_wal. By this point, we must have read at least up to the
+	 * segment that contained the checkpoint record we started replaying from.
+	 */
+	Assert(false);
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -807,6 +968,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_STARTUP, recoveryTargetTLI);
+
 	if (ArchiveRecoveryRequested)
 	{
 		if (StandbyModeRequested)
@@ -2193,6 +2357,7 @@ CheckTablespaceDirectory(void)
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
  * that it can start accepting read-only connections.
+ * Also, attempt to start the WAL receiver eagerly if so configured.
  */
 static void
 CheckRecoveryConsistency(void)
@@ -2290,6 +2455,10 @@ CheckRecoveryConsistency(void)
 
 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/* Start WAL receiver eagerly if requested. */
+	StartWALReceiverEagerlyIfPossible(WAL_RCV_START_AT_CONSISTENCY,
+									  lastReplayedTLI);
 }
 
 /*
@@ -3669,10 +3838,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -3805,13 +3976,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
-
-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..0e2f05f0cfa 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -88,6 +88,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_at = WAL_RCV_START_AT_EXHAUST;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 1128167c025..305d55f6008 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3371,6 +3371,13 @@
   boot_val => 'false',
 },
 
+{ name => 'wal_receiver_start_at', type => 'enum', context => 'PGC_POSTMASTER', group => 'REPLICATION_STANDBY',
+  short_desc => 'Sets when the WAL receiver process is started on a standby.',
+  variable => 'wal_receiver_start_at',
+  boot_val => 'WAL_RCV_START_AT_EXHAUST',
+  options => 'wal_rcv_start_options',
+},
+
 { name => 'wal_receiver_status_interval', type => 'int', context => 'PGC_SIGHUP', group => 'REPLICATION_STANDBY',
   short_desc => 'Sets the maximum interval between WAL receiver status reports to the sending server.',
   flags => 'GUC_UNIT_S',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 0209b2067a2..426c8743524 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -483,6 +483,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry file_copy_method_options[] = {
 	{"copy", FILE_COPY_METHOD_COPY, false},
 #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..84e13930276 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -385,6 +385,8 @@
                                         # retrieve WAL after a failed attempt
 #recovery_min_apply_delay = 0           # minimum delay for applying changes during recovery
 #sync_replication_slots = off           # enables slot synchronization on the physical standby from the primary
+#wal_receiver_start_at = 'exhaust'      # 'exhaust', 'consistency', or 'startup'
+                                        # (change requires restart)
 
 # - Subscribers -
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e5557d21fa8..bafd7355598 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -28,6 +28,7 @@
 extern PGDLLIMPORT int wal_receiver_status_interval;
 extern PGDLLIMPORT int wal_receiver_timeout;
 extern PGDLLIMPORT bool hot_standby_feedback;
+extern PGDLLIMPORT int	wal_receiver_start_at;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -53,6 +54,15 @@ typedef enum
 	WALRCV_STOPPING,			/* requested to stop, but still running */
 } WalRcvState;
 
+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* Shared memory area for management of walreceiver process */
 typedef struct
 {
-- 
2.50.1

v10-0002-Test-WAL-receiver-early-start-upon-reaching-cons.patchapplication/octet-stream; name=v10-0002-Test-WAL-receiver-early-start-upon-reaching-cons.patchDownload
From 2036cd523b6f531f4db494e97914de8a49a579e6 Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:29:02 +0530
Subject: [PATCH v10 2/3] Test WAL receiver early start upon reaching
 consistency

This test ensures that when a standby reaches consistency, the WAL receiver
starts immediately and begins streaming from the latest valid WAL segment
already available on disk.

Starting the WAL receiver this early minimizes the delay in resuming
replication and avoids waiting until all locally available WAL has been
replayed, which improves availability if the primary crashes or fails.
It also speeds up recovery when recovery_min_apply_delay is large and helps
keep the primary from running out of space while the standby catches up.
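
For illustration, the standby configuration this test exercises boils down
to roughly the following (a sketch only; the connection details are
placeholders):

    primary_conninfo = 'host=primary port=5432 user=replicator'
    wal_receiver_start_at = 'consistency'
    recovery_min_apply_delay = '2h'   # used by the test to hold back replay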

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 src/test/recovery/t/049_walreceiver_start.pl | 96 +++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 src/test/recovery/t/049_walreceiver_start.pl

diff --git a/src/test/recovery/t/049_walreceiver_start.pl b/src/test/recovery/t/049_walreceiver_start.pl
new file mode 100644
index 00000000000..da31c470867
--- /dev/null
+++ b/src/test/recovery/t/049_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_at = 'consistency'
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of the valid
+# locally available WAL, i.e. from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.50.1

v10-0003-Test-archive-recovery-takes-precedence-over-stre.patchapplication/octet-stream; name=v10-0003-Test-archive-recovery-takes-precedence-over-stre.patchDownload
From d963f25db76d1e89dec5474a75fe8b626c232dbf Mon Sep 17 00:00:00 2001
From: Sunil S <sunil.s@broadcom.com>
Date: Mon, 7 Jul 2025 11:36:56 +0530
Subject: [PATCH v10 3/3] Test archive recovery takes precedence over streaming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since archive recovery is typically faster and more efficient than streaming,
it is not ideal to start the WAL receiver before attempting recovery from
archived WAL files. Furthermore, determining the correct starting point for
streaming by examining the last valid WAL segment restored from the archive
adds complexity and potential risk.

Therefore, even when wal_receiver_start_at is set to 'consistency', archive
recovery should take precedence, and the WAL receiver should only be started
after archive recovery is exhausted or deemed unavailable.
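
As a sketch of the setup this test exercises (the restore_command path is a
placeholder), such a standby keeps restoring from the archive and does not
start the WAL receiver early despite the GUC:

    restore_command = 'cp /mnt/server/archive/%f %p'
    wal_receiver_start_at = 'consistency'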

Co-authors: Sunil S<sunilfeb26@gmail.com>, Soumyadeep Chakraborty <soumyadeep2007@gmail.com>, Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 .../recovery/t/050_archive_enabled_standby.pl | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 src/test/recovery/t/050_archive_enabled_standby.pl

diff --git a/src/test/recovery/t/050_archive_enabled_standby.pl b/src/test/recovery/t/050_archive_enabled_standby.pl
new file mode 100644
index 00000000000..5e8cbada842
--- /dev/null
+++ b/src/test/recovery/t/050_archive_enabled_standby.pl
@@ -0,0 +1,77 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test that a standby with archive recovery enabled does not start the WAL
+# receiver early:
+# - Sets wal_receiver_start_at = 'consistency' on the standby
+# - Ensures the WAL receiver is not started while restore_command is configured
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use File::Copy;
+use Test::More tests => 1;
+
+# Initialize primary node with archiving enabled and start it.
+my $node_primary = PostgreSQL::Test::Cluster->new('test');
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(2);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup, with restoring from archived WAL enabled.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1,  has_restoring => 1);
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_at = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should not have started, since archive recovery is available.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;", 'f')
+	or die "Timed out while waiting to confirm that the WAL receiver has not started";
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+	'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.50.1