Implement waiting for wal lsn replay: reloaded

Started by Alexander Korotkov about 1 year ago, 101 messages
#1 Alexander Korotkov
aekorotkov@gmail.com
1 attachment(s)

Hi!

Introduction

The simple way to wait for a given lsn to replay on standby appears to
be useful because it provides a way to achieve read-your-writes
consistency while working with both replication leader and standby.
And it's both handy and cheaper to have built-in functionality for
that instead of polling pg_last_wal_replay_lsn().
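
For comparison, a client that polls has to compare LSNs itself. Below is
a minimal sketch of that client-side step, assuming the LSN arrives in
the usual textual "hi/lo" hexadecimal form and that unsigned int is 32
bits wide. parse_lsn() and lsn_is_replayed() are hypothetical helper
names, not part of the patch or of libpq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse "XXXXXXXX/XXXXXXXX" (two hex halves) into a
 * 64-bit LSN.  Returns false on malformed input.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	assert(text != NULL && lsn != NULL);

	/* %n records how many characters were read, to reject trailing junk */
	if (sscanf(text, "%8X/%8X%n", &hi, &lo, &consumed) != 2 ||
		text[consumed] != '\0')
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/* The target counts as replayed once the replay position has reached it. */
static bool
lsn_is_replayed(uint64_t target, uint64_t replayed)
{
	return target <= replayed;
}
```

A polling client would call lsn_is_replayed() in a sleep loop against
each value read from pg_last_wal_replay_lsn(), which is the round-trip
cost a built-in wait avoids.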

Key problem

While this feature generally looks trivial, there is a surprisingly
hard problem. While waiting for an LSN to replay, you must not hold any
snapshots. If you hold a snapshot on standby, that snapshot could
prevent the replay of WAL records. In turn, that could prevent the
wait to finish, causing a kind of deadlock. Therefore, waiting for
LSN to replay couldn't be implemented as a function. My last attempt
implements this functionality as a stored procedure [1]. This
approach generally works but has a couple of serious limitations.
1) Given that a CALL statement has to look up a catalog for the stored
procedure, we can't work inside a transaction of REPEATABLE READ or a
higher isolation level (even if nothing has been done before in that
transaction). It is especially unpleasant that this limitation covers
the case of the implicit transaction when
default_transaction_isolation = 'repeatable read' [2]. I had a
workaround for that [3], but it looks a bit awkward.
2) Using output parameters for a stored procedure causes an extra
snapshot to be held. And that snapshot is difficult (unsafe?) to
release [3].

Present solution

The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.
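
As an illustration of the syntax above, here is a minimal sketch of how
a client might assemble the command text. build_wait_for() is a
hypothetical helper, not part of the patch. It assumes the timeout is
given in milliseconds and that 0 means no timeout, which is how the
attached patch treats the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a WAIT FOR command into buf.
 * Returns the command length, or -1 if the LSN text contains a quote or
 * the buffer is too small.
 */
static int
build_wait_for(char *buf, size_t buflen, const char *target_lsn,
			   long timeout_ms, bool throw_on_failure)
{
	int			n;

	assert(buf != NULL && target_lsn != NULL && timeout_ms >= 0);

	/* Refuse anything that could break out of the quoted literal */
	if (strchr(target_lsn, '\'') != NULL)
		return -1;

	n = snprintf(buf, buflen,
				 "WAIT FOR LSN '%s', TIMEOUT '%ld', THROW '%s'",
				 target_lsn, timeout_ms,
				 throw_on_failure ? "true" : "false");

	/* snprintf reports the untruncated length; treat truncation as failure */
	return (n >= 0 && (size_t) n < buflen) ? n : -1;
}
```

All option values are passed as quoted strings because the options are
parsed as a generic_option_list.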

Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
look up a system catalog. Thus, we don't have to set a transaction
snapshot.

Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.
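
A minimal sketch of how a client could interpret that result status
follows. The three strings are the ones the attached patch emits. The
enum and classify_wait_status() are hypothetical names used only for
illustration. In the patch, the non-success strings are returned only
when THROW is false; otherwise an error is raised instead.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical client-side view of the WAIT FOR result status */
typedef enum
{
	WAIT_STATUS_SUCCESS,
	WAIT_STATUS_TIMEOUT,
	WAIT_STATUS_NOT_IN_RECOVERY,
	WAIT_STATUS_UNKNOWN
} WaitStatus;

/* Map the single text column returned by WAIT FOR to an enum value. */
static WaitStatus
classify_wait_status(const char *status)
{
	assert(status != NULL);

	if (strcmp(status, "success") == 0)
		return WAIT_STATUS_SUCCESS;
	if (strcmp(status, "timeout") == 0)
		return WAIT_STATUS_TIMEOUT;
	if (strcmp(status, "not in recovery") == 0)
		return WAIT_STATUS_NOT_IN_RECOVERY;

	/* Anything else is unexpected for this version of the patch */
	return WAIT_STATUS_UNKNOWN;
}
```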

Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].
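
The promotion check can be pictured as a pure decision function. This
is only a sketch with hypothetical names. The patch itself obtains
these inputs from RecoveryInProgress(), PromoteIsTriggered() and
GetXLogReplayRecPtr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome codes for one check of the wait condition */
typedef enum
{
	OUTCOME_SUCCESS,			/* target LSN already replayed */
	OUTCOME_KEEP_WAITING,		/* still in recovery, target not reached */
	OUTCOME_RECOVERY_ENDED,		/* promoted before reaching the target */
	OUTCOME_NOT_A_STANDBY		/* never in recovery: wrong server */
} WaitOutcome;

static WaitOutcome
check_wait_condition(bool in_recovery, bool promote_triggered,
					 uint64_t target, uint64_t replayed)
{
	if (in_recovery)
		return (target <= replayed) ? OUTCOME_SUCCESS : OUTCOME_KEEP_WAITING;

	/*
	 * Not in recovery.  A promoted standby may still have replayed the
	 * target before promotion, so that case counts as success.
	 */
	if (promote_triggered)
		return (target <= replayed) ? OUTCOME_SUCCESS : OUTCOME_RECOVERY_ENDED;

	return OUTCOME_NOT_A_STANDBY;
}
```

The last two outcomes correspond to the two different error texts the
patch reports when recovery is not in progress.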

Any comments?

Links.
1. /messages/by-id/E1sZwuz-002NPQ-Lc@gemulon.postgresql.org
2. /messages/by-id/14de8671-e328-4c3e-b136-664f6f13a39f@iki.fi
3. /messages/by-id/CAPpHfdvRmTzGJw5rQdSMkTxUPZkjwtbQ=LJE2u9Jqh9gFXHpmg@mail.gmail.com
4. /messages/by-id/4953563546cb8c8851f84c7debf723ef@postgrespro.ru
5. /messages/by-id/ab0eddce-06d4-4db2-87ce-46fa2427806c@iki.fi

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 496808d1e9af1ae20bab59761be9d27c0cbaca2a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 19 Nov 2024 07:16:41 +0200
Subject: [PATCH v1] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 336 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  14 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 28 files changed, 966 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/043_wait_for_lsn.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557cd..91d258f9df1 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b7ebcc2a557..004f7e10e55 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..f14d3933aec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6173,6 +6174,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d6614..869cb524082 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, walk over
+			 * the shared memory pairing heap and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..313c8cc35df
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber *wakeUpProcs;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+	}
+	pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..d8f6965d8c6 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..3f06dc53410 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..3cc5b2e832f
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	char	   *result;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot if any.  Then we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (GetOldestSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 67eb96396af..7b692954f20 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -299,7 +299,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -778,7 +778,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1106,6 +1106,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16266,6 +16267,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17922,6 +17931,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..d68aa29d93e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 720ef99ee83..1f4c93520ff 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -891,6 +892,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0c45fcf318f..116642b81b6 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1168,10 +1168,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1760,7 +1761,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f28bf371059..1507f784ac0 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..8efb4044d6f 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..41234f6b961
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0f9462493e3..1502be41688 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4258,4 +4258,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 899d64ad55f..87c58d2063b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -491,6 +491,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..88dc79b2bd6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd93..02a6d576f08 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..32040d43550 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_for_lsn.pl b/src/test/recovery/t/043_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/043_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When waiting is
+# finished, the log_count() function logs whether as many rows are visible
+# as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b54428b38cd..cac2424a99b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3129,7 +3129,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#2Kirill Reshke
reshkekirill@gmail.com
In reply to: Alexander Korotkov (#1)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, 27 Nov 2024 at 09:09, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

Introduction

The simple way to wait for a given lsn to replay on standby appears to
be useful because it provides a way to achieve read-your-writes
consistency while working with both replication leader and standby.
And it's both handy and cheaper to have built-in functionality for
that instead of polling pg_last_wal_replay_lsn().

Key problem

While this feature generally looks trivial, there is a surprisingly
hard problem. While waiting for an LSN to replay, you should not hold
any snapshots. If you hold a snapshot on standby, that snapshot could
prevent the replay of WAL records. In turn, that could prevent the
wait from finishing, causing a kind of deadlock. Therefore, waiting for
LSN to replay couldn't be implemented as a function. My last attempt
implements this functionality as a stored procedure [1]. This
approach generally works but has a couple of serious limitations.
1) Given that a CALL statement has to look up a catalog for the stored
procedure, we can't work inside a transaction of REPEATABLE READ or a
higher isolation level (even if nothing has been done before in that
transaction). It is especially unpleasant that this limitation covers
the case of the implicit transaction when
default_transaction_isolation = 'repeatable read' [2]. I had a
workaround for that [3], but it looks a bit awkward.
2) Using output parameters for a stored procedure causes an extra
snapshot to be held. And that snapshot is difficult (unsafe?) to
release [3].

Present solution

The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.

Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
look up a system catalog. Thus, we don't have to set a transaction
snapshot.

Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.

Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].
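
To illustrate the result handling described above, here is a minimal C
sketch of how a wait outcome could map either to a returned status or to
an error. The enum mirrors WaitLSNResult from the patch's xlogwait.h, and
the status strings are the ones the TAP test expects. The helper names
wait_result_status() and wait_result_is_error() are hypothetical and are
not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Mirrors the WaitLSNResult enum declared in the patch's xlogwait.h. */
typedef enum
{
    WAIT_LSN_RESULT_SUCCESS,         /* target LSN is reached */
    WAIT_LSN_RESULT_TIMEOUT,         /* timeout occurred */
    WAIT_LSN_RESULT_NOT_IN_RECOVERY  /* recovery ended before or during wait */
} WaitLSNResult;

/*
 * Hypothetical helper: the status text returned to the client.  The
 * strings match what the TAP test compares against.
 */
const char *
wait_result_status(WaitLSNResult result)
{
    switch (result)
    {
        case WAIT_LSN_RESULT_SUCCESS:
            return "success";
        case WAIT_LSN_RESULT_TIMEOUT:
            return "timeout";
        case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
            return "not in recovery";
    }

    /* Not reachable for valid enum values. */
    assert(false);
    return "unknown";
}

/*
 * Hypothetical helper: with THROW enabled (the behavior the test shows
 * when THROW is not specified), any outcome other than success is raised
 * as an error; with THROW 'false' the status is returned instead.
 */
bool
wait_result_is_error(WaitLSNResult result, bool throw_on_failure)
{
    return throw_on_failure && result != WAIT_LSN_RESULT_SUCCESS;
}
```

With THROW 'false', a caller would read the status string and decide
whether to retry on the standby or fall back to the primary.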

Any comments?

Links.
1. /messages/by-id/E1sZwuz-002NPQ-Lc@gemulon.postgresql.org
2. /messages/by-id/14de8671-e328-4c3e-b136-664f6f13a39f@iki.fi
3. /messages/by-id/CAPpHfdvRmTzGJw5rQdSMkTxUPZkjwtbQ=LJE2u9Jqh9gFXHpmg@mail.gmail.com
4. /messages/by-id/4953563546cb8c8851f84c7debf723ef@postgrespro.ru
5. /messages/by-id/ab0eddce-06d4-4db2-87ce-46fa2427806c@iki.fi

------
Regards,
Alexander Korotkov
Supabase

Hi!

What's the current status of
https://commitfest.postgresql.org/50/5167/ ? Should we close it or
reattach to this thread?

--
Best regards,
Kirill Reshke

#3Andrei Lepikhov
lepihov@gmail.com
In reply to: Kirill Reshke (#2)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

On 12/4/24 18:12, Kirill Reshke wrote:

On Wed, 27 Nov 2024 at 09:09, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Any comments?

What's the current status of
https://commitfest.postgresql.org/50/5167/ ? Should we close it or
reattach to this thread?

To push this feature further I rebased the patch onto current master.
Also, let's add a commitfest entry:
https://commitfest.postgresql.org/52/5550/

--
regards, Andrei Lepikhov

Attachments:

v2-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From ea224b84d343ea726f47af30a7a974e0736d79cc Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 6 Feb 2025 14:13:09 +0700
Subject: [PATCH v2] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.
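
The LSN-ordered waiter queue described above can be sketched in standalone
C as follows. This is only an illustration under stated assumptions: it
uses a plain binary min-heap instead of PostgreSQL's pairing heap, omits
the WaitLSNLock locking and the latch signalling, and all names
(WaiterQueue, waiter_queue_add, waiter_queue_wakeup, MAX_WAITERS,
INVALID_LSN) are hypothetical. As in the patch, an invalid LSN wakes up
every waiter, and the minimum waited LSN is UINT64_MAX when nobody waits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 64            /* hypothetical fixed capacity */
#define INVALID_LSN UINT64_C(0)   /* stands in for InvalidXLogRecPtr */

typedef struct Waiter
{
    uint64_t wait_lsn;            /* LSN this process is waiting for */
    int      procno;              /* process to wake up */
} Waiter;

typedef struct WaiterQueue
{
    Waiter   items[MAX_WAITERS];  /* binary min-heap ordered by wait_lsn */
    size_t   count;
    uint64_t min_waited_lsn;      /* UINT64_MAX when nobody is waiting */
} WaiterQueue;

static void
swap_items(Waiter *a, Waiter *b)
{
    Waiter tmp = *a;

    *a = *b;
    *b = tmp;
}

/* Refresh the cached minimum used by the replay-side fast-path check. */
static void
update_min(WaiterQueue *q)
{
    q->min_waited_lsn = (q->count > 0) ? q->items[0].wait_lsn : UINT64_MAX;
}

void
waiter_queue_init(WaiterQueue *q)
{
    q->count = 0;
    q->min_waited_lsn = UINT64_MAX;
}

/* Register a waiter and restore heap order by sifting it up. */
void
waiter_queue_add(WaiterQueue *q, int procno, uint64_t lsn)
{
    size_t i;

    assert(q->count < MAX_WAITERS);
    i = q->count++;
    q->items[i].wait_lsn = lsn;
    q->items[i].procno = procno;

    while (i > 0)
    {
        size_t parent = (i - 1) / 2;

        if (q->items[parent].wait_lsn <= q->items[i].wait_lsn)
            break;
        swap_items(&q->items[parent], &q->items[i]);
        i = parent;
    }
    update_min(q);
}

/* Remove the waiter with the smallest LSN and sift the new root down. */
static void
remove_first(WaiterQueue *q)
{
    size_t i = 0;

    q->items[0] = q->items[--q->count];
    for (;;)
    {
        size_t left = 2 * i + 1;
        size_t right = left + 1;
        size_t smallest = i;

        if (left < q->count &&
            q->items[left].wait_lsn < q->items[smallest].wait_lsn)
            smallest = left;
        if (right < q->count &&
            q->items[right].wait_lsn < q->items[smallest].wait_lsn)
            smallest = right;
        if (smallest == i)
            break;
        swap_items(&q->items[i], &q->items[smallest]);
        i = smallest;
    }
}

/*
 * Collect the process numbers of all waiters whose LSN is <= current_lsn
 * (or of all waiters if current_lsn is INVALID_LSN) into woken[] and
 * return how many were collected.  A real implementation would set their
 * latches after releasing the lock.
 */
size_t
waiter_queue_wakeup(WaiterQueue *q, uint64_t current_lsn, int *woken)
{
    size_t n = 0;

    while (q->count > 0)
    {
        if (current_lsn != INVALID_LSN && q->items[0].wait_lsn > current_lsn)
            break;
        woken[n++] = q->items[0].procno;
        remove_first(q);
    }
    update_min(q);
    return n;
}
```

Because the heap keeps the smallest waited LSN on top, the replay side
only needs to compare the replayed LSN with min_waited_lsn to decide
whether the wakeup routine has to be called at all.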

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 336 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  15 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/044_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 28 files changed, 967 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/044_wait_for_lsn.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867..8b585cba75 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83f..bd14ec00d2 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..a32f473e0a 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8..74a62ab3ea 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d7..8336bb0cd1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9c270e7d46..62c37f31ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6194,6 +6195,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 473de6710d..5364576ca5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1829,6 +1830,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 0000000000..313c8cc35d
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber *wakeUpProcs;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we can set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+	}
+	pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ef0d407a38..f5db28bbd2 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..8351733500
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	char	   *result;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  Thanks to
+	 * PlannedStmtRequiresSnapshot(), WaitStmt is normally processed without
+	 * a snapshot.  If one is active anyway, we can pop it, because
+	 * PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot if any.  After that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, a procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1..fa8431f794 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d7f9c00c40..67aa9554e2 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16334,6 +16335,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17991,6 +18000,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18646,6 +18656,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed7036..27b447b7a7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a2..dbb613663f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 6f22496305..661296107c 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1162,10 +1162,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1751,7 +1752,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d5801..d23ac3b0f0 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f07162..3b282043ec 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for the replay of a particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 0000000000..41234f6b96
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..a7fa00ed41
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1..567586f2ec 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ffe155ee20..3dc1c1a56f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4305,4 +4305,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index cf2917ad07..0d0d8f4ab4 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -492,6 +492,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf56545238..a3f6607128 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d5..c4606d6504 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0428704dbf..c1328b1e16 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -52,6 +52,7 @@ tests += {
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
+      't/044_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/044_wait_for_lsn.pl b/src/test/recovery/t/044_wait_for_lsn.pl
new file mode 100644
index 0000000000..79c2c49b9c
--- /dev/null
+++ b/src/test/recovery/t/044_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby sees the new data after WAIT FOR");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, the log_count() function logs whether as many rows are visible
+# as there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a3bee93de..1e0be9f4f6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3149,7 +3149,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5

#4Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#1)
Re: Implement waiting for wal lsn replay: reloaded

27.11.2024 07:08, Alexander Korotkov wrote:

Present solution

The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.

Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
look up a system catalog. Thus, we don't have to set a transaction snapshot.

Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.

Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].

Any comments?

Good day, Alexander.

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching them all at once.

2. I'd move the `inHeap` field between `procno` and `phNode` to fill the gap
between fields on 64-bit platforms.
Well, I believe it would be better to tweak `pairingheap_node` to make it
clear whether it is in a heap or not. But such a change would be unrelated to
the purpose of the current patch. So let's stick with `inHeap`, but move it a bit.
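
The batching idea from remark 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch: `WaiterQueue`, `wake_in_batches` and `WAKEUP_BATCH_SIZE` are hypothetical names, a sorted array stands in for the shared-memory pairing heap, and the lock and latch calls are reduced to comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16

/* Hypothetical stand-in for the waiter queue: target LSNs in ascending order. */
typedef struct WaiterQueue
{
    const uint64_t *lsns;   /* sorted target LSNs */
    size_t      count;      /* total number of waiters */
    size_t      head;       /* index of the first waiter still queued */
} WaiterQueue;

/*
 * Dequeue every waiter whose LSN is <= current_lsn, at most
 * WAKEUP_BATCH_SIZE at a time, so that no allocation is ever needed.
 * Returns the number of waiters dequeued; *batches receives the number
 * of non-empty batches processed.
 */
static size_t
wake_in_batches(WaiterQueue *q, uint64_t current_lsn, size_t *batches)
{
    size_t      total = 0;

    assert(q->head <= q->count);
    *batches = 0;

    for (;;)
    {
        size_t      batch[WAKEUP_BATCH_SIZE];
        size_t      n = 0;

        /* (the lock would be acquired here) */
        while (n < WAKEUP_BATCH_SIZE && q->head < q->count &&
               q->lsns[q->head] <= current_lsn)
            batch[n++] = q->head++;
        /* (the lock would be released here) */

        if (n == 0)
            break;

        /* (latches for batch[0..n) would be set here, outside the lock) */
        (void) batch;
        total += n;
        (*batches)++;

        /* A partially filled batch means no eligible waiters remain. */
        if (n < WAKEUP_BATCH_SIZE)
            break;
    }

    return total;
}
```

With this shape the lock is taken once per batch rather than once per call, which is the trade-off against collecting everything in one pass.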

Non-code question: do you imagine the `WAIT` command being reused for other cases?
Is the syntax rule in gram.y convenient enough for such reuse? I believe `LSN`
is not part of the syntax in order not to introduce a new keyword. But is that the right way?
I have no answer or strong opinion.

Otherwise, the patch looks quite strong to me.

-------
regards
Yura Sokolov

#5Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#4)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Yura!

On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching them all at once.

OK, I've refactored this to use a static array of 16 elements. palloc() is
used only if the waiters don't fit into the static array.
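
A minimal sketch of the "stack array first, heap allocation only on overflow" pattern, assuming plain `malloc()` in place of `palloc()`. The names `collect_and_sum` and `STATIC_ARRAY_SIZE` are hypothetical and exist only for this illustration. Note that when switching buffers, the entries already collected in the static array have to be copied over.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STATIC_ARRAY_SIZE 16

/*
 * Collect src[0..nsrc) into a working buffer and return the sum of the
 * collected values.  The buffer lives on the stack unless more than
 * STATIC_ARRAY_SIZE items arrive; then a heap buffer of max_items entries
 * is allocated once.  *used_heap reports which path was taken.
 */
static long
collect_and_sum(const int *src, size_t nsrc, size_t max_items, int *used_heap)
{
    int         static_buf[STATIC_ARRAY_SIZE];
    int        *buf = static_buf;
    size_t      n = 0;
    long        sum = 0;

    assert(nsrc <= max_items);
    *used_heap = 0;

    for (size_t i = 0; i < nsrc; i++)
    {
        if (buf == static_buf && n >= STATIC_ARRAY_SIZE)
        {
            /* Overflow: switch to the heap and keep what we have so far. */
            buf = malloc(sizeof(int) * max_items);
            if (buf == NULL)
                abort();
            memcpy(buf, static_buf, sizeof(static_buf));
            *used_heap = 1;
        }
        buf[n++] = src[i];
    }

    for (size_t i = 0; i < n; i++)
        sum += buf[i];

    if (buf != static_buf)
        free(buf);

    return sum;
}
```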

2. I'd move the `inHeap` field between `procno` and `phNode` to fill the gap
between fields on 64-bit platforms.
Well, I believe it would be better to tweak `pairingheap_node` to make it
clear whether it is in a heap or not. But such a change would be unrelated to
the purpose of the current patch. So let's stick with `inHeap`, but move it a bit.

Ok, `inHeap` is moved.
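
To illustrate why the position of a `bool` field matters, here is a small sketch comparing two layouts. The struct definitions are assumptions made for this example (the real `WaitLSNProcInfo` is defined in xlogwait.h and may differ); `HeapNode` is a stand-in for a three-pointer heap node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a heap node consisting of three pointers. */
typedef struct HeapNode
{
    struct HeapNode *first_child;
    struct HeapNode *next_sibling;
    struct HeapNode *prev_or_parent;
} HeapNode;

/* Flag placed last: it needs its own trailing padding. */
typedef struct InfoFlagLast
{
    uint64_t    waitLSN;
    int         procno;
    HeapNode    phNode;
    bool        inHeap;
} InfoFlagLast;

/* Flag placed right after the int: it can reuse the alignment gap. */
typedef struct InfoFlagInGap
{
    uint64_t    waitLSN;
    int         procno;
    bool        inHeap;
    HeapNode    phNode;
} InfoFlagInGap;

/* Number of bytes saved per entry by moving the flag into the gap. */
static size_t
padding_saved(void)
{
    assert(sizeof(InfoFlagInGap) <= sizeof(InfoFlagLast));
    return sizeof(InfoFlagLast) - sizeof(InfoFlagInGap);
}
```

On a typical 64-bit platform with 8-byte pointers, the first layout would be expected to be one alignment unit larger than the second; the exact numbers depend on the ABI.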

Non-code question: do you imagine the `WAIT` command being reused for other cases?
Is the syntax rule in gram.y convenient enough for such reuse? I believe `LSN`
is not part of the syntax in order not to introduce a new keyword. But is that the right way?
I have no answer or strong opinion.

This is a conscious decision. New rules and new keywords cause extra
states in the parser state machine. That could raise the question of
whether the feature is valuable enough to justify slowing down the parser.
This is why I tried to make this feature as non-invasive as possible
in terms of the parser. And yes, there could potentially be other things
to wait for. For instance, instead of waiting for LSN replay, we could be
waiting for the replay of a given xid to finish.
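
As a rough illustration of this trade-off: when options arrive as a generic name/value list, their names are interpreted at execution time rather than by the grammar. The sketch below is standalone C and not the patch's code; `Option`, `WaitParams`, `name_equals` and `parse_wait_options` are hypothetical names, and the value parsing is deliberately simplified.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One generic option as the parser would hand it over: a name and a value. */
typedef struct Option
{
    const char *name;
    const char *value;
} Option;

typedef struct WaitParams
{
    const char *lsn;            /* required */
    long        timeout_ms;     /* 0 means no timeout */
    bool        throw_error;    /* defaults to true */
} WaitParams;

/* Case-insensitive comparison of two option names. */
static bool
name_equals(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Interpret the option list.  Returns false for an unknown option name,
 * a missing "lsn" option, or a negative timeout.
 */
static bool
parse_wait_options(const Option *opts, size_t nopts, WaitParams *out)
{
    assert(opts != NULL || nopts == 0);

    out->lsn = NULL;
    out->timeout_ms = 0;
    out->throw_error = true;

    for (size_t i = 0; i < nopts; i++)
    {
        if (name_equals(opts[i].name, "lsn"))
            out->lsn = opts[i].value;
        else if (name_equals(opts[i].name, "timeout"))
            out->timeout_ms = strtol(opts[i].value, NULL, 10);
        else if (name_equals(opts[i].name, "throw"))
            out->throw_error = (strcmp(opts[i].value, "false") != 0);
        else
            return false;       /* unknown option name */
    }

    return out->lsn != NULL && out->timeout_ms >= 0;
}
```

Supporting another kind of wait target in this scheme would mean adding one more branch to the name dispatch, with no new grammar rule or keyword.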

Otherwise, the patch looks quite strong to me.

Great, thank you!

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v2-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 6324f7496fac463d98857b2c8ac9cbe3f2f40abf Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 19 Nov 2024 07:16:41 +0200
Subject: [PATCH v2] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 351 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 +++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  14 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/044_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 28 files changed, 981 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/044_wait_for_lsn.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a50fd99d9e5..12ea4f2cb45 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6194,6 +6195,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 473de6710d7..5364576ca5a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1829,6 +1830,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory pairing heap and set latches to notify
+			 * the waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..5b70ba90ec1
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,351 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to avoid palloc() for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcsStatic[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	ProcNumber *wakeUpProcs = wakeUpProcsStatic;
+	int			numWakeUpProcs = 0;
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		/* Static array is full: switch to a palloc'd one, keeping the entries */
+		if (wakeUpProcs == wakeUpProcsStatic &&
+			numWakeUpProcs >= WAKEUP_PROC_STATIC_ARRAY_SIZE)
+		{
+			wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+			memcpy(wakeUpProcs, wakeUpProcsStatic,
+				   sizeof(wakeUpProcsStatic));
+		}
+
+		Assert(numWakeUpProcs < MaxBackends);
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	if (wakeUpProcs != wakeUpProcsStatic)
+		pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..d8f6965d8c6 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ef0d407a383..f5db28bbd22 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..83517335003
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	char	   *result;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we
+	 * should be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d3887628d46..4f8f242b2cf 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 6f22496305a..661296107ce 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1162,10 +1162,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1751,7 +1752,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 8dd421fa0ef..08fb233ecae 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4306,4 +4306,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0428704dbfd..52ec036e27e 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -52,6 +52,7 @@ tests += {
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
+      't/044_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/044_wait_for_lsn.pl b/src/test/recovery/t/044_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/044_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6c170ac249..6b05cd3842f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3151,7 +3151,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#6Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#5)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

17.02.2025 00:27, Alexander Korotkov wrote:

On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array,
say of 16 elements, and a loop around the remaining function body acting in
batches of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.

OK, I've refactored this to use a static array of size 16. palloc() is
used only if the waiters don't fit into the static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep the indentation; perhaps `do {} while` would be better?
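
For reference, the `do {} while` variant of the batched wakeup could look roughly like the sketch below. This is not the code from the attached patch: the queue is a toy sorted array instead of the shared-memory pairing heap, "waking" a waiter is reduced to counting, and the names `WaiterQueue`, `collect_batch`, `wake_waiters` and `WAKEUP_BATCH_SIZE` are hypothetical, used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16	/* hypothetical; mirrors the "16" from the review */

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/* Toy waiter queue: LSNs sorted ascending, head points at the smallest one. */
typedef struct WaiterQueue
{
	const Lsn  *lsns;			/* waiters' target LSNs, sorted ascending */
	size_t		n;				/* total number of waiters */
	size_t		head;			/* index of the first waiter still queued */
	size_t		batches;		/* number of non-empty batches processed */
} WaiterQueue;

/*
 * Remove up to WAKEUP_BATCH_SIZE waiters whose LSN is <= current and copy
 * them into the caller's fixed-size array.  In a real implementation this
 * is the part that would run with the lock held.
 */
static size_t
collect_batch(WaiterQueue *q, Lsn current, Lsn *batch)
{
	size_t		count = 0;

	while (count < WAKEUP_BATCH_SIZE &&
		   q->head < q->n &&
		   q->lsns[q->head] <= current)
		batch[count++] = q->lsns[q->head++];

	return count;
}

/*
 * Wake all waiters whose LSN has been reached, one fixed-size batch at a
 * time, without any dynamic allocation.  Returns the number of waiters woken.
 */
static size_t
wake_waiters(WaiterQueue *q, Lsn current)
{
	Lsn			batch[WAKEUP_BATCH_SIZE];
	size_t		count;
	size_t		total = 0;

	do
	{
		count = collect_batch(q, current, batch);

		/* Lock would be released here; "wake" each collected waiter. */
		for (size_t i = 0; i < count; i++)
		{
			assert(batch[i] <= current);
			total++;
		}

		if (count > 0)
			q->batches++;

		/* A full batch means more waiters may remain, so go around again. */
	} while (count == WAKEUP_BATCH_SIZE);

	return total;
}
```

With 40 waiters of which 35 have reached their LSN, this loop runs three times (16, 16 and 3 waiters). The cost of the approach is one extra, empty pass when the number of ready waiters is an exact multiple of the batch size.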

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v3-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From fa107e15eab3ec2493f0663f03b563d49979e0b5 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Fri, 28 Feb 2025 15:40:18 +0300
Subject: [PATCH v3] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
a specific WAL location to be replayed.  This command is useful when
the user makes some data changes on the primary and needs a guarantee that
these changes are visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 347 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  14 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/045_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 28 files changed, 977 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52f53fa12e0..b03a39b510d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  A single iteration should be enough in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+resume:
+	numWakeUpProcs = 0;
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		Assert(numWakeUpProcs < MaxBackends);
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+
+		if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+			break;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we can set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	/* Need to recheck if there were more waiters than static array size. */
+	if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+		goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  With that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c6..11265ae3383 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0b208f51bdd..1c3baac08a9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4317,4 +4317,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
+      't/045_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, the log_count() function logs a message if as many rows are
+# visible as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fcb968e1ffe..7b6c30c8d4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3169,7 +3169,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.43.0

#7Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#6)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

28.02.2025 16:03, Yura Sokolov wrote:

17.02.2025 00:27, Alexander Korotkov wrote:

On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array,
say of 16 elements, and a loop around the remaining function body acting in
batches of 16 wakeups. It is doubtful there will often be more than 16
waiting clients, and even then it won't be much heavier than fetching all at once.

OK, I've refactored this to use a static array of size 16. palloc() is
used only if the waiters don't fit into the static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made the loop in WaitLSNWakeup work without a call to palloc. It uses "goto"
to keep the indentation; perhaps `do {} while` would be better?

Also fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it was missing from
gram.y's bare_label_keyword rule.
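
For illustration, the `do {} while` variant of the batched wakeup loop could look roughly like the standalone sketch below. It is not the patch code: the Waiter and WaiterQueue types and the collect_batch() and wake_waiters() functions are hypothetical stand-ins, a sorted array replaces the shared-memory pairing heap, the lock is only indicated in comments, and setting a flag stands in for SetLatch(). As in the patch, an LSN of 0 (InvalidXLogRecPtr) means "wake everyone".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16

/* Hypothetical waiter: the LSN it waits for and a flag standing in for its latch. */
typedef struct Waiter
{
	uint64_t	lsn;
	int			woken;
} Waiter;

/* Hypothetical queue: waiters sorted by LSN, smallest first (replaces the heap). */
typedef struct WaiterQueue
{
	Waiter	   *waiters;
	size_t		count;
	size_t		head;		/* index of the first waiter still queued */
} WaiterQueue;

/*
 * Remove up to WAKEUP_BATCH_SIZE waiters whose LSN is <= current_lsn and
 * record their indexes in batch[].  current_lsn == 0 removes everyone.
 * In the patch, this part would run while holding WaitLSNLock.
 */
size_t
collect_batch(WaiterQueue *q, uint64_t current_lsn, size_t *batch)
{
	size_t		n = 0;

	while (q->head < q->count && n < WAKEUP_BATCH_SIZE)
	{
		if (current_lsn != 0 && q->waiters[q->head].lsn > current_lsn)
			break;
		batch[n++] = q->head++;
	}
	return n;
}

/* Wake all eligible waiters batch by batch; returns how many were woken. */
size_t
wake_waiters(WaiterQueue *q, uint64_t current_lsn)
{
	size_t		batch[WAKEUP_BATCH_SIZE];
	size_t		n;
	size_t		total = 0;

	do
	{
		n = collect_batch(q, current_lsn, batch);

		/* The "lock" is released here; wake the collected waiters. */
		for (size_t i = 0; i < n; i++)
			q->waiters[batch[i]].woken = 1;
		total += n;

		/* A full batch means more eligible waiters may remain. */
	} while (n == WAKEUP_BATCH_SIZE);

	assert(q->head <= q->count);
	return total;
}
```

With 40 waiters on consecutive LSNs, waking up to the 35th LSN takes three passes (16, 16 and 3 waiters); the loop stops as soon as a batch comes back short.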

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v4-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From d9c44427a4cbecd6dd27edae48ea42d933756ff9 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Fri, 28 Feb 2025 15:40:18 +0300
Subject: [PATCH v4] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 347 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  15 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/045_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 28 files changed, 978 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52f53fa12e0..b03a39b510d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+resume:
+	numWakeUpProcs = 0;
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		Assert(numWakeUpProcs < MaxBackends);
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+
+		if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+			break;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs have already been replayed.
+	 * As this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we can set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	/* Need to recheck if there were more waiters than static array size. */
+	if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+		goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first make sure that
+	 * we don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c6..3034573648f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18655,6 +18665,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0b208f51bdd..1c3baac08a9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4317,4 +4317,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
+      't/045_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a function logs a message if at least as many rows are visible
+# as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fcb968e1ffe..7b6c30c8d4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3169,7 +3169,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.43.0

#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Yura Sokolov (#7)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

28.02.2025 16:03, Yura Sokolov wrote:

17.02.2025 00:27, Alexander Korotkov wrote:

On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in the `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. I doubt there will often be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.

OK, I've refactored this to use a static array of size 16. palloc() is
used only if the waiters don't fit into the static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep the indentation; perhaps `do {} while` would be better?

And fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule
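
For readers following the batching discussion above, here is a minimal,
standalone sketch of a wakeup loop that drains waiters in fixed batches of 16
using `do {} while` rather than palloc or goto. It is not the code from the
patch: the sorted array stands in for the shared-memory pairing heap, and
WaiterQueue, wake_waiter() and wake_waiters_up_to() are hypothetical names
invented for this illustration. Locking is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16	/* assumed batch size, as in the discussion */

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/*
 * Hypothetical waiter queue: LSNs sorted in ascending order, standing in
 * for the LSN-ordered pairing heap.  'head' is the first waiter that has
 * not been removed yet.
 */
typedef struct WaiterQueue
{
	const Lsn  *lsns;
	size_t		count;
	size_t		head;
} WaiterQueue;

/* Counts simulated wakeups; a real backend would set the waiter's latch. */
static size_t wakeups_done = 0;

static void
wake_waiter(Lsn lsn)
{
	(void) lsn;
	wakeups_done++;
}

/*
 * Wake every waiter whose LSN is <= current_lsn, at most WAKEUP_BATCH_SIZE
 * per iteration.  Returns the number of non-empty batches processed.
 */
static size_t
wake_waiters_up_to(WaiterQueue *q, Lsn current_lsn)
{
	Lsn			batch[WAKEUP_BATCH_SIZE];
	size_t		n;
	size_t		batches = 0;

	assert(q != NULL);

	do
	{
		n = 0;

		/* Would run with the lock held: detach up to one batch of waiters. */
		while (n < WAKEUP_BATCH_SIZE &&
			   q->head < q->count &&
			   q->lsns[q->head] <= current_lsn)
			batch[n++] = q->lsns[q->head++];

		/* Would run after releasing the lock: wake the detached waiters. */
		for (size_t i = 0; i < n; i++)
			wake_waiter(batch[i]);

		if (n > 0)
			batches++;

		/* A full batch means more satisfied waiters may remain. */
	} while (n == WAKEUP_BATCH_SIZE);

	return batches;
}
```

With 35 satisfied waiters, this loop runs three batches (16, 16 and 3). A
partially filled batch ends the loop, so the common case of a few waiters
costs a single pass over a stack-allocated array.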

Thank you, Yura. I've further revised the patch. Mostly, I added the
documentation, including an SQL command reference and a few paragraphs in
the high availability chapter explaining the read-your-writes
consistency concept.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v5-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 8431a654aa5b872acef2bca7e66dfaff7dd5254d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v5] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
a specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 216 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 347 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 185 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  15 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/045_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 30 files changed, 1248 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index acf3ac0601d..ae316b5a0c9 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..9d6d3175f02
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,216 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">parameter</replaceable> '<replaceable class="parameter">value</replaceable>' [, ... ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is raised
+    if <parameter>throw</parameter> is not specified or is set to true.
+    If <parameter>throw</parameter> is set to false, the command
+    reports these conditions through its return value instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      The target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and greater than zero, the command waits until
+      <parameter>lsn</parameter> is reached or the specified
+      <parameter>timeout</parameter> has elapsed.  Must be a non-negative
+      integer number of milliseconds; the default is zero.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">throw</replaceable></term>
+    <listitem>
+     <para>
+      Specifies whether to throw an error on timeout or when the command
+      runs on the primary.  The valid values are <literal>true</literal>
+      and <literal>false</literal>.  The default is <literal>true</literal>.
+      When set to <literal>false</literal>, the status can be obtained from
+      the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the
+    last modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    raises an error unless <parameter>throw</parameter> is set to false.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+   </programlisting>
+
+   Then the application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not
+    reached within the timeout.  In that case, an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example with <parameter>throw</parameter> set to false
+   returns the <literal>timeout</literal> status instead of an error.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100', THROW 'false';
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
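
The replay guarantee described in the Notes above (the replay position is greater than or equal to the target once the wait succeeds) can be illustrated outside the server. The following is only a minimal sketch, assuming LSNs arrive as the usual "hi/lo" hexadecimal text; `parse_lsn` and `lsn_reached` are hypothetical helper names and are not part of the patch or of libpq.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse an LSN written as two hexadecimal numbers
 * separated by a slash (for example "0/306EE20") into a 64-bit value.
 * Returns false if the text does not have exactly that shape.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char        extra;

    /* Exactly two conversions must succeed; trailing characters are rejected. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &extra) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/*
 * Hypothetical helper: true once the replay position has reached the target,
 * i.e. the condition that a successful wait is meant to establish.
 */
static bool
lsn_reached(const char *replay_text, const char *target_text)
{
    uint64_t    replay;
    uint64_t    target;

    if (!parse_lsn(replay_text, &replay) || !parse_lsn(target_text, &target))
        return false;

    return replay >= target;
}
```

With the values from the timeout example, a replay position of 0/306EA60 has not yet reached the target 0/306EE20, which is why that wait times out.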
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a829a055a97..1beb3999769 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+resume:
+	numWakeUpProcs = 0;
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		Assert(numWakeUpProcs < MaxBackends);
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+
+		if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+			break;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	/* Need to recheck if there were more waiters than static array size. */
+	if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+		goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this command was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the command call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
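
The batching in WaitLSNWakeup() above (collect a bounded number of waiters under the lock, release it, set latches, then repeat if the batch was full) can be modelled in isolation. This is a simplified sketch under stated assumptions, not the patch's code: a sorted array stands in for the pairing heap, and `WaiterQueue`, `wake_waiters`, `BATCH_SIZE`, and `INVALID_LSN` are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4            /* stand-in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define INVALID_LSN 0           /* stand-in for InvalidXLogRecPtr */

/* Toy waiter queue: pending target LSNs kept in ascending order. */
typedef struct WaiterQueue
{
    uint64_t    lsns[64];
    size_t      head;           /* index of the smallest pending LSN */
    size_t      count;          /* number of valid entries in lsns[] */
} WaiterQueue;

/*
 * Wake every waiter whose target is <= current_lsn, or all waiters when
 * current_lsn is INVALID_LSN, taking at most BATCH_SIZE waiters per pass.
 * Returns the total number of waiters woken.
 */
static size_t
wake_waiters(WaiterQueue *q, uint64_t current_lsn)
{
    size_t      total = 0;
    size_t      batch;

    do
    {
        batch = 0;

        /* In the patch, this part runs with WaitLSNLock held. */
        while (q->head < q->count && batch < BATCH_SIZE)
        {
            if (current_lsn != INVALID_LSN && q->lsns[q->head] > current_lsn)
                break;          /* smallest pending LSN is not replayed yet */
            q->head++;
            batch++;
        }

        /* In the patch, latches are set here, after the lock is released. */
        total += batch;
    } while (batch == BATCH_SIZE);  /* a full batch means more may remain */

    return total;
}
```

The bounded batch keeps the stack array small while still draining any number of waiters; a full batch is the only signal that another pass is needed.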
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
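
The way ExecWaitStmt() above turns a wait result into either a status row or an error can be summarised as a small mapping. The sketch below is illustrative only: the `Sketch*` enum and `sketch_wait_status` are hypothetical stand-ins, and a NULL return marks the cases in which the real command raises an error with ereport() rather than returning a row.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the WaitLSNResult values used by the patch. */
typedef enum
{
    SKETCH_WAIT_SUCCESS,
    SKETCH_WAIT_TIMEOUT,
    SKETCH_WAIT_NOT_IN_RECOVERY
} SketchWaitResult;

/*
 * Hypothetical helper: return the status text the command would report, or
 * NULL for the cases in which it would raise an error instead.
 */
static const char *
sketch_wait_status(SketchWaitResult result, bool throw_errors)
{
    switch (result)
    {
        case SKETCH_WAIT_SUCCESS:
            return "success";
        case SKETCH_WAIT_TIMEOUT:
            return throw_errors ? NULL : "timeout";
        case SKETCH_WAIT_NOT_IN_RECOVERY:
            return throw_errors ? NULL : "not in recovery";
    }

    return NULL;                /* not reached for valid enum values */
}
```

Success is reported the same way regardless of the throw setting; only the timeout and not-in-recovery outcomes depend on it.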
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..e4916148d02 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16369,6 +16370,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -18027,6 +18036,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18683,6 +18693,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48ef..1a99e98f55b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 3c594415bfd..5849967882e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..dffa714e2c8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4319,4 +4319,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
+      't/045_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..5ce3d36ae6d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3175,7 +3175,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#9Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Alexander Korotkov (#8)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

10.03.2025 14:30, Alexander Korotkov wrote:

On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

28.02.2025 16:03, Yura Sokolov wrote:

17.02.2025 00:27, Alexander Korotkov wrote:

On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I briefly looked into the patch and have a couple of minor remarks:

1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. There will rarely be more than 16 waiting clients, and even then
it won't be much heavier than fetching all at once.

OK, I've refactored this to use a static array of size 16. palloc() is
used only if the waiters don't fit in the static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep indentation; perhaps `do {} while` would be better?

And fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule

Thank you, Yura. I've further revised the patch. Mostly, I added
documentation, including an SQL command reference and a few paragraphs in
the high availability chapter explaining the read-your-writes
consistency concept.

Good day, Alexander.

Looking at the patch "for the last time", I found that references to the
`pg_wal_replay_wait` function remained in the documentation and in one comment.
So I fixed the documentation and removed the sentence from the comment.

Otherwise v6 is just rebased v5.
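
For readers following the batching discussion quoted above, here is a minimal standalone sketch of the `do {} while` variant: waiters are collected into a fixed local array of 16 entries and woken batch by batch, with no palloc. It is not the code from the patch. The sorted array standing in for the shared pairing heap and the names `WaiterQueue`, `wake_process`, `wake_waiters_in_batches` and `WAKEUP_BATCH_SIZE` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16	/* hypothetical name for the batch size */

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/*
 * Stand-in for the shared waiters heap: entries are kept sorted by
 * ascending LSN, so the entry at "head" plays the role of the heap top.
 */
typedef struct WaiterQueue
{
	Lsn			lsns[64];
	int			procnos[64];
	size_t		head;			/* index of the waiter with the least LSN */
	size_t		count;			/* number of valid entries in the arrays */
} WaiterQueue;

/* Stand-in for setting a process latch: only counts the wakeups. */
static int	wakeups_sent = 0;

static void
wake_process(int procno)
{
	(void) procno;
	wakeups_sent++;
}

/*
 * Wake every waiter whose LSN is <= current_lsn, at most
 * WAKEUP_BATCH_SIZE per round.  Returns the number of rounds executed.
 */
static int
wake_waiters_in_batches(WaiterQueue *queue, Lsn current_lsn)
{
	int			batch[WAKEUP_BATCH_SIZE];
	size_t		n;
	int			rounds = 0;

	do
	{
		n = 0;

		/* Real code would hold the lock only while filling the batch. */
		while (n < WAKEUP_BATCH_SIZE &&
			   queue->head < queue->count &&
			   queue->lsns[queue->head] <= current_lsn)
			batch[n++] = queue->procnos[queue->head++];

		/* Wake the collected processes outside of the locked section. */
		for (size_t i = 0; i < n; i++)
			wake_process(batch[i]);

		rounds++;
	} while (n == WAKEUP_BATCH_SIZE);	/* a full batch may mean more remain */

	return rounds;
}
```

A full batch triggers one more round because more waiters may remain; a short batch means the queue has been drained up to `current_lsn`, so the loop stops without needing a `goto`.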

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v6-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From 80b4cb8c0ac75168ab1fce55feccc4f08f32ce34 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v6] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 216 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 347 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 184 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  15 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 ++
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  89 +++++
 src/include/commands/wait.h                   |  21 ++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/045_wait_for_lsn.pl       | 217 +++++++++++
 src/tools/pgindent/typedefs.list              |   4 +
 30 files changed, 1247 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index acf3ac0601d..ae316b5a0c9 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..2352ae9493f
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,216 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">parameter</replaceable> '<replaceable class="parameter">value</replaceable>' [, ... ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    provided <parameter>throw</parameter> is not specified or is set to true.
+    If <parameter>throw</parameter> is set to false, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      The target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and greater than zero, the command waits until
+      <parameter>lsn</parameter> is reached or the specified
+      <parameter>timeout</parameter> has elapsed.  Must be a non-negative
+      integer; the default is zero.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">throw</replaceable></term>
+    <listitem>
+     <para>
+      Specify whether to throw an error in the case of timeout or
+      running on the primary.  The valid values are <literal>true</literal>
+      and <literal>false</literal>.  The default is <literal>true</literal>.
+      When set to <literal>false</literal> the status can be obtained from the
+      return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">success</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">not in recovery</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on standby.
+    That is, after this command's execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out if <parameter>throw</parameter> is true.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not reached
+    within the timeout.  In that case an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example with <parameter>throw</parameter> set to false returns
+   the <literal>timeout</literal> status instead of raising an error.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100', THROW 'false';
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a829a055a97..1beb3999769 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be enough to take a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+resume:
+	numWakeUpProcs = 0;
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to wake up, but to avoid holding
+	 * the lock for too long, send the wakeups only after releasing the lock.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		Assert(numWakeUpProcs < MaxBackends);
+		wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+
+		if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+			break;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes whose waited LSNs are already replayed.  As
+	 * this is a time-consuming operation, we do it outside of WaitLSNLock.
+	 * This is actually fine because procLatch isn't ever freed, so at worst
+	 * we can set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+		SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	/* Need to recheck if there were more waiters than static array size. */
+	if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+		goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..d95782ddaf8
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,184 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		char	   *name = str_tolower(defel->defname, strlen(defel->defname),
+									   DEFAULT_COLLATION_OID);
+
+		if (strcmp(name, "lsn") == 0)
+		{
+			lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												  CStringGetDatum(strVal(defel->arg))));
+		}
+		else if (strcmp(name, "timeout") == 0)
+		{
+			timeout = pg_strtoint64(strVal(defel->arg));
+		}
+		else if (strcmp(name, "throw") == 0)
+		{
+			throw = DatumGetBool(DirectFunctionCall1(boolin,
+													 CStringGetDatum(strVal(defel->arg))));
+		}
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("wrong wait argument: %s",
+							defel->defname)));
+		}
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first take care that
+	 * we don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..e4916148d02 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16369,6 +16370,14 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR generic_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
 
 /*
  * Aggregate decoration clauses
@@ -18027,6 +18036,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18683,6 +18693,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48ef..1a99e98f55b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 3c594415bfd..5849967882e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -346,6 +347,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..dffa714e2c8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4319,4 +4319,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
       't/042_low_level_backup.pl',
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
+      't/045_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than REPEATABLE READ"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dfe2690bdd3..5377d6208e1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3177,7 +3177,11 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.43.0

#10Tomas Vondra
tomas@vondra.me
In reply to: Yura Sokolov (#9)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.

1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:

WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]

I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.

2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get syntax error.

3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.

Anyway, some examples of the weird stuff implied by this approach:

- it forces "," between the options, which is a clear difference from
what we do for every other command

- it forces everything to be a string, i.e. you can't say "TIMEOUT 10",
it has to be "TIMEOUT '10'"

I don't have a very strong opinion on this, but the result seems a bit
strange to me.

4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:

On timeout, or if the server is promoted before
<parameter>lsn</parameter> is reached, an error is emitted,
as soon as <parameter>throw</parameter> is not specified or set to
true.
If <parameter>throw</parameter> is set to false, then the command
doesn't throw errors.

I find it a bit confusing. What is the use case for this mode?

5) One place in the docs says:

The target log sequence number to wait for.

This is literally the only place using "log sequence number" in our
code base, I'd just use "LSN" just like every other place.

6) The docs for the TIMEOUT parameter say this:

<varlistentry>
<term><replaceable class="parameter">timeout</replaceable></term>
<listitem>
<para>
When specified and greater than zero, the command waits until
<parameter>lsn</parameter> is reached or the specified
<parameter>timeout</parameter> has elapsed. Must be a non-
negative integer, the default is zero.
</para>
</listitem>
</varlistentry>

That doesn't say what unit the option uses. Is it seconds,
milliseconds or what?

In fact, it'd be nice to let users specify that in the value, similar
to other options (e.g. SET statement_timeout = '10s').

7) One place in the docs says this:

That is, after this function execution, the value returned by
<function>pg_last_wal_replay_lsn</function> should be greater ...

I think the reference to "function execution" is obsolete?

8) I find this confusing:

However, if <command>WAIT FOR</command> is
called on primary promoted from standby and <literal>lsn</literal>
was already replayed, then the <command>WAIT FOR</command> command
just exits immediately.

Does this mean running the WAIT command on a primary (after it was
already promoted) will exit immediately? Why does it matter that it
was promoted from a standby? Shouldn't it exit immediately even for
a standalone instance?

9) xlogwait.c

I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.

10) WaitForLSNReplay / WaitLSNWakeup

I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).

11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?

regards

--
Tomas Vondra

#11vignesh C
vignesh21@gmail.com
In reply to: Yura Sokolov (#9)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, 12 Mar 2025 at 20:14, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Otherwise v6 is just rebased v5.

I noticed that Tomas's comments from [1] are not yet addressed. I have
changed the commitfest status to Waiting on Author; please address
them and update it to Needs review.
[1]: /messages/by-id/09a98dc9-eeb1-471d-b990-072513c3d584@vondra.me

Regards,
Vignesh

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tomas Vondra (#10)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Tomas.

Thank you so much for your review! Please find the revised patchset.

On Thu, Mar 13, 2025 at 4:15 PM Tomas Vondra <tomas@vondra.me> wrote:

I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.

1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:

WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]

I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.

Sounds reasonable, changed to show the full syntax in the synopsis.

2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get syntax error.

Good catch, fixed.

3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.

Anyway, some examples of the weird stuff implied by this approach:

- it forces "," between the options, which is a clear difference from
what we do for every other command

- it forces everything to be a string, i.e. you can't say "TIMEOUT 10",
it has to be "TIMEOUT '10'"

I don't have a very strong opinion on this, but the result seems a bit
strange to me.

I've improved the syntax. I still tried to keep the number of new
keywords and grammar rules minimal. That leads to moving some parser
logic into wait.c. This is probably a bit awkward, but saves our
grammar from bloat. Let me know what you think about this
approach.

4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:

On timeout, or if the server is promoted before
<parameter>lsn</parameter> is reached, an error is emitted,
as soon as <parameter>throw</parameter> is not specified or set to
true.
If <parameter>throw</parameter> is set to false, then the command
doesn't throw errors.

I find it a bit confusing. What is the use case for this mode?

The idea here is that an application can handle these errors
without having to parse the error messages (parsing error
messages is inconvenient because of localization etc).
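
To illustrate that use case, here is a minimal client-side sketch in C. It assumes the application has already fetched the single status string returned by the non-throwing form of WAIT FOR; the three strings are the ones listed in the docs, while the enum and function names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical client-side mirror of the three documented result statuses. */
typedef enum
{
	CLIENT_WAIT_SUCCESS,
	CLIENT_WAIT_TIMEOUT,
	CLIENT_WAIT_NOT_IN_RECOVERY,
	CLIENT_WAIT_UNKNOWN			/* anything the client does not recognize */
} ClientWaitStatus;

/* Map the status text returned by WAIT FOR to an enum value. */
ClientWaitStatus
classify_wait_status(const char *status)
{
	if (status == NULL)
		return CLIENT_WAIT_UNKNOWN;
	if (strcmp(status, "success") == 0)
		return CLIENT_WAIT_SUCCESS;
	if (strcmp(status, "timeout") == 0)
		return CLIENT_WAIT_TIMEOUT;
	if (strcmp(status, "not in recovery") == 0)
		return CLIENT_WAIT_NOT_IN_RECOVERY;
	return CLIENT_WAIT_UNKNOWN;
}

/*
 * Returns 1 if the follow-up read can go to the standby, 0 if the
 * application should fall back (for example, to reading from the primary).
 */
int
can_read_from_standby(const char *status)
{
	return classify_wait_status(status) == CLIENT_WAIT_SUCCESS;
}
```

The point of the sketch is that the branch is taken on a fixed status value rather than on a localized error message.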

5) One place in the docs says:

The target log sequence number to wait for.

This is literally the only place using "log sequence number" in our
code base, I'd just use "LSN" just like every other place.

OK fixed.

6) The docs for the TIMEOUT parameter say this:

<varlistentry>
<term><replaceable class="parameter">timeout</replaceable></term>
<listitem>
<para>
When specified and greater than zero, the command waits until
<parameter>lsn</parameter> is reached or the specified
<parameter>timeout</parameter> has elapsed. Must be a non-
negative integer, the default is zero.
</para>
</listitem>
</varlistentry>

That doesn't say what unit the option uses. Is it seconds,
milliseconds or what?

In fact, it'd be nice to let users specify that in the value, similar
to other options (e.g. SET statement_timeout = '10s').

The default unit of milliseconds is now documented. Also, an alternative
way to specify the timeout is now supported: it may be a string
literal consisting of a number and a unit specifier.
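
As a rough standalone illustration of what "number plus unit, defaulting to milliseconds" could mean, here is a sketch in C. It is not the code from the patch: the function name is hypothetical, the accepted units (ms, s, min, h, d) are an assumed subset of the usual configuration time units, and fractional values are not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert "10", "10ms", "10s", "2min", "1h" or "1d"
 * to milliseconds.  A bare number is taken as milliseconds.  Returns -1 on
 * malformed input, a negative value, an unknown unit, or overflow.
 */
int64_t
timeout_to_ms(const char *text)
{
	char	   *end;
	long long	value;
	int64_t		multiplier;

	if (text == NULL || *text == '\0')
		return -1;

	errno = 0;
	value = strtoll(text, &end, 10);
	if (end == text || errno == ERANGE || value < 0)
		return -1;

	/* Allow spaces between the number and the unit, as in "10 s". */
	while (*end == ' ')
		end++;

	if (*end == '\0' || strcmp(end, "ms") == 0)
		multiplier = 1;
	else if (strcmp(end, "s") == 0)
		multiplier = 1000;
	else if (strcmp(end, "min") == 0)
		multiplier = 60 * 1000;
	else if (strcmp(end, "h") == 0)
		multiplier = 60 * 60 * 1000;
	else if (strcmp(end, "d") == 0)
		multiplier = 24LL * 60 * 60 * 1000;
	else
		return -1;

	/* Reject values that would overflow after scaling. */
	if (value > INT64_MAX / multiplier)
		return -1;

	return (int64_t) value * multiplier;
}
```

With this sketch, "10" and "10ms" both yield 10, while "10s" yields 10000.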

7) One place in the docs says this:

That is, after this function execution, the value returned by
<function>pg_last_wal_replay_lsn</function> should be greater ...

I think the reference to "function execution" is obsolete?

Actually, this is just the function that reports the current replay LSN,
not the function introduced by a previous version of this patch. We refer
to it just to express the constraint that the LSN must be replayed after
execution of the command.
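
For readers who want that constraint spelled out, here is a small standalone sketch in C. It relies only on the textual LSN format (two hexadecimal numbers separated by a slash, the high and the low 32 bits); the helper names are hypothetical, and the parser is slightly more lenient than the server's own pg_lsn input routine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse an LSN written as "X/Y" (hex high 32 bits,
 * slash, hex low 32 bits) into a 64-bit value.  Returns 1 on success and
 * 0 on malformed input; *lsn is left untouched on failure.
 */
int
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	if (text == NULL || lsn == NULL)
		return 0;
	if (sscanf(text, "%8X/%8X%n", &hi, &lo, &consumed) != 2)
		return 0;
	if (text[consumed] != '\0')
		return 0;				/* trailing garbage */

	*lsn = ((uint64_t) hi << 32) | lo;
	return 1;
}

/*
 * The condition expected to hold after a successful wait: the replay
 * position is at or beyond the target.
 */
int
lsn_replayed(uint64_t replay_lsn, uint64_t target_lsn)
{
	return replay_lsn >= target_lsn;
}
```

A client could parse both the target and the output of pg_last_wal_replay_lsn() this way and check lsn_replayed() as a sanity test.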

8) I find this confusing:

However, if <command>WAIT FOR</command> is
called on primary promoted from standby and <literal>lsn</literal>
was already replayed, then the <command>WAIT FOR</command> command
just exits immediately.

Does this mean running the WAIT command on a primary (after it was
already promoted) will exit immediately? Why does it matter that it
was promoted from a standby? Shouldn't it exit immediately even for
a standalone instance?

I think the previous sentence should give an idea that otherwise an error
gets thrown. That also happens immediately for sure.

9) xlogwait.c

I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.

OK, I've added the header comment.
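
As a rough picture of that design for readers of the thread, here is a simplified single-process sketch in C. It is not the patch's code: a fixed-size binary min-heap stands in for the pairing heap, there is no locking, latch setting, or batching of wake-ups, and all names are hypothetical. It only shows the LSN-ordered queue, the "minimum waited LSN" fast path, and the removal of waiters whose target has been replayed.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 64
#define NO_WAITERS UINT64_MAX	/* assumed sentinel: nobody is waiting */

typedef struct
{
	uint64_t	wait_lsn;		/* LSN this waiter needs replayed */
	int			procno;			/* whom to wake up */
} Waiter;

typedef struct
{
	Waiter		heap[MAX_WAITERS];	/* binary min-heap ordered by wait_lsn */
	int			n;
	uint64_t	min_waited_lsn; /* fast-path copy of heap[0].wait_lsn */
} WaitQueue;

static void
swap_waiters(Waiter *a, Waiter *b)
{
	Waiter		tmp = *a;

	*a = *b;
	*b = tmp;
}

void
queue_init(WaitQueue *q)
{
	q->n = 0;
	q->min_waited_lsn = NO_WAITERS;
}

/* Register a waiter; returns 0 if the queue is full. */
int
queue_add(WaitQueue *q, uint64_t wait_lsn, int procno)
{
	int			i;

	if (q->n >= MAX_WAITERS)
		return 0;
	i = q->n++;
	q->heap[i].wait_lsn = wait_lsn;
	q->heap[i].procno = procno;
	/* Sift the new entry up until its parent has a smaller or equal LSN. */
	while (i > 0 && q->heap[(i - 1) / 2].wait_lsn > q->heap[i].wait_lsn)
	{
		swap_waiters(&q->heap[(i - 1) / 2], &q->heap[i]);
		i = (i - 1) / 2;
	}
	q->min_waited_lsn = q->heap[0].wait_lsn;
	return 1;
}

static void
remove_top(WaitQueue *q)
{
	int			i = 0;

	q->heap[0] = q->heap[--q->n];
	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = 2 * i + 2;
		int			m = i;

		if (l < q->n && q->heap[l].wait_lsn < q->heap[m].wait_lsn)
			m = l;
		if (r < q->n && q->heap[r].wait_lsn < q->heap[m].wait_lsn)
			m = r;
		if (m == i)
			break;
		swap_waiters(&q->heap[i], &q->heap[m]);
		i = m;
	}
}

/*
 * Called after replay reaches current_lsn.  Stores the procno of every
 * waiter whose target is reached into woken[] (at least MAX_WAITERS slots)
 * and returns how many there were.
 */
int
queue_wakeup(WaitQueue *q, uint64_t current_lsn, int *woken)
{
	int			count = 0;

	/* Fast path: even the nearest target is still ahead of replay. */
	if (q->min_waited_lsn > current_lsn)
		return 0;

	while (q->n > 0 && q->heap[0].wait_lsn <= current_lsn)
	{
		woken[count++] = q->heap[0].procno;
		remove_top(q);
	}
	q->min_waited_lsn = (q->n > 0) ? q->heap[0].wait_lsn : NO_WAITERS;
	return count;
}
```

In the sketch, the replay side pays only one comparison per call when nobody is waiting or all targets are still ahead.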

10) WaitForLSNReplay / WaitLSNWakeup

I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).

Revised header comments for those functions too.

11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?

Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v6-0001-Implement-WAIT-FOR-command.patchapplication/x-patch; name=v6-0001-Implement-WAIT-FOR-command.patchDownload
From 11f1b1db81ff323354035dba34a34f5ac55177a3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v6] Implement WAIT FOR command

WAIT FOR is to be used on standby and waits for a specific WAL location
to be replayed.  This is useful when the user makes some data changes on
the primary and needs a guarantee that these changes are visible on the
standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 226 +++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 435 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 235 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  41 ++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/046_wait_for_lsn.pl       | 217 +++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1363 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/046_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..e29141c0538 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..ff3f309bc7c
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,226 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR ( <replaceable class="parameter">option</replaceable> [, ... ] ) ]
+ALTER ROLE <replaceable class="parameter">role_specification</replaceable> [ WITH ] <replaceable class="parameter">option</replaceable> [ ... ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+      LSN '<replaceable class="parameter">lsn</replaceable>'
+    | TIMEOUT <replaceable class="parameter">timeout</replaceable>
+    | NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is raised
+    unless <literal>NO_THROW</literal> is specified.
+    If <literal>NO_THROW</literal> is specified, the command instead
+    reports the outcome in its return value.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number of
+      milliseconds.  It can also be given as a string literal containing
+      an integer number of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specifies that no error is raised on timeout or when the command
+      is run on the primary.  In this case, the outcome is reported in
+      the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after the command completes successfully, the value returned by
+    <function>pg_last_wal_replay_lsn</function> is greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an asynchronous replica for reads
+    and the primary for writes.  In that case, the <acronym>LSN</acronym> of
+    the last modification should be stored on the client application side or
+    the connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be run on a standby.
+    If it is run on the primary, it raises an error
+    unless <literal>NO_THROW</literal> is specified.
+    However, if <command>WAIT FOR</command> is
+    run on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the command just
+    returns <literal>success</literal> immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a given <type>pg_lsn</type> value.  For example, an application could
+    update the <literal>movie</literal> table and then get the
+    <acronym>LSN</acronym> of the changes just made.  This example uses
+    <function>pg_current_wal_insert_lsn</function> on the primary server to
+    get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+   </programlisting>
+
+   Then the application could run <command>WAIT FOR</command> on the replica
+   with the <parameter>lsn</parameter> obtained from the primary.  Once the
+   command succeeds, the changes made on the primary are visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not
+    reached within the timeout.  In that case, an error is raised.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same wait with the <literal>NO_THROW</literal> option reports
+   the timeout in the return value instead of raising an error.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
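The Notes in wait_for.sgml say that the client application or the connection pooler has to remember the LSN of its last write. Below is a minimal sketch of that bookkeeping in plain C, assuming the usual textual pg_lsn form of two hexadecimal halves separated by a slash. The helper names parse_lsn and replica_is_caught_up are hypothetical; they are not part of this patch or of libpq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN given as "hi/lo" (two hexadecimal numbers of at most eight
 * digits each) into a 64-bit value.  Returns false on malformed input.
 * Hypothetical helper, for illustration only.
 */
static bool
parse_lsn(const char *str, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	assert(str != NULL && lsn != NULL);

	if (sscanf(str, "%8X/%8X%n", &hi, &lo, &consumed) != 2)
		return false;
	if (str[consumed] != '\0')
		return false;			/* trailing garbage */

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * True once the replica's replay position has reached the LSN of the
 * client's last write, i.e. a read routed there would see that write.
 */
static bool
replica_is_caught_up(const char *last_write_lsn, const char *replay_lsn)
{
	uint64_t	target;
	uint64_t	replayed;

	if (!parse_lsn(last_write_lsn, &target) ||
		!parse_lsn(replay_lsn, &replayed))
		return false;

	return replayed >= target;
}
```

A pooler could use such a comparison to skip WAIT FOR altogether when it already knows the replica's replay position is past the stored LSN; otherwise it would issue WAIT FOR with the stored text value.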
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..511e5531fb8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2831,6 +2832,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2d4c346473b..a0c98d9e801 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6361,6 +6362,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..2097271b2f8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1836,6 +1837,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..c2aee2d41f0
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,435 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on its latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its information in shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (the least
+	 * LSN is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  A single iteration should be enough in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As this is a time-consuming operation, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this function was most likely called on a primary
+		 * by mistake.  However, it's possible that the standby was promoted
+		 * concurrently with the call, while the target LSN was already
+		 * replayed.  So, we still check the last replay LSN before reporting.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * the target LSN gets replayed before we do so.  Another check at the
+	 * beginning of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
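To make the interplay of minWaitedLSN, the fast-path check and WaitLSNWakeup() easier to follow, here is a minimal single-process sketch. It is not the patch's implementation: a plain array scan replaces the pairing heap, a boolean flag stands in for SetLatch(), there is no locking, and all names (WaitState, add_waiter, wakeup, after_replay) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define INVALID_LSN ((uint64_t) 0)	/* stands in for InvalidXLogRecPtr */

typedef struct Waiter
{
	uint64_t	wait_lsn;		/* LSN this slot is waiting for */
	bool		in_use;			/* plays the role of inHeap */
	bool		woken;			/* stands in for SetLatch() */
} Waiter;

typedef struct WaitState
{
	uint64_t	min_waited_lsn; /* UINT64_MAX when nobody waits */
	Waiter		waiters[MAX_WAITERS];
} WaitState;

static void
wait_state_init(WaitState *state)
{
	state->min_waited_lsn = UINT64_MAX;
	for (int i = 0; i < MAX_WAITERS; i++)
		state->waiters[i] = (Waiter) {0, false, false};
}

/* Recompute the cached minimum from the registered waiters. */
static void
update_min(WaitState *state)
{
	uint64_t	min = UINT64_MAX;

	for (int i = 0; i < MAX_WAITERS; i++)
		if (state->waiters[i].in_use && state->waiters[i].wait_lsn < min)
			min = state->waiters[i].wait_lsn;
	state->min_waited_lsn = min;
}

static void
add_waiter(WaitState *state, int slot, uint64_t lsn)
{
	assert(slot >= 0 && slot < MAX_WAITERS && !state->waiters[slot].in_use);
	state->waiters[slot] = (Waiter) {lsn, true, false};
	update_min(state);
}

/*
 * Wake every waiter whose LSN is <= current_lsn, or all waiters when
 * current_lsn is INVALID_LSN.  Returns the number of waiters woken.
 */
static int
wakeup(WaitState *state, uint64_t current_lsn)
{
	int			woken = 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		Waiter	   *w = &state->waiters[i];

		if (!w->in_use)
			continue;
		if (current_lsn != INVALID_LSN && w->wait_lsn > current_lsn)
			continue;
		w->in_use = false;
		w->woken = true;
		woken++;
	}
	update_min(state);
	return woken;
}

/* What the replay loop does after each record: fast path first. */
static int
after_replay(WaitState *state, uint64_t replayed_lsn)
{
	if (replayed_lsn < state->min_waited_lsn)
		return 0;				/* nobody waits for an LSN this low */
	return wakeup(state, replayed_lsn);
}
```

The point of the cached minimum is that the common case, where no waiter is due, costs the replay loop a single comparison and no lock acquisition.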
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..784c779a252
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,235 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				elog(ERROR, "Unexpected param");
+
+			if (strcmp(name, "lsn") == 0)
+				curParam = WaitStmtParamLSN;
+			else if (strcmp(name, "timeout") == 0)
+				curParam = WaitStmtParamTimeout;
+			else if (strcmp(name, "no_throw") == 0)
+				throw = false;
+			else
+				elog(ERROR, "Unexpected param");
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				elog(ERROR, "Unexpected integer");
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				elog(ERROR, "Unexpected string");
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		timeout_ms;
+
+				if (!parse_real(str->sval, &timeout_ms, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+				timeout = (int64) timeout_ms;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command, not a function or a procedure.
+	 *
+	 * First, we should check that there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
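The option loop in ExecWaitStmt() is a small keyword/value state machine: a keyword either stands alone (NO_THROW) or arms the parser to expect one value (LSN, TIMEOUT). A self-contained sketch of that idea follows, under simplifying assumptions: tokens arrive pre-classified, the LSN is kept as unparsed text, and string timeouts are plain integer milliseconds with no units. The types Token and WaitOptions and the function parse_wait_options are hypothetical. The sketch also rejects a keyword left without a value, which is a choice made here and not something the patch is known to do.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { TOK_WORD, TOK_INTEGER, TOK_STRING } TokenKind;

typedef struct Token
{
	TokenKind	kind;
	const char *text;			/* for TOK_WORD and TOK_STRING */
	long		ival;			/* for TOK_INTEGER */
} Token;

typedef enum { PARAM_NONE, PARAM_LSN, PARAM_TIMEOUT } PendingParam;

typedef struct WaitOptions
{
	const char *lsn;			/* unparsed LSN text, NULL if not given */
	long		timeout_ms;
	bool		no_throw;
} WaitOptions;

/* Case-insensitive comparison of a word against a lowercase keyword. */
static bool
word_is(const char *word, const char *keyword)
{
	while (*word && *keyword)
	{
		if (tolower((unsigned char) *word) != *keyword)
			return false;
		word++;
		keyword++;
	}
	return *word == '\0' && *keyword == '\0';
}

/* Returns true and fills *out if the token list is a valid option list. */
static bool
parse_wait_options(const Token *toks, size_t ntoks, WaitOptions *out)
{
	PendingParam pending = PARAM_NONE;

	out->lsn = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < ntoks; i++)
	{
		const Token *tok = &toks[i];

		if (tok->kind == TOK_WORD)
		{
			if (pending != PARAM_NONE)
				return false;	/* keyword where a value was expected */
			if (word_is(tok->text, "lsn"))
				pending = PARAM_LSN;
			else if (word_is(tok->text, "timeout"))
				pending = PARAM_TIMEOUT;
			else if (word_is(tok->text, "no_throw"))
				out->no_throw = true;
			else
				return false;	/* unknown keyword */
		}
		else if (tok->kind == TOK_INTEGER)
		{
			if (pending != PARAM_TIMEOUT)
				return false;
			out->timeout_ms = tok->ival;
			pending = PARAM_NONE;
		}
		else					/* TOK_STRING */
		{
			if (pending == PARAM_LSN)
				out->lsn = tok->text;
			else if (pending == PARAM_TIMEOUT)
			{
				char	   *end;

				out->timeout_ms = strtol(tok->text, &end, 10);
				if (end == tok->text || *end != '\0')
					return false;
			}
			else
				return false;	/* value without a keyword */
			pending = PARAM_NONE;
		}
	}

	/* Sketch-only checks: no dangling keyword, LSN given, timeout sane. */
	return pending == PARAM_NONE && out->lsn != NULL && out->timeout_ms >= 0;
}
```

Because the options are position-independent, "TIMEOUT 100 LSN '0/306EE20'" and "LSN '0/306EE20' TIMEOUT 100" parse to the same result in this sketch.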
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3c4268b271a..5ff7157a12a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -672,6 +672,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -786,7 +789,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1117,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16364,6 +16368,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18023,6 +18046,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18680,6 +18704,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 00c76d05356..87411aece47 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
 	size = add_size(size, MemoryContextReportingShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -346,6 +348,7 @@ CreateOrAttachShmemStructs(void)
 	InjectionPointShmemInit();
 	AioShmemInit();
 	MemoryContextReportingShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f194e6b3dcc..c966acdbff0 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -948,6 +949,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 8164d0fbb4f..f4d37c0bfc2 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1195,10 +1195,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1793,7 +1794,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 930321905f1..164a16bc5d8 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -353,6 +354,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..15bddd9dba3
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4610fc61293..d06104d40ac 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4326,4 +4326,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..eb9de7dae00 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index cb983766c67..31b1e9bffcf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
       't/045_archive_restartpoint.pl',
+      't/046_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/046_wait_for_lsn.pl b/src/test/recovery/t/046_wait_for_lsn.pl
new file mode 100644
index 00000000000..f9446cce3f9
--- /dev/null
+++ b/src/test/recovery/t/046_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level of REPEATABLE READ or higher"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e5879e00dff..be191d3e2d9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3236,7 +3236,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#13Álvaro Herrera
alvherre@kurilemu.de
In reply to: Alexander Korotkov (#12)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

On 2025-Apr-29, Alexander Korotkov wrote:

11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?

Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.

This change made the code no longer compile, because
WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
access to the field definition. A rebased version with that change
reverted is attached.
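
For anyone following along, the failure mode is the usual one for an
opaque struct. Below is a minimal standalone sketch, not PostgreSQL
code: the XLogRecPtr typedef is a stand-in, the one-field struct layout
is a simplification (I am not reproducing the real definition), and the
accessor wait_lsn_get_min_waited() is a hypothetical alternative that
the patch does not use.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr; an assumption for this sketch. */
typedef uint64_t XLogRecPtr;

/*
 * If a header only carries this forward declaration, other files see an
 * incomplete type.  They can pass pointers around, but a field access
 * such as
 *
 *     waitLSNState->minWaitedLSN
 *
 * fails to compile ("dereferencing pointer to incomplete type").
 */
typedef struct WaitLSNState WaitLSNState;

/*
 * Option A: make the struct body visible to callers, i.e. keep it in the
 * header.  Simplified here to a single plain field.
 */
struct WaitLSNState
{
	XLogRecPtr	minWaitedLSN;
};

/*
 * Option B (hypothetical): keep the body private and export an accessor,
 * so callers never touch the field directly.
 */
XLogRecPtr	wait_lsn_get_min_waited(const WaitLSNState *state);

XLogRecPtr
wait_lsn_get_min_waited(const WaitLSNState *state)
{
	return state->minWaitedLSN;
}
```

The attached version takes the first route and keeps the definitions in
xlogwait.h; an accessor would be the other way to keep the struct
private, at the price of a function call on the replay path.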

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)

Attachments:

v7-0001-Implement-WAIT-FOR-command.patch (text/x-diff; charset=utf-8)
From 1f9b5c7427239a6dc43ccad31634687a9d9fcf35 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v7] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed.  This command is useful when the user makes
some data changes on the primary and needs a guarantee that these
changes are visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
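Note for readers (not part of the commit message): the waiter queue
described above can be sketched roughly as follows.  This is a
simplified standalone illustration under stated assumptions, not the
code in xlogwait.c: it uses a plain array-backed binary heap instead of
a pairing heap, ignores shared memory and locking, and replaces latch
setting with a "woken" flag.  All names here (Waiter, WaiterHeap,
waiter_heap_add, wake_waiters) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's type */

typedef struct Waiter
{
	XLogRecPtr	target_lsn;		/* LSN this waiter is waiting for */
	int			woken;			/* stand-in for setting the latch */
} Waiter;

#define MAX_WAITERS 64

typedef struct WaiterHeap
{
	Waiter	   *items[MAX_WAITERS]; /* min-heap ordered by target_lsn */
	size_t		count;
} WaiterHeap;

static void
swap_items(Waiter **a, Waiter **b)
{
	Waiter	   *tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Add a waiter; returns 0 on success, -1 if the heap is full. */
int
waiter_heap_add(WaiterHeap *heap, Waiter *w)
{
	size_t		i;

	if (heap->count >= MAX_WAITERS)
		return -1;
	i = heap->count++;
	heap->items[i] = w;
	/* Sift up until the parent has a smaller or equal target LSN. */
	while (i > 0)
	{
		size_t		parent = (i - 1) / 2;

		if (heap->items[parent]->target_lsn <= heap->items[i]->target_lsn)
			break;
		swap_items(&heap->items[parent], &heap->items[i]);
		i = parent;
	}
	return 0;
}

/* Remove the waiter with the smallest target LSN (heap must be non-empty). */
static void
waiter_heap_remove_first(WaiterHeap *heap)
{
	size_t		i = 0;

	heap->items[0] = heap->items[--heap->count];
	for (;;)
	{
		size_t		left = 2 * i + 1;
		size_t		right = 2 * i + 2;
		size_t		smallest = i;

		if (left < heap->count &&
			heap->items[left]->target_lsn < heap->items[smallest]->target_lsn)
			smallest = left;
		if (right < heap->count &&
			heap->items[right]->target_lsn < heap->items[smallest]->target_lsn)
			smallest = right;
		if (smallest == i)
			break;
		swap_items(&heap->items[i], &heap->items[smallest]);
		i = smallest;
	}
}

/*
 * Called after replay advances: wake every waiter whose target LSN has
 * been reached.  Returns the number of waiters woken.
 */
int
wake_waiters(WaiterHeap *heap, XLogRecPtr replayed_lsn)
{
	int			nwoken = 0;

	while (heap->count > 0 && heap->items[0]->target_lsn <= replayed_lsn)
	{
		heap->items[0]->woken = 1;
		waiter_heap_remove_first(heap);
		nwoken++;
	}
	return nwoken;
}
```

Because the nearest target LSN is always at the top, the replay side
only needs to compare against that one value to know whether anybody
has to be woken at all.
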
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 226 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 387 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 235 +++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  90 ++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/049_wait_for_lsn.pl       | 217 ++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1364 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..e29141c0538 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this:
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..ff3f309bc7c
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,226 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed, optionally within a specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">option</replaceable> [ ... ]
+
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+      LSN '<replaceable class="parameter">lsn</replaceable>'
+    | TIMEOUT <replaceable class="parameter">timeout</replaceable>
+    | NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified.
+    If <literal>NO_THROW</literal> is specified, then the command
+    reports these conditions through its return value instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number of
+      milliseconds.  It can also be given as a string literal containing an
+      integer number of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Do not throw an error on timeout or when
+      running on the primary.  In this case, the result status can be obtained
+      from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">success</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">not in recovery</replaceable></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that, the
+   changes made on the primary are guaranteed to be visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not reached
+    within the timeout.  In that case an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
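As a reading aid for the Notes and Examples above: the client (or pooler) has to remember the LSN from the primary and turn it into a WAIT FOR command for the standby. The following is only a sketch of that formatting step, not part of the patch. The helper name build_wait_for_sql() and its parameters are hypothetical; the command text follows the syntax shown in the examples above, and the LSN is printed in the same %X/%X form the server uses.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical client-side helper (not part of the patch): format a
 * WAIT FOR command for a 64-bit LSN.  The LSN is written as the high and
 * low 32-bit halves in hex, separated by a slash.  timeout_ms <= 0 means
 * that no TIMEOUT clause is emitted.
 *
 * Returns the snprintf() result, i.e. the length the full command needs,
 * excluding the terminating NUL.
 */
static int
build_wait_for_sql(char *buf, size_t buflen, uint64_t lsn,
				   long timeout_ms, bool no_throw)
{
	char		timeout_clause[32] = "";

	if (timeout_ms > 0)
		snprintf(timeout_clause, sizeof(timeout_clause),
				 " TIMEOUT %ld", timeout_ms);

	return snprintf(buf, buflen,
					"WAIT FOR LSN '%" PRIX32 "/%" PRIX32 "'%s%s;",
					(uint32_t) (lsn >> 32), (uint32_t) lsn,
					timeout_clause, no_throw ? " NO_THROW" : "");
}
```

An application would pass the resulting string to whatever it uses to run a statement on the standby connection, and then inspect either the error or the single-row result status.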
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9a4de1616bc..d03a9e15c99 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6361,6 +6362,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..64049f8e870
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,387 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it removes its information from shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by lsn, so that the waiter with the smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be enough to need only one iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As this is a time-consuming operation, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this command was mistakenly run on the primary.
+		 * However, it's possible that the standby was promoted concurrently
+		 * with the command, while the target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
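To make the two-level check described in the NOTES of xlogwait.c easier to follow, here is a small standalone model of it. This is only a sketch under loud assumptions: it uses a fixed array with a linear scan in place of the pairing heap, has no locking and no latches, and every name in it (WaitModel, model_add, model_replayed, and so on) is invented for illustration. Only the idea is taken from the patch: cache the least awaited LSN so that the startup process can skip the heap when nobody can be woken.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS	UINT64_MAX	/* plays the role of PG_UINT64_MAX */

/* Simplified stand-in for WaitLSNState; an array replaces the heap. */
typedef struct
{
	bool		in_use[MAX_WAITERS];
	uint64_t	wait_lsn[MAX_WAITERS];
	uint64_t	min_waited;		/* cached least awaited LSN */
} WaitModel;

static void
model_init(WaitModel *m)
{
	*m = (WaitModel) {0};
	m->min_waited = NO_WAITERS;
}

/* Recompute the cached minimum, as updateMinWaitedLSN() does for the heap. */
static void
model_update_min(WaitModel *m)
{
	m->min_waited = NO_WAITERS;
	for (int i = 0; i < MAX_WAITERS; i++)
		if (m->in_use[i] && m->wait_lsn[i] < m->min_waited)
			m->min_waited = m->wait_lsn[i];
}

/* A backend publishes the LSN it waits for in its own slot. */
static void
model_add(WaitModel *m, int slot, uint64_t lsn)
{
	m->in_use[slot] = true;
	m->wait_lsn[slot] = lsn;
	model_update_min(m);
}

/*
 * Called after each replayed record.  Returns the number of waiters
 * released; the real code would set their latches instead of counting.
 */
static int
model_replayed(WaitModel *m, uint64_t replay_lsn)
{
	int			woken = 0;

	/* Fast path: every waiter needs an LSN beyond replay_lsn. */
	if (replay_lsn < m->min_waited)
		return 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (m->in_use[i] && m->wait_lsn[i] <= replay_lsn)
		{
			m->in_use[i] = false;
			woken++;
		}
	}
	model_update_min(m);
	return woken;
}
```

The point of the cached minimum is that the common case, a replayed record that satisfies nobody, costs a single comparison rather than taking the lock and looking at the heap.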
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..784c779a252
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,235 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				elog(ERROR, "Unexpected param");
+
+			if (strcmp(name, "lsn") == 0)
+				curParam = WaitStmtParamLSN;
+			else if (strcmp(name, "timeout") == 0)
+				curParam = WaitStmtParamTimeout;
+			else if (strcmp(name, "no_throw") == 0)
+				throw = false;
+			else
+				elog(ERROR, "Unexpected param");
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				elog(ERROR, "Unexpected integer");
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				elog(ERROR, "Unexpected string");
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		timeout_ms;
+
+				if (!parse_real(str->sval, &timeout_ms, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+				timeout = (int64) timeout_ms;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first make sure
+	 * that we don't hold a snapshot and correspondingly our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a utility command, not a function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
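The option handling in ExecWaitStmt() is a small state machine: a keyword either stands alone (NO_THROW) or sets the kind of value expected next. The standalone sketch below models just that pairing logic. It is not the patch's code: the Tok, WaitOpts and parse_wait_opts names are invented, keywords are assumed to be lower-cased already, and string-valued timeouts with units are not modeled. Unlike the loop in the patch as posted, this sketch also rejects a trailing keyword that never gets its value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_WORD, TOK_INT, TOK_STRING } TokKind;

typedef struct
{
	TokKind		kind;
	const char *text;			/* used by TOK_WORD and TOK_STRING */
	long		ival;			/* used by TOK_INT */
} Tok;

typedef struct
{
	const char *lsn;			/* NULL until LSN '...' is seen */
	long		timeout;		/* milliseconds, 0 = wait forever */
	bool		no_throw;
} WaitOpts;

typedef enum { EXPECT_NONE, EXPECT_LSN, EXPECT_TIMEOUT } Expect;

/* Returns true on success, false on a malformed option list. */
static bool
parse_wait_opts(const Tok *toks, int ntoks, WaitOpts *out)
{
	Expect		expect = EXPECT_NONE;

	out->lsn = NULL;
	out->timeout = 0;
	out->no_throw = false;

	for (int i = 0; i < ntoks; i++)
	{
		const Tok  *t = &toks[i];

		if (t->kind == TOK_WORD)
		{
			if (expect != EXPECT_NONE)
				return false;	/* previous keyword still needs a value */
			if (strcmp(t->text, "lsn") == 0)
				expect = EXPECT_LSN;
			else if (strcmp(t->text, "timeout") == 0)
				expect = EXPECT_TIMEOUT;
			else if (strcmp(t->text, "no_throw") == 0)
				out->no_throw = true;
			else
				return false;	/* unknown keyword */
		}
		else if (t->kind == TOK_INT)
		{
			if (expect != EXPECT_TIMEOUT)
				return false;	/* integers are only valid as a timeout */
			out->timeout = t->ival;
			expect = EXPECT_NONE;
		}
		else
		{
			if (expect != EXPECT_LSN)
				return false;	/* sketch: only the LSN takes a string */
			out->lsn = t->text;
			expect = EXPECT_NONE;
		}
	}

	/* Sketch-only strictness: no dangling keyword, and LSN is mandatory. */
	return expect == EXPECT_NONE && out->lsn != NULL;
}
```

Whether a dangling keyword such as a bare trailing TIMEOUT should be an error in the real command is a design question for the patch; the sketch merely shows where such a check would sit.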
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..58ae9d7f350 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -352,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..8d10ece6e8e
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..49060877808 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..3b66af602f0 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -57,6 +57,7 @@ tests += {
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
       't/048_vacuum_horizon_floor.pl'
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..f9446cce3f9
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, the stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6f2e93b2d6..037cc85030f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3256,7 +3256,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5

#14Xuneng Zhou
xunengzhou@gmail.com
In reply to: Álvaro Herrera (#13)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

Thanks for working on this.

I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon. In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?

During a recent performance run [1] I noticed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
also hints that we could replace this check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by walsender:

/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when the wait is long. I built
a POC patch that swaps polling for CVs, but a single global CV (or
even separate “flush” and “replay” CVs) isn’t ideal:

The wake-up routines don’t know which LSN each waiter cares about, so
they’d have to broadcast on every flush/replay. Caching the minimum
outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
eliminate them—multiple backends might wait for different LSNs
simultaneously. A more precise solution would require a request queue
that maps waiters to target LSNs and issues targeted wake-ups, adding
complexity.

Walsender accepts the potential broadcast overhead by using two CVs
for different waiters, so it might be acceptable for
read_local_xlog_page_guts() as well. However, if WAIT FOR REPLAY
becomes available to backend code, we might leverage it to eliminate
the polling for waiting replay in read_local_xlog_page_guts() without
introducing a bespoke dispatcher. I’d appreciate any thoughts on
whether that use case is in scope.
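
For illustration, the targeted wake-up idea described above (a queue that maps waiters to target LSNs, plus a cached minimum for a cheap fast path) can be sketched in standalone C. This is a toy under stated assumptions, not the patch's code: it uses a small binary min-heap instead of PostgreSQL's pairingheap, it has no locking, atomics, or latches, and the names `Waiter`, `WaitQueue`, `wq_init`, `wq_add`, and `wq_wakeup` are hypothetical.

```c
#include <stdint.h>

#define MAX_WAITERS 64

/* One waiting process and the LSN it is waiting for (hypothetical type). */
typedef struct Waiter
{
	uint64_t	target_lsn;
	int			procno;
} Waiter;

/* Min-heap of waiters keyed by target LSN, plus a cached minimum. */
typedef struct WaitQueue
{
	Waiter		heap[MAX_WAITERS];
	int			n;
	uint64_t	min_waited_lsn; /* UINT64_MAX when nobody is waiting */
} WaitQueue;

static void
wq_init(WaitQueue *q)
{
	q->n = 0;
	q->min_waited_lsn = UINT64_MAX;
}

static void
wq_swap(Waiter *a, Waiter *b)
{
	Waiter		tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Register a waiter; returns 0 on success, -1 if the queue is full. */
static int
wq_add(WaitQueue *q, uint64_t lsn, int procno)
{
	int			i;

	if (q->n >= MAX_WAITERS)
		return -1;
	i = q->n++;
	q->heap[i].target_lsn = lsn;
	q->heap[i].procno = procno;
	/* Sift the new entry up until its parent is not larger. */
	while (i > 0 && q->heap[(i - 1) / 2].target_lsn > q->heap[i].target_lsn)
	{
		wq_swap(&q->heap[(i - 1) / 2], &q->heap[i]);
		i = (i - 1) / 2;
	}
	q->min_waited_lsn = q->heap[0].target_lsn;
	return 0;
}

/* Remove the top (smallest LSN) entry; caller ensures the heap is not empty. */
static void
wq_pop(WaitQueue *q)
{
	int			i = 0;

	q->heap[0] = q->heap[--q->n];
	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = 2 * i + 2;
		int			m = i;

		if (l < q->n && q->heap[l].target_lsn < q->heap[m].target_lsn)
			m = l;
		if (r < q->n && q->heap[r].target_lsn < q->heap[m].target_lsn)
			m = r;
		if (m == i)
			break;
		wq_swap(&q->heap[i], &q->heap[m]);
		i = m;
	}
}

/*
 * Called after WAL up to replayed_lsn has been applied.  Writes the procnos
 * of satisfied waiters into out[] (in LSN order) and returns their count.
 */
static int
wq_wakeup(WaitQueue *q, uint64_t replayed_lsn, int *out)
{
	int			woken = 0;

	/* Fast path: nobody is waiting for an LSN this small. */
	if (replayed_lsn < q->min_waited_lsn)
		return 0;
	while (q->n > 0 && q->heap[0].target_lsn <= replayed_lsn)
	{
		out[woken++] = q->heap[0].procno;
		wq_pop(q);
	}
	q->min_waited_lsn = (q->n > 0) ? q->heap[0].target_lsn : UINT64_MAX;
	return woken;
}
```

With waiters registered for LSNs 300, 100, and 200, a call to `wq_wakeup()` at LSN 50 takes the fast path and wakes nobody, while a call at LSN 200 wakes only the two waiters whose targets have been reached and leaves the third queued. That is the property a single broadcast CV cannot provide.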

Best,
Xuneng

[1]: /messages/by-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Álvaro Herrera (#13)
Re: Implement waiting for wal lsn replay: reloaded

Hello, Álvaro!

On Wed, Aug 6, 2025 at 6:01 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Apr-29, Alexander Korotkov wrote:

11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?

Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.

This change made the code no longer compile, because
WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
access to the field definition. A rebased version with that change
reverted is attached.

Thank you! The rebased version looks correct to me.

------
Regards,
Alexander Korotkov
Supabase

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#14)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng Zhou!

On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Thanks for working on this.

I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon.

Great. Thank you for your attention to this patch. I appreciate your
intention to review it.

In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?

Currently, WaitForLSNReplay() is assumed to be called only from a
backend, as the corresponding shmem is allocated only per-backend. But
there is absolutely no problem with tweaking the patch to allocate shmem
for every Postgres process. This would make it possible to call
WaitForLSNReplay() wherever it is needed. There is also no problem
extending this approach to support other kinds of LSNs, not just the
replay LSN.

During a recent performance run [1] I noticed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
also hints that we could replace this check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by walsender:

/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when the wait is long. I built
a POC patch that swaps polling for CVs, but a single global CV (or
even separate “flush” and “replay” CVs) isn’t ideal:

The wake-up routines don’t know which LSN each waiter cares about, so
they’d have to broadcast on every flush/replay. Caching the minimum
outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
eliminate them—multiple backends might wait for different LSNs
simultaneously. A more precise solution would require a request queue
that maps waiters to target LSNs and issues targeted wake-ups, adding
complexity.

Walsender accepts the potential broadcast overhead by using two CVs
for different waiters, so it might be acceptable for
read_local_xlog_page_guts() as well. However, if WAIT FOR REPLAY
becomes available to backend code, we might leverage it to eliminate
the polling for waiting replay in read_local_xlog_page_guts() without
introducing a bespoke dispatcher. I’d appreciate any thoughts on
whether that use case is in scope.

This looks like a great new use-case for facilities developed in this
patch! I'll remove the restriction to use WaitForLSNReplay() only in
backend. I think you can write a patch with additional pairing heap
for flush LSN and include that into thread about
read_local_xlog_page_guts() optimization. Let me know if you need any
assistance.

------
Regards,
Alexander Korotkov
Supabase

#17Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#16)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander!

In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?

Currently, WaitForLSNReplay() is assumed to be called only from a
backend, as the corresponding shmem is allocated only per-backend. But
there is absolutely no problem with tweaking the patch to allocate shmem
for every Postgres process. This would make it possible to call
WaitForLSNReplay() wherever it is needed. There is also no problem
extending this approach to support other kinds of LSNs, not just the
replay LSN.

Thanks for extending the functionality of the Wait For Replay patch!

This looks like a great new use-case for facilities developed in this
patch! I'll remove the restriction to use WaitForLSNReplay() only in
backend. I think you can write a patch with additional pairing heap
for flush LSN and include that into thread about
read_local_xlog_page_guts() optimization. Let me know if you need any
assistance.

This could be a more elegant approach which would solve the polling
issue well. I'll prepare a follow-up patch for it.

Best,
Xuneng

#18Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#16)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Thanks for working on this.

I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon.

Great. Thank you for your attention to this patch. I appreciate your
intention to review it.

I did a quick pass over v7. There are a few thoughts to share—mostly
around documentation, build, and tests, plus some minor nits. The core
logic looks solid to me. I’ll take a deeper look as I work on a
follow‑up patch to add waiting for flush LSNs. And the patch seems to
need a rebase; it can't be applied to HEAD cleanly for now.

Build
1) Consider adding a comma in `src/test/recovery/meson.build` after
`'t/048_vacuum_horizon_floor.pl'` so the list remains valid.

Core code
2) It may be safer for `WaitLSNWakeup()` to assert against the stack array size,
perhaps `Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);`
rather than `MaxBackends`.

For option parsing UX in `wait.c`, we might prefer:
3) Using `ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR),
errmsg(...)))` instead of `elog(ERROR, ...)` for consistency and
translatability.
4) Explicitly rejecting duplicate `LSN`/`TIMEOUT` options with a syntax error.
5) The result column label could align better with other utility
outputs if shortened to `status` (lowercase, no space).
6) After `parse_real()`, it could help to validate/clamp the timeout
to avoid overflow when converting to `int64` and when passing a `long`
to `WaitLatch()`.
7) If `nodes/print.h` in `src/backend/commands/wait.c` isn’t used, we
might drop the include.
8) A couple of comment nits: “do it this outside” → “do this outside”.

Tests
9) We might consider adding cases for:
- Negative `TIMEOUT` (to exercise the error path).
- Syntax errors (unknown option; duplicate `LSN`/`TIMEOUT`; missing `LSN`).

Documentation
`doc/src/sgml/ref/wait_for.sgml`
10) The index term could be updated to `<primary>WAIT FOR</primary>`.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
12) The purpose line might be smoother as “wait for a target LSN to be
replayed, optionally with a timeout”.
13) Return values might use `<literal>` for `success`, `timeout`, `not
in recovery`.
14) Consistently calling this a “command” (rather than
function/procedure) could reduce confusion.
15) The example text might read more cleanly as “If the target LSN is
not reached before the timeout …”.

`doc/src/sgml/high-availability.sgml`
16) The sentence could read “However, it is possible to address this
without switching to synchronous replication.”

`src/backend/utils/activity/wait_event_names.txt`
17) The description for `WAIT_FOR_WAL_REPLAY` might be clearer as
“Waiting for WAL replay to reach a target LSN on a standby.”
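
Point 6 above could look roughly like the following. This is a minimal standalone sketch, not code from the patch: `timeout_ms_from_double` is a hypothetical helper name, it assumes the timeout has already been parsed into a `double` number of milliseconds, and it uses `int64_t` from `<stdint.h>` where PostgreSQL code would use `int64`.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a parsed timeout in milliseconds to an integer that is safe to
 * hand to an API taking a long.  Returns false for NaN or negative input
 * (the caller would report an error); values too large to represent are
 * clamped to LONG_MAX instead of overflowing in the cast.
 */
static bool
timeout_ms_from_double(double ms, int64_t *result)
{
	/* NaN is the only value that compares unequal to itself. */
	if (ms != ms || ms < 0.0)
		return false;

	/*
	 * (double) LONG_MAX may round up to 2^63 on 64-bit platforms, so use >=
	 * here; anything strictly below it can be cast without overflow.
	 */
	if (ms >= (double) LONG_MAX)
	{
		*result = (int64_t) LONG_MAX;
		return true;
	}

	*result = (int64_t) ms;		/* truncates the fractional part */
	return true;
}
```

Clamping rather than erroring on huge values is only one possible choice; rejecting them with an out-of-range error would be equally reasonable and is arguably more explicit.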

Best,
Xuneng

#19Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#16)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi all,

I did a rebase for the patch to v8 and incorporated a few changes:

1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.

I also started a new thread [1] and attached a patch addressing the polling
issue in read_local_xlog_page_guts(), built on the infrastructure of patch v8.

[1]: /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com

Feedback welcome.

Best,
Xuneng

Attachments:

v8-0001-Implement-WAIT-FOR-command.patch (application/x-patch)
From 4487999a6c393e42619ae77e5e7f14c6cac9f235 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 09:12:38 +0800
Subject: [PATCH v8] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 219 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 284 +++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  90 ++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 269 ++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +-
 30 files changed, 1457 insertions(+), 15 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..ecaff5d5deb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..433901baa82
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,219 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted
+    unless <literal>NO_THROW</literal> is specified.
+    If <literal>NO_THROW</literal> is specified, the command
+    reports the outcome in its return value instead of throwing an error.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> may be given as an integer number of
+      milliseconds.  It may also be given as a string literal containing an
+      integer number of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specifies that no error is thrown in the case of a timeout or when
+      running on the primary.  In this case the result status can be obtained
+      from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    errors out unless <literal>NO_THROW</literal> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a given <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> right after
+    the changes are made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not reached
+    within the timeout.  In that case an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The following example repeats the same wait using <command>WAIT FOR</command>
+   with the <literal>NO_THROW</literal> option.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..f5257dfa689 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, then walk
+			 * over the shared memory pairing heap and set latches to notify
+			 * the waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..2cc9312e836
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on its latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its information in shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const		WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const		WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have already been
+		 * replayed.  As this is a time-consuming operation, we do it outside
+		 * of WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to the timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..cfa42ad6f6c
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+}			WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	bool		o_lsn = false;
+	bool		o_timeout = false;
+	bool		o_no_throw = false;
+
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected parameter after \"%s\"", name)));
+
+			if (strcmp(name, "lsn") == 0)
+			{
+				if (o_lsn)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "lsn")));
+				o_lsn = true;
+				curParam = WaitStmtParamLSN;
+			}
+			else if (strcmp(name, "timeout") == 0)
+			{
+				if (o_timeout)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "timeout")));
+				o_timeout = true;
+				curParam = WaitStmtParamTimeout;
+			}
+			else if (strcmp(name, "no_throw") == 0)
+			{
+				if (o_no_throw)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "no_throw")));
+				o_no_throw = true;
+				throw = false;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized parameter \"%s\"", name)));
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected integer value")));
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected string value")));
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		result;
+
+				if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+
+				/*
+				 * Get rid of any fractional part in the input. This is so we don't fail
+				 * on just-out-of-range values that would round into range.
+				 */
+				result = rint(result);
+
+				/* Range check */
+				if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("timeout value is out of range for type bigint")));
+
+				timeout = (int64) result;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unexpected parameter type")));
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is
+	 * a utility command rather than a function or procedure.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, the
+	 * preparation is complete.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt * stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5427da5bc1b..ee20a48b2c5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -352,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..72be2f76293
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+}			WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+}			WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (the least
+	 * LSN is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState * waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ef9e5f0c0be
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt * stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..b8d3fc009fb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+}			WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..da1cfeb1c52
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,269 @@
+# Checks waiting for LSN replay on a standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+	stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/,
+	"get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;",
+	stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/,
+	"get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql('postgres', "WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid value for timeout option/,
+	"get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, the log_count() function logs a message if at least as many rows
+# are visible as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..f303f04d007 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -615,7 +615,6 @@ DatumTupleFields
 DbInfo
 DbInfoArr
 DbLocaleInfo
-DbOidName
 DeClonePtrType
 DeadLockState
 DeallocateStmt
@@ -2283,7 +2282,6 @@ PlannerParamItem
 Point
 Pointer
 PolicyInfo
-PolyNumAggState
 Pool
 PopulateArrayContext
 PopulateArrayState
@@ -4129,6 +4127,7 @@ tar_file
 td_entry
 teSection
 temp_tablespaces_extra
+test128
 test_re_flags
 test_regex_ctx
 test_shm_mq_header
@@ -4198,6 +4197,7 @@ varatt_expanded
 varattrib_1b
 varattrib_1b_e
 varattrib_4b
+vartag_external
 vbits
 verifier_context
 walrcv_alter_slot_fn
@@ -4326,7 +4326,6 @@ xmlGenericErrorFunc
 xmlNodePtr
 xmlNodeSetPtr
 xmlParserCtxtPtr
-xmlParserErrors
 xmlParserInputPtr
 xmlSaveCtxt
 xmlSaveCtxtPtr
-- 
2.49.0
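
For readers skimming the patch above: the comments in xlogwait.h describe a fast path in which the replay side compares the replayed LSN against minWaitedLSN before touching the waiters' heap. Below is a minimal, single-threaded sketch of that idea. It is not the patch's code: the names (Waiter, add_waiter, wakeup, min_waited_lsn) are hypothetical, a linear scan stands in for the pairing heap, a boolean flag stands in for setting a latch, and there is no WaitLSNLock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS UINT64_MAX	/* sentinel: nobody is waiting */

/* Hypothetical per-process slot; the patch uses WaitLSNProcInfo instead. */
typedef struct
{
	uint64_t	wait_lsn;	/* LSN this waiter wants replayed */
	bool		waiting;	/* slot is currently in the wait set */
	bool		woken;		/* stand-in for setting the process latch */
} Waiter;

static Waiter waiters[MAX_WAITERS];
static _Atomic uint64_t min_waited_lsn = NO_WAITERS;

/* Recompute the smallest awaited LSN (the heap root in the real patch). */
static void
recompute_min(void)
{
	uint64_t	min = NO_WAITERS;

	for (int i = 0; i < MAX_WAITERS; i++)
		if (waiters[i].waiting && waiters[i].wait_lsn < min)
			min = waiters[i].wait_lsn;
	atomic_store(&min_waited_lsn, min);
}

/* Register a waiter in the given slot. */
static void
add_waiter(int slot, uint64_t lsn)
{
	assert(slot >= 0 && slot < MAX_WAITERS);
	waiters[slot] = (Waiter) {.wait_lsn = lsn, .waiting = true, .woken = false};
	recompute_min();
}

/* Called after replaying up to replayed_lsn; returns the number woken. */
static int
wakeup(uint64_t replayed_lsn)
{
	int			nwoken = 0;

	/* Fast path: no waiter's target has been reached yet. */
	if (atomic_load(&min_waited_lsn) > replayed_lsn)
		return 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (waiters[i].waiting && waiters[i].wait_lsn <= replayed_lsn)
		{
			waiters[i].waiting = false;
			waiters[i].woken = true;
			nwoken++;
		}
	}
	recompute_min();
	return nwoken;
}
```

The point of the sketch is only that the common case (nothing to wake) costs one atomic read; everything else about concurrency is left out here.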

#20Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#19)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

I did a rebase for the patch to v8 and incorporated a few changes:

1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.

Started a new thread [1] and attached a patch addressing the polling
issue in the function read_local_xlog_page_guts, built on the
infrastructure of patch v8.

[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com

Feedback is welcome.

Thank you for reviewing and revising this patch.

I see you've integrated most of the points you raised in [1]. I went
through them and integrated the rest, except this one.

11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]

I didn't find examples of how we do similar things in other places
in the docs.  This is why I decided to leave this place as it
currently is.

Also, I found some mess in typedefs.list.  I've reverted the
changes to typedefs.list and re-indented the sources.

I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.
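
To make the trade-off concrete, here is a rough standalone sketch of that
split.  It is not the code from the patch: Tok, WaitOptions and
parse_wait_options() are hypothetical names that only mimic how a generic
option list (keywords, integers, string constants) can be validated after
parsing by a small state machine.  Requiring LSN to be present at the end is
an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the parser's String, Integer and A_Const nodes. */
typedef enum { TOK_KEYWORD, TOK_INTEGER, TOK_STRING } TokKind;

typedef struct
{
	TokKind		kind;
	const char *text;			/* keyword name or string literal */
	long		ival;			/* used only by TOK_INTEGER */
} Tok;

/* Which keyword, if any, is still waiting for its value. */
typedef enum { PARAM_NONE, PARAM_TIMEOUT, PARAM_LSN } Param;

typedef struct
{
	const char *lsn;			/* raw LSN literal, not parsed here */
	const char *timeout_str;	/* timeout given as a string, or NULL */
	long		timeout_ms;		/* timeout given as an integer */
	bool		no_throw;
} WaitOptions;

/* Case-insensitive comparison against an already lower-case name. */
bool
keyword_is(const char *s, const char *lower)
{
	for (; *s && *lower; s++, lower++)
		if (tolower((unsigned char) *s) != *lower)
			return false;
	return *s == '\0' && *lower == '\0';
}

/* Returns true and fills *out if the list is valid, false otherwise. */
bool
parse_wait_options(const Tok *toks, size_t n, WaitOptions *out)
{
	Param		cur = PARAM_NONE;
	bool		have_lsn = false;
	bool		have_timeout = false;
	bool		have_no_throw = false;

	assert(out != NULL);
	out->lsn = NULL;
	out->timeout_str = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < n; i++)
	{
		const Tok  *t = &toks[i];

		if (t->kind == TOK_KEYWORD)
		{
			/* The previous keyword still needs its value. */
			if (cur != PARAM_NONE)
				return false;

			if (keyword_is(t->text, "lsn") && !have_lsn)
			{
				have_lsn = true;
				cur = PARAM_LSN;
			}
			else if (keyword_is(t->text, "timeout") && !have_timeout)
			{
				have_timeout = true;
				cur = PARAM_TIMEOUT;
			}
			else if (keyword_is(t->text, "no_throw") && !have_no_throw)
			{
				have_no_throw = true;
				out->no_throw = true;	/* takes no value */
			}
			else
				return false;	/* unknown or duplicated keyword */
		}
		else if (t->kind == TOK_INTEGER)
		{
			/* Only TIMEOUT accepts a bare integer. */
			if (cur != PARAM_TIMEOUT)
				return false;
			out->timeout_ms = t->ival;
			cur = PARAM_NONE;
		}
		else
		{
			/* String values are accepted by LSN and TIMEOUT. */
			if (cur == PARAM_LSN)
				out->lsn = t->text;
			else if (cur == PARAM_TIMEOUT)
				out->timeout_str = t->text;
			else
				return false;
			cur = PARAM_NONE;
		}
	}

	/* A dangling keyword or a missing LSN is rejected (sketch assumption). */
	return cur == PARAM_NONE && have_lsn;
}
```

All ordering and duplicate checks live in this one function, so the grammar
needs no per-option rules; the cost is that errors a dedicated grammar would
catch at parse time are caught here instead.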

Links.
1. /messages/by-id/CABPTF7VsoGDMBq34MpLrMSZyxNZvVbgH6-zxtJOg5AwOoYURbw@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v9-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 70fff63c02e85a197b727da1657bd24595fc8132 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 24 Aug 2025 20:10:37 +0800
Subject: [PATCH v9] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 218 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 284 +++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 281 +++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1474 insertions(+), 12 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..328ce7fe8ed
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,218 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    reports the outcome through its result status instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number of
+      milliseconds.  It can also be given as a string literal containing
+      an integer number of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specify to not throw an error in the case of timeout or
+      running on the primary.  In this case the result status can be obtained
+      from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on standby.
+    That is, after this command execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency, while using async replica for reads and
+    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example uses <command>WAIT FOR</command> with
+   <parameter>NO_THROW</parameter> option.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..7a078730e28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..e709b7392cf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the startup
+ *		process, timeout is reached, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As this is a time-consuming operation, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so we just can potentially set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..1d59ddd81aa
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	bool		haveLsn = false;
+	bool		haveTimeout = false;
+	bool		haveNoThrow = false;
+
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected parameter after \"%s\"", name)));
+
+			if (strcmp(name, "lsn") == 0)
+			{
+				if (haveLsn)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "lsn")));
+				haveLsn = true;
+				curParam = WaitStmtParamLSN;
+			}
+			else if (strcmp(name, "timeout") == 0)
+			{
+				if (haveTimeout)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "timeout")));
+				haveTimeout = true;
+				curParam = WaitStmtParamTimeout;
+			}
+			else if (strcmp(name, "no_throw") == 0)
+			{
+				if (haveNoThrow)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "no_throw")));
+				haveNoThrow = true;
+				throw = false;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized parameter \"%s\"", name)));
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected integer value")));
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected string value")));
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		result;
+
+				if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+
+				/*
+				 * Get rid of any fractional part in the input. This is so we
+				 * don't fail on just-out-of-range values that would round
+				 * into range.
+				 */
+				result = rint(result);
+
+				/* Range check */
+				if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("timeout value is out of range for type bigint")));
+
+				timeout = (int64) result;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unexpected parameter type")));
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first make sure that
+	 * we don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we
+	 * should be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Report success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..8675dfd2e99 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16407,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18051,6 +18074,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up any pending wait for LSN.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 5f442bc3bd4..398f4d2b363 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2060,6 +2068,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2115,6 +2126,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3092,6 +3106,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3690,6 +3708,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path check of whether any waiters need waking after replaying a
+	 * WAL record.  Read without a lock; updates are protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9d06b5c060f
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,281 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+	stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/, "get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/, "get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid value for timeout option/,
+	"get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# is finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN
+# we have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit, the command can be sent to a session
+# already closed.  So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..49dab055752 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3255,7 +3255,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)
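
The multi-waiter test near the end of the patch above exercises the wake-up path: waiters are kept ordered by target LSN, and after replay advances, every waiter whose LSN has been reached is removed and notified in bounded batches. A rough standalone sketch of that idea follows. It is not the patch's code: it swaps in a plain array-based binary min-heap for PostgreSQL's pairing heap, omits locking and latches, and all names (Waiter, WaiterHeap, heap_push, heap_pop, wake_waiters, MAX_WAITERS, WAKEUP_BATCH) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS  64			/* hypothetical capacity of the sketch */
#define WAKEUP_BATCH 16			/* plays the role of WAKEUP_PROC_STATIC_ARRAY_SIZE */

typedef struct Waiter
{
	uint64_t	wait_lsn;		/* LSN this waiter needs replayed */
	int			procno;			/* identifies the waiting process */
} Waiter;

typedef struct WaiterHeap
{
	Waiter		items[MAX_WAITERS];
	size_t		count;
} WaiterHeap;

static void
heap_swap(Waiter *a, Waiter *b)
{
	Waiter		tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Add a waiter; returns 0 on success, -1 if the heap is full. */
static int
heap_push(WaiterHeap *h, uint64_t lsn, int procno)
{
	size_t		i;

	if (h->count == MAX_WAITERS)
		return -1;
	i = h->count++;
	h->items[i].wait_lsn = lsn;
	h->items[i].procno = procno;
	/* Sift up so that the smallest LSN stays at index 0. */
	while (i > 0)
	{
		size_t		parent = (i - 1) / 2;

		if (h->items[parent].wait_lsn <= h->items[i].wait_lsn)
			break;
		heap_swap(&h->items[parent], &h->items[i]);
		i = parent;
	}
	return 0;
}

/* Remove and return the waiter with the smallest LSN. */
static Waiter
heap_pop(WaiterHeap *h)
{
	Waiter		top;
	size_t		i = 0;

	assert(h->count > 0);
	top = h->items[0];
	h->items[0] = h->items[--h->count];
	/* Sift down to restore the min-heap property. */
	for (;;)
	{
		size_t		l = 2 * i + 1;
		size_t		r = 2 * i + 2;
		size_t		s = i;

		if (l < h->count && h->items[l].wait_lsn < h->items[s].wait_lsn)
			s = l;
		if (r < h->count && h->items[r].wait_lsn < h->items[s].wait_lsn)
			s = r;
		if (s == i)
			break;
		heap_swap(&h->items[i], &h->items[s]);
		i = s;
	}
	return top;
}

/*
 * Remove every waiter with wait_lsn <= replayed_lsn, at most WAKEUP_BATCH per
 * pass, and record their procnos in woken[] (which must hold MAX_WAITERS
 * entries).  Returns the number of waiters removed.  In a real server the
 * inner while loop would run under a lock and the recording step would set
 * latches after releasing it.
 */
static size_t
wake_waiters(WaiterHeap *h, uint64_t replayed_lsn, int *woken)
{
	size_t		total = 0;
	size_t		batch_len;

	do
	{
		int			batch[WAKEUP_BATCH];

		batch_len = 0;
		while (h->count > 0 && batch_len < WAKEUP_BATCH)
		{
			if (h->items[0].wait_lsn > replayed_lsn)
				break;			/* nearest waiter is not reached yet */
			batch[batch_len++] = heap_pop(h).procno;
		}
		for (size_t i = 0; i < batch_len; i++)
			woken[total++] = batch[i];
		/* A full batch means more reached waiters may remain. */
	} while (batch_len == WAKEUP_BATCH);

	return total;
}
```

Because the heap top is always the nearest target LSN, a single comparison against the top is enough to decide that nobody else needs waking, which is what makes a cached minimum such as minWaitedLSN a cheap fast-path check.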

#21Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#20)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

I did a rebase for the patch to v8 and incorporated a few changes:

1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.

Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.

[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com

Feedback welcome.

Thank you for reviewing and revising this patch.

I see you've integrated most of your points expressed in [1]. I went
through them and I've integrated the rest of them. Except this one.

11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]

I didn't find examples of how we do similar things in other places
of the docs. This is why I decided to leave this place as it currently
is.

+1. I re-checked other commands with similar parameter patterns, and
they follow the approach in v9.

Also, I found a mess-up in typedefs.list. I've reverted the
changes to typedefs.list and re-indented the sources.

Thanks for catching and fixing that.

I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
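
As a rough illustration of that split (a standalone sketch, not the code in wait.c): the grammar hands over a flat list of name/value pairs, and the command code decides which names are valid, rejects duplicates, and converts values. Everything here is hypothetical, including the WaitOption and WaitParams types, the function name, and the error strings. The sketch also assumes option names arrive already lowercased and accepts only an integer number of milliseconds for the timeout, so unit suffixes such as '0.1 s' are not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One generic option as the parser might hand it over (hypothetical). */
typedef struct WaitOption
{
	const char *name;			/* lowercased option name */
	const char *value;			/* textual argument, or NULL if none */
} WaitOption;

/* Interpreted parameters of the command (hypothetical). */
typedef struct WaitParams
{
	int64_t		timeout_ms;		/* 0 means wait indefinitely */
	bool		no_throw;
} WaitParams;

/*
 * Interpret a generic option list.  Returns NULL on success, or a static
 * error message describing the first problem found.
 */
static const char *
interpret_wait_options(const WaitOption *opts, size_t nopts, WaitParams *out)
{
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	assert(out != NULL);
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "timeout") == 0)
		{
			char	   *end;
			long long	ms;

			if (timeout_seen)
				return "TIMEOUT specified more than once";
			timeout_seen = true;
			if (opts[i].value == NULL)
				return "TIMEOUT requires a value";

			/* Accept only a complete, non-negative integer. */
			errno = 0;
			ms = strtoll(opts[i].value, &end, 10);
			if (errno != 0 || end == opts[i].value || *end != '\0' || ms < 0)
				return "invalid TIMEOUT value";
			out->timeout_ms = (int64_t) ms;
		}
		else if (strcmp(opts[i].name, "no_throw") == 0)
		{
			if (no_throw_seen)
				return "NO_THROW specified more than once";
			no_throw_seen = true;
			if (opts[i].value != NULL)
				return "NO_THROW takes no value";
			out->no_throw = true;
		}
		else
			return "unrecognized WAIT FOR option";
	}

	return NULL;
}
```

The grammar stays unaware of individual option names, so adding another option later would touch only the interpretation function.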

Best,
Xuneng

#22Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#21)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Sun, Sep 14, 2025 at 4:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

I did a rebase for the patch to v8 and incorporated a few changes:

1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.

Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.

[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com

Feedback welcome.

Thank you for reviewing and revising this patch.

I see you've integrated most of your points expressed in [1]. I went
through them and I've integrated the rest of them. Except this one.

11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]

I didn't find examples of how we do similar things in other places
of the docs. This is why I decided to leave this place as it currently
is.

+1. I re-checked other commands with similar parameter patterns, and
they follow the approach in v9.

Also, I found a mess-up in typedefs.list. I've reverted the
changes to typedefs.list and re-indented the sources.

Thanks for catching and fixing that.

I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

The attached revision of the patch contains a fix for the typo in the
comment you reported off-list.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v10-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 63c1d54b6a2933167271277dc6ed3c3af70dd703 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 24 Aug 2025 20:10:37 +0800
Subject: [PATCH v10] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for
a specific WAL location to be replayed.  This command is useful when
the user makes some data changes on the primary and needs a guarantee
that these changes are visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 218 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 284 +++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 281 +++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1474 insertions(+), 12 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..328ce7fe8ed
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,218 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted
+    unless <literal>NO_THROW</literal> is specified.
+    If <literal>NO_THROW</literal> is specified, the command
+    reports the outcome through its return value instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number of
+      milliseconds.  It can also be given as a string literal containing
+      an integer number of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specifies that no error is thrown in the case of a timeout or
+      when running on the primary.  In this case the result status can be
+      obtained from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes successfully, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <literal>NO_THROW</literal> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example uses <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..7a078730e28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..e709b7392cf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its information in shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for the waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be large enough that one iteration suffices in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As these are time-consuming operations, we do them outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we set the latch of the wrong process (or of
+		 * no process).
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in recovery,
+ * or if the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was
+			 * already replayed.  See the comment on deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited on timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
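The timeout handling in WaitForLSNReplay() above can be sketched outside the
patch as follows.  This is only an illustration, not part of the patch: the
deadline is computed once, and every loop iteration first rechecks the
condition and then derives the remaining delay from that deadline, so spurious
wakeups never extend the total wait.  SimEnv and wait_for_position() are
hypothetical names; a simulated clock and replay position stand in for
GetCurrentTimestamp(), GetXLogReplayRecPtr() and WaitLatch().

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WAIT_SUCCESS, WAIT_TIMEOUT } WaitResult;

/* Hypothetical stand-in for the environment: a clock and a replay position. */
typedef struct SimEnv
{
	int64_t		now_ms;		/* simulated clock */
	uint64_t	replayed;	/* simulated replay position */
	int64_t		tick_ms;	/* time consumed by each wakeup */
	uint64_t	step;		/* replay progress made per wakeup */
} SimEnv;

/*
 * Wait until env->replayed reaches 'target'.  timeout_ms == 0 means wait
 * without a time limit, mirroring the "timeout > 0" checks in the patch.
 */
static WaitResult
wait_for_position(SimEnv *env, uint64_t target, int64_t timeout_ms)
{
	int64_t		deadline;

	assert(timeout_ms >= 0);
	deadline = env->now_ms + timeout_ms;	/* fixed once, before the loop */

	for (;;)
	{
		/* Recheck the condition first; it may already hold. */
		if (target <= env->replayed)
			return WAIT_SUCCESS;

		/* Derive the remaining delay from the fixed deadline. */
		if (timeout_ms > 0 && deadline - env->now_ms <= 0)
			return WAIT_TIMEOUT;

		/* Simulated sleep: one wakeup advances the clock and the replay. */
		env->now_ms += env->tick_ms;
		env->replayed += env->step;
	}
}
```

In the real function the sleep is bounded by the remaining delay, so it cannot
overshoot the deadline the way a fixed simulated tick could.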
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..ffcc0bbf457
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	bool		haveLsn = false;
+	bool		haveTimeout = false;
+	bool		haveNoThrow = false;
+
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected parameter after \"%s\"", name)));
+
+			if (strcmp(name, "lsn") == 0)
+			{
+				if (haveLsn)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "lsn")));
+				haveLsn = true;
+				curParam = WaitStmtParamLSN;
+			}
+			else if (strcmp(name, "timeout") == 0)
+			{
+				if (haveTimeout)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "timeout")));
+				haveTimeout = true;
+				curParam = WaitStmtParamTimeout;
+			}
+			else if (strcmp(name, "no_throw") == 0)
+			{
+				if (haveNoThrow)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "no_throw")));
+				haveNoThrow = true;
+				throw = false;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized parameter \"%s\"", name)));
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected integer value")));
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected string value")));
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		result;
+
+				if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+
+				/*
+				 * Get rid of any fractional part in the input. This is so we
+				 * don't fail on just-out-of-range values that would round
+				 * into range.
+				 */
+				result = rint(result);
+
+				/* Range check */
+				if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("timeout value is out of range for type bigint")));
+
+				timeout = (int64) result;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unexpected parameter type")));
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first make sure we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records,
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  That completes the
+	 * preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
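The option handling in ExecWaitStmt() above is a small state machine over a
flat token list: a name token selects the parameter that the next value token
fills in, and a value token resets the state.  The sketch below illustrates
that idea only and is not part of the patch.  Token, WaitOpts and
parse_wait_opts() are hypothetical names; names are assumed to be lower-cased
already, and string-valued timeouts with units are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NAME, TOK_INT, TOK_STRING } TokKind;
typedef struct { TokKind kind; const char *text; long ival; } Token;
typedef enum { PARAM_NONE, PARAM_LSN, PARAM_TIMEOUT } Param;
typedef struct { const char *lsn; long timeout_ms; bool no_throw; } WaitOpts;

/* Returns NULL on success, or a static error message on failure. */
static const char *
parse_wait_opts(const Token *toks, int ntoks, WaitOpts *out)
{
	Param		cur = PARAM_NONE;	/* parameter awaiting its value */
	bool		have_lsn = false;
	bool		have_timeout = false;

	assert(out != NULL);
	out->lsn = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (int i = 0; i < ntoks; i++)
	{
		const Token *t = &toks[i];

		if (t->kind == TOK_NAME)
		{
			/* A name may not follow another name that still needs a value. */
			if (cur != PARAM_NONE)
				return "unexpected parameter";
			if (strcmp(t->text, "lsn") == 0)
			{
				if (have_lsn)
					return "parameter specified more than once";
				have_lsn = true;
				cur = PARAM_LSN;
			}
			else if (strcmp(t->text, "timeout") == 0)
			{
				if (have_timeout)
					return "parameter specified more than once";
				have_timeout = true;
				cur = PARAM_TIMEOUT;
			}
			else if (strcmp(t->text, "no_throw") == 0)
			{
				if (out->no_throw)
					return "parameter specified more than once";
				out->no_throw = true;	/* flag only, takes no value */
			}
			else
				return "unrecognized parameter";
		}
		else if (t->kind == TOK_INT)
		{
			if (cur != PARAM_TIMEOUT)
				return "unexpected integer value";
			out->timeout_ms = t->ival;
			cur = PARAM_NONE;
		}
		else
		{
			if (cur != PARAM_LSN)
				return "unexpected string value";
			out->lsn = t->text;
			cur = PARAM_NONE;
		}
	}

	return out->lsn ? NULL : "\"lsn\" must be specified";
}
```

Keeping the grammar generic and validating in the executor is what lets the
patch get by with a single new unreserved keyword.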
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..8675dfd2e99 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16407,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18051,6 +18074,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 5f442bc3bd4..398f4d2b363 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2060,6 +2068,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2115,6 +2126,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3092,6 +3106,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3690,6 +3708,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9d06b5c060f
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,281 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"new data is visible on standby after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far ahead, so that WAL records generated by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+	stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/, "get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+	stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/, "get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid value for timeout option/,
+	"get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs a message if at least as many rows as
+# expected are visible.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN
+# we have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..49dab055752 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3255,7 +3255,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#23Álvaro Herrera
alvherre@kurilemu.de
In reply to: Alexander Korotkov (#22)
Re: Implement waiting for wal lsn replay: reloaded

On 2025-Sep-15, Alexander Korotkov wrote:

> It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
> PUBLICATION - all use minimal grammar rules that produce generic
> option lists, with the actual interpretation done in their respective
> implementation files. The moderate complexity in wait.c seems
> acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
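
A minimal sketch of the kind of loop suggested above, written in plain C so it stands alone. Option, WaitOptions, and parse_wait_options are hypothetical stand-ins: real backend code would iterate over a List of DefElem nodes and report problems with ereport(). The option names follow the TIMEOUT and NO_THROW options discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for DefElem: a name optionally followed by a value. */
typedef struct Option
{
	const char *name;
	const char *value;			/* NULL when the option takes no argument */
} Option;

/* Hypothetical result of interpreting the option list. */
typedef struct WaitOptions
{
	bool		timeout_given;
	const char *timeout;		/* raw text; unit parsing is out of scope */
	bool		no_throw;
} WaitOptions;

/*
 * Walk the option list once, rejecting duplicates and unknown names.
 * Returns NULL on success, or a static message describing the first problem.
 */
static const char *
parse_wait_options(const Option *opts, int nopts, WaitOptions *out)
{
	assert(out != NULL);

	out->timeout_given = false;
	out->timeout = NULL;
	out->no_throw = false;

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "timeout") == 0)
		{
			if (out->timeout_given)
				return "option specified more than once";
			if (opts[i].value == NULL)
				return "timeout requires a value";
			out->timeout_given = true;
			out->timeout = opts[i].value;
		}
		else if (strcmp(opts[i].name, "no_throw") == 0)
		{
			if (out->no_throw)
				return "option specified more than once";
			out->no_throw = true;
		}
		else
			return "unrecognized option";
	}

	return NULL;
}
```

With a single loop like this, duplicate and unknown options are rejected in one place, and adding a new option later only needs one more branch.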

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#24Xuneng Zhou
xunengzhou@gmail.com
In reply to: Álvaro Herrera (#23)
Re: Implement waiting for wal lsn replay: reloaded

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

> On 2025-Sep-15, Alexander Korotkov wrote:
>
>> It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
>> PUBLICATION - all use minimal grammar rules that produce generic
>> option lists, with the actual interpretation done in their respective
>> implementation files. The moderate complexity in wait.c seems
>> acceptable.
>
> Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
> lists of DefElem (a name optionally followed by a value) instead of
> individual scattered elements that must later be matched up. Why not
> use utility_option_list instead and then loop on the list of DefElems?
> It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

> Also, we've found that failing to surround the options by parens leads
> to pain down the road, so maybe add that. Given that the LSN seems to
> be mandatory, maybe make it something like
>
> WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
>
> This requires that you make LSN a keyword, albeit unreserved. Or you
> could make it
> WAIT FOR Ident [the rest]
> and then ensure in C that the identifier matches the word LSN, such as
> we do for "permissive" and "restrictive" in
> RowSecurityDefaultPermissive.

Shall make LSN an unreserved keyword as well.

Best,
Xuneng

#25Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#24)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

> Hi Álvaro,
>
> Thanks for your review.
>
> On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
>> On 2025-Sep-15, Alexander Korotkov wrote:
>>
>>> It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
>>> PUBLICATION - all use minimal grammar rules that produce generic
>>> option lists, with the actual interpretation done in their respective
>>> implementation files. The moderate complexity in wait.c seems
>>> acceptable.
>>
>> Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
>> lists of DefElem (a name optionally followed by a value) instead of
>> individual scattered elements that must later be matched up. Why not
>> use utility_option_list instead and then loop on the list of DefElems?
>> It'd be a lot simpler.
>
> I took a look at commands like VACUUM and EXPLAIN and they do follow
> this pattern. v11 will make use of utility_option_list.
>
>> Also, we've found that failing to surround the options by parens leads
>> to pain down the road, so maybe add that. Given that the LSN seems to
>> be mandatory, maybe make it something like
>>
>> WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
>>
>> This requires that you make LSN a keyword, albeit unreserved. Or you
>> could make it
>> WAIT FOR Ident [the rest]
>> and then ensure in C that the identifier matches the word LSN, such as
>> we do for "permissive" and "restrictive" in
>> RowSecurityDefaultPermissive.
>
> Shall make LSN an unreserved keyword as well.

Here's the updated v11. Many thanks to Jian for off-list discussions and review.
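
For readers who want the gist before opening the attachment: the commit message describes an LSN-ordered heap of waiters plus a cached minimum awaited LSN, which gives the startup process a cheap fast-path check after each replayed record. A rough standalone sketch of that idea follows. It assumes a fixed-size binary min-heap rather than the shared-memory pairing heap the patch uses, it omits locking and latches, and all names (WaitQueue, queue_add, queue_wakeup) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 64
#define NO_WAITERS UINT64_MAX	/* cached minimum when nobody is waiting */

typedef struct Waiter
{
	uint64_t	lsn;			/* LSN this waiter needs replayed */
	int			id;				/* hypothetical waiter identifier */
} Waiter;

typedef struct WaitQueue
{
	Waiter		heap[MAX_WAITERS];	/* binary min-heap ordered by lsn */
	int			n;
	uint64_t	min_waited;		/* cached smallest awaited LSN */
} WaitQueue;

static void
swap_waiters(Waiter *a, Waiter *b)
{
	Waiter		tmp = *a;

	*a = *b;
	*b = tmp;
}

static void
queue_init(WaitQueue *q)
{
	q->n = 0;
	q->min_waited = NO_WAITERS;
}

/* Insert a waiter, sift it up, and refresh the cached minimum. */
static void
queue_add(WaitQueue *q, uint64_t lsn, int id)
{
	int			i;

	assert(q->n < MAX_WAITERS);
	i = q->n++;
	q->heap[i].lsn = lsn;
	q->heap[i].id = id;
	while (i > 0 && q->heap[(i - 1) / 2].lsn > q->heap[i].lsn)
	{
		swap_waiters(&q->heap[(i - 1) / 2], &q->heap[i]);
		i = (i - 1) / 2;
	}
	q->min_waited = q->heap[0].lsn;
}

/* Remove and return the waiter with the smallest LSN. */
static Waiter
queue_pop_min(WaitQueue *q)
{
	Waiter		top = q->heap[0];
	int			i = 0;

	q->heap[0] = q->heap[--q->n];
	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = l + 1;
		int			m = i;

		if (l < q->n && q->heap[l].lsn < q->heap[m].lsn)
			m = l;
		if (r < q->n && q->heap[r].lsn < q->heap[m].lsn)
			m = r;
		if (m == i)
			break;
		swap_waiters(&q->heap[i], &q->heap[m]);
		i = m;
	}
	q->min_waited = (q->n > 0) ? q->heap[0].lsn : NO_WAITERS;
	return top;
}

/*
 * Called after replaying up to 'replayed'.  Stores the ids of satisfied
 * waiters in 'woken' (at most max_woken) and returns how many were found.
 */
static int
queue_wakeup(WaitQueue *q, uint64_t replayed, int *woken, int max_woken)
{
	int			count = 0;

	/* Fast path: nobody is waiting for an LSN this small. */
	if (replayed < q->min_waited)
		return 0;

	while (q->n > 0 && q->heap[0].lsn <= replayed && count < max_woken)
		woken[count++] = queue_pop_min(q).id;
	return count;
}
```

The cached minimum is what keeps the common case cheap: most replayed records satisfy no waiter, so the heap is not touched at all.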

Best,
Xuneng

Attachments:

v11-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 0ee9a9275cd811f70a49560e0715556820fb81be Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v11] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for a specific WAL location
to be replayed.  This command is useful when the user makes some data
changes on the primary and needs a guarantee that these changes are visible
on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 234 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 212 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  33 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  22 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   8 +
 src/include/parser/kwlist.h                   |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 293 +++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1435 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    With <literal>NO_THROW</literal>, the command reports the outcome
+    through its return value instead of raising an error.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal
+          containing an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of timeout or
+          running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an asynchronous replica for reads
+    and the primary for writes.  In that case, the <acronym>LSN</acronym> of
+    the last modification should be stored on the client application side or
+    the connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a given <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, then walk
+			 * over the shared memory heap of waiters and set latches to
+			 * notify them.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its entry in the shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the on-stack array of procs to wake up in WaitLSNWakeup().  It
+ * should be large enough that a single iteration suffices in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes until we find an LSN
+		 * not yet replayed.  Record the process numbers to wake up, but to
+		 * avoid holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As this is a time-consuming operation, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we set the wrong process' (or no process')
+		 * latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this function was mistakenly called on a primary.
+		 * However, it's possible that the standby was promoted concurrently
+		 * with the call, with the target LSN already replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * the target LSN gets replayed before we are added.  Another check at the
+	 * beginning of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first make sure
+	 * that we don't hold a snapshot and, correspondingly, that our
+	 * MyProc->xmin is invalid.  Otherwise, our snapshot could prevent the
+	 * replay of WAL records, implying a kind of self-deadlock.  This is the
+	 * reason why WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up LSN waiting state, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path check of whether any waiters need waking after replaying a
+	 * WAL record.  Can be read lock-free; updates are protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well ahead, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs whether as many rows as expected are
+# visible.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..ac0252936be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,7 +3257,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.51.0

#26Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#25)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
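
To make the suggestion concrete, here is a minimal stand-alone sketch of the "loop over a list of name/value options" pattern. It is not the patch's code: WaitOption, WaitParams and parse_wait_options are hypothetical stand-ins for DefElem and the logic in wait.c, the error strings are only borrowed from the TAP test expectations, and the timeout is assumed to be a plain integer number of milliseconds (values with units such as '0.1s' are not handled here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name with an optional value. */
typedef struct WaitOption
{
	const char *name;
	const char *value;			/* NULL when the option takes no argument */
} WaitOption;

/* Hypothetical result of interpreting the option list. */
typedef struct WaitParams
{
	long		timeout_ms;		/* 0 means wait indefinitely */
	bool		no_throw;
} WaitParams;

/*
 * Walk the option list once, rejecting duplicates and unknown names.
 * Returns NULL on success, or an error message otherwise.
 */
static const char *
parse_wait_options(const WaitOption *opts, size_t nopts, WaitParams *out)
{
	static char errbuf[128];
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	assert(out != NULL);
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		const WaitOption *opt = &opts[i];

		if (strcmp(opt->name, "timeout") == 0)
		{
			char	   *end;
			long		ms;

			if (timeout_seen)
				return "conflicting or redundant options";
			timeout_seen = true;

			if (opt->value == NULL)
				return "invalid timeout value";
			/* Assumption: integer milliseconds only, no unit suffix. */
			ms = strtol(opt->value, &end, 10);
			if (end == opt->value || *end != '\0')
				return "invalid timeout value";
			if (ms < 0)
				return "timeout cannot be negative";
			out->timeout_ms = ms;
		}
		else if (strcmp(opt->name, "no_throw") == 0)
		{
			if (no_throw_seen)
				return "conflicting or redundant options";
			no_throw_seen = true;
			out->no_throw = true;
		}
		else
		{
			snprintf(errbuf, sizeof(errbuf),
					 "option \"%s\" not recognized", opt->name);
			return errbuf;
		}
	}
	return NULL;
}
```

In the backend the same shape would presumably iterate the DefElem list with foreach() and report problems through ereport(); the point of the sketch is only that one loop handles both validation and interpretation, so no positional matching of scattered elements is needed.
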

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

I'll make LSN an unreserved keyword as well.

Here's the updated v11. Many thanks to Jian for the off-list discussions and review.

v12 removes the unused WaitStmt and WaitStmtParam entries from
pgindent/typedefs.list.

Best,
Xuneng

Attachments:

v12-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From d6fbbb3b0ad81c18657e6fafa50852bc9bf239e2 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v12] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 234 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 212 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  33 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  22 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   8 +
 src/include/parser/kwlist.h                   |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 293 +++++++++++++
 src/tools/pgindent/typedefs.list              |   3 +
 30 files changed, 1433 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal containing
+          an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of timeout or
+          running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on standby.
+    That is, after this command execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency, while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its information in shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast-path check minWaitedLSN > replayLSN.  If this check fails, it
+ *		walks waitersHeap and wakes up the backends whose awaited LSNs
+ *		have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the on-stack array of procs that WaitLSNWakeup() wakes up per
+ * iteration.  One iteration should be enough in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes till we find LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have already been
+		 * replayed.  Since setting latches is time-consuming, we do this
+		 * outside of WaitLSNLock.  This is actually fine because procLatch
+		 * isn't ever freed, so at worst we can set the wrong process' (or
+		 * no process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run during
+ * recovery, or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this command was mistakenly run on a primary.
+		 * However, it's possible that the standby was promoted concurrently
+		 * with the command, after the target LSN was replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path check of whether we need to wake up any waiters after
+	 * replaying a WAL record.  Read lock-free; updated under WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..e6ff42b9ea0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,6 +3257,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0

#27Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#26)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

Shall make LSN an unreserved keyword as well.
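
For illustration only, the "loop on the list of DefElems" approach discussed above can be sketched in standalone C. This is not the code from the patch: `Opt`, `WaitOptions`, `ParseStatus` and `parse_wait_options` are hypothetical stand-ins (a real implementation would walk a List of DefElem nodes and report problems with ereport), and the timeout here is assumed to be a plain integer number of milliseconds, without unit parsing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: an option name with an optional value. */
typedef struct Opt
{
	const char *name;
	const char *value;			/* NULL when the option takes no argument */
} Opt;

typedef enum ParseStatus
{
	PARSE_OK,
	PARSE_DUPLICATE,			/* the same option was given more than once */
	PARSE_UNKNOWN,				/* option name not recognized */
	PARSE_BAD_VALUE				/* missing, malformed or negative timeout */
} ParseStatus;

typedef struct WaitOptions
{
	long		timeout_ms;		/* 0 means wait indefinitely */
	bool		no_throw;
} WaitOptions;

/*
 * Walk the option list once, matching each name and rejecting duplicates
 * and unknown names.  Returns PARSE_OK and fills *out on success.
 */
ParseStatus
parse_wait_options(const Opt *opts, size_t nopts, WaitOptions *out)
{
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "timeout") == 0)
		{
			char	   *end;
			long		v;

			if (timeout_seen)
				return PARSE_DUPLICATE;
			timeout_seen = true;

			if (opts[i].value == NULL)
				return PARSE_BAD_VALUE;

			/* Assumption: integer milliseconds only, no units. */
			v = strtol(opts[i].value, &end, 10);
			if (end == opts[i].value || *end != '\0' || v < 0)
				return PARSE_BAD_VALUE;
			out->timeout_ms = v;
		}
		else if (strcmp(opts[i].name, "no_throw") == 0)
		{
			if (no_throw_seen)
				return PARSE_DUPLICATE;
			no_throw_seen = true;
			out->no_throw = true;
		}
		else
			return PARSE_UNKNOWN;
	}

	return PARSE_OK;
}
```

With this shape, adding another option later means adding one more branch to the loop rather than touching the grammar, which is the main appeal of the generic option list.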

Here's the updated v11. Many thanks Jian for off-list discussions and review.

v12 removed the unused WaitStmt and WaitStmtParam entries from
pgindent/typedefs.list.

Hi, I’ve split the patch into multiple patch sets for easier review,
per Michael’s advice [1].

[1]: /messages/by-id/aOMsv9TszlB1n-W7@paquier.xyz

Best,
Xuneng

Attachments:

v13-0003-Implement-WAIT-FOR-command.patch (application/x-patch)
From c3dd9972d8043c07247bb3e2b476026268ee1bad Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 20:50:04 +0800
Subject: [PATCH v13 3/3] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  27 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 ++++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   5 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 293 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 951 insertions(+), 13 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number of
+          milliseconds.  It can also be given as a string literal containing an
+          integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of timeout or
+          running on the primary.  In this case the result status can be obtained from
+          the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency, while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with
+   the <parameter>NO_THROW</parameter> option.
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index a114738bddf..7c8134f1209 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -373,9 +373,10 @@ WaitLSNCleanup(void)
  * or replica got promoted before the target LSN replayed.
  */
 WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 {
 	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
 	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
 
 	/* Shouldn't be called when shmem isn't initialized */
@@ -404,6 +405,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 			return WAIT_LSN_RESULT_SUCCESS;
 	}
 
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
 	/*
 	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
@@ -438,6 +445,18 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 				break;
 		}
 
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
@@ -464,6 +483,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	 */
 	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
 	return WAIT_LSN_RESULT_SUCCESS;
 }
 
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first make sure that
+	 * we don't hold a snapshot and, correspondingly, that MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, causing a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, the
+	 * preparation is done.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 21caf2d43bf..1d016df1f6b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..f8685fa9039 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -947,6 +947,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 441bf475b4d..2e33a1d22d0 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
 	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
 	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
 										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
 } WaitLSNResult;
 
 /*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
 extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
 extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc09d1a3f03..c741099e186 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records generated by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We start 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs a message if at least the expected number
+# of rows is visible.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN, then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for an already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0

v13-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/x-patch)
From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v13 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 525 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 647 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4faed65765c
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,525 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by internal WAL reading operations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on its latch until it is woken up by the appropriate
+ *		process, the standby is promoted, or the postmaster dies.  Then, it
+ *		cleans up the information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up, allocated on the stack by
+ * wakeupWaiters().  It should be large enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * Since setting latches is time-consuming, we do this outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  The target LSN might get
+	 * replayed before we are added; the check at the beginning of the loop
+	 * below covers that race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already have been deleted by the startup process.  The 'inReplayHeap'
+	 * flag protects us from double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap.  We might
+	 * already have been deleted by the waker process.  The 'inFlushHeap'
+	 * flag protects us from double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay and flush waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0
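
The wake-up path described in the patch above (a cached minimum for the fast-path check, then draining reached waiters in fixed-size batches) can be sketched in standalone C. This is a simplified illustration under stated assumptions, not the patch's code: a small sorted array stands in for the pairing heap, no lock is taken, and woken process ids are recorded in a caller-supplied array instead of calling SetLatch(). WaiterQueue, queue_add(), queue_wakeup(), MAX_WAITERS and BATCH_SIZE are hypothetical names; BATCH_SIZE plays the role of WAKEUP_PROC_STATIC_ARRAY_SIZE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 64
#define BATCH_SIZE 4			/* stands in for WAKEUP_PROC_STATIC_ARRAY_SIZE */

typedef uint64_t Lsn;

typedef struct WaiterQueue
{
	_Atomic(Lsn) min_waited;	/* UINT64_MAX when nobody is waiting */
	size_t		count;
	Lsn			lsn[MAX_WAITERS];	/* kept sorted in ascending order */
	int			id[MAX_WAITERS];
} WaiterQueue;

static void
queue_init(WaiterQueue *q)
{
	atomic_init(&q->min_waited, UINT64_MAX);
	q->count = 0;
}

/* Recompute the cached minimum from the front of the sorted array. */
static void
refresh_min(WaiterQueue *q)
{
	atomic_store(&q->min_waited, q->count > 0 ? q->lsn[0] : UINT64_MAX);
}

/* Insert a waiter, keeping the array sorted.  Returns -1 if the queue is full. */
static int
queue_add(WaiterQueue *q, int id, Lsn lsn)
{
	size_t		i;

	if (q->count == MAX_WAITERS)
		return -1;
	for (i = q->count; i > 0 && q->lsn[i - 1] > lsn; i--)
	{
		q->lsn[i] = q->lsn[i - 1];
		q->id[i] = q->id[i - 1];
	}
	q->lsn[i] = lsn;
	q->id[i] = id;
	q->count++;
	refresh_min(q);
	return 0;
}

/* Collect ids of waiters whose LSN is <= current; returns how many were woken. */
static size_t
queue_wakeup(WaiterQueue *q, Lsn current, int *woken, size_t max_woken)
{
	size_t		total = 0;
	size_t		n;
	size_t		i;

	/* Fast path: the smallest awaited LSN has not been reached yet. */
	if (atomic_load(&q->min_waited) > current)
		return 0;

	do
	{
		int			batch[BATCH_SIZE];

		/* A real implementation would hold the lock from here ... */
		for (n = 0; n < BATCH_SIZE && n < q->count && q->lsn[n] <= current; n++)
			batch[n] = q->id[n];
		for (i = 0; i + n < q->count; i++)
		{
			q->lsn[i] = q->lsn[i + n];
			q->id[i] = q->id[i + n];
		}
		q->count -= n;
		refresh_min(q);
		/* ... to here, and only then wake the collected waiters. */

		for (i = 0; i < n; i++)
		{
			assert(total < max_woken);
			woken[total++] = batch[i];
		}
	} while (n == BATCH_SIZE);	/* a full batch means there may be more */

	return total;
}
```

Passing a woken array of MAX_WAITERS entries is always sufficient. The loop repeats only when a batch was filled completely, which mirrors the do/while structure in wakeupWaiters().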

v13-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/x-patch)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v13 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
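
The split between allocating and initializing introduced by the patch above can be illustrated outside PostgreSQL. The following is a minimal sketch, assuming malloc() in place of palloc() and an ordinary struct in place of a shared memory segment. Heap, Node, heap_initialize(), heap_allocate(), Waiter and SharedState are hypothetical names, and only add/first are implemented. The comparator returns 1 when the first LSN is smaller so that the smallest LSN ends up on top, as in the waitlsn_*_cmp functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Node
{
	struct Node *first_child;
	struct Node *next_sibling;
} Node;

typedef int (*Comparator) (const Node *a, const Node *b, void *arg);

typedef struct Heap
{
	Comparator	compare;
	void	   *arg;
	Node	   *root;
} Heap;

/* Initialize a heap in place; the caller owns the memory (e.g. a shared struct). */
static void
heap_initialize(Heap *heap, Comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate process-local memory, then reuse the in-place initializer. */
static Heap *
heap_allocate(Comparator compare, void *arg)
{
	Heap	   *heap = malloc(sizeof(Heap));

	if (heap != NULL)
		heap_initialize(heap, compare, arg);
	return heap;
}

/* Link two sub-heaps; the node that compares greater becomes the parent. */
static Node *
heap_merge(Heap *heap, Node *a, Node *b)
{
	if (a == NULL)
		return b;
	if (b == NULL)
		return a;
	if (heap->compare(a, b, heap->arg) < 0)
	{
		Node	   *tmp = a;

		a = b;
		b = tmp;
	}
	b->next_sibling = a->first_child;
	a->first_child = b;
	return a;
}

static void
heap_add(Heap *heap, Node *node)
{
	node->first_child = NULL;
	node->next_sibling = NULL;
	heap->root = heap_merge(heap, heap->root, node);
}

static Node *
heap_first(Heap *heap)
{
	return heap->root;
}

/* A waiter embeds its heap node, so no per-node allocation is needed. */
typedef struct Waiter
{
	uint64_t	lsn;
	Node		node;
} Waiter;

/* Inverted comparison: the smaller LSN wins, so the minimum sits on top. */
static int
waiter_cmp(const Node *a, const Node *b, void *arg)
{
	const Waiter *wa = (const Waiter *) ((const char *) a - offsetof(Waiter, node));
	const Waiter *wb = (const Waiter *) ((const char *) b - offsetof(Waiter, node));

	(void) arg;
	if (wa->lsn < wb->lsn)
		return 1;
	if (wa->lsn > wb->lsn)
		return -1;
	return 0;
}

/* Stand-in for a shared memory struct with the heap embedded in it. */
typedef struct SharedState
{
	Heap		waiters;
	Waiter		slots[4];
} SharedState;
```

A caller would run heap_initialize(&state.waiters, waiter_cmp, NULL) once on the embedded heap and then add &state.slots[i].node entries; heap_allocate() remains available for process-local heaps.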

#28Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#27)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Oct 14, 2025 at 9:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
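
To make the suggested pattern concrete: looping over a list of name/value options could look roughly like the standalone sketch below. It is only an illustration under stated assumptions, not the code in wait.c. Option, WaitOptions and parse_wait_options() are hypothetical stand-ins for DefElem handling, the timeout is simplified to a plain integer number of milliseconds, and errors are returned as strings rather than raised with ereport().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name with an optional value. */
typedef struct Option
{
	const char *name;
	const char *value;			/* NULL for a bare flag such as no_throw */
} Option;

typedef struct WaitOptions
{
	long		timeout_ms;		/* 0 means no timeout */
	bool		no_throw;
} WaitOptions;

/*
 * Walk the option list once, rejecting duplicates and unknown names.
 * Returns NULL on success, or a static error message.
 */
static const char *
parse_wait_options(const Option *opts, size_t nopts, WaitOptions *out)
{
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "timeout") == 0)
		{
			char	   *end;
			long		val;

			if (timeout_seen)
				return "conflicting or redundant options";
			if (opts[i].value == NULL)
				return "timeout requires a value";
			val = strtol(opts[i].value, &end, 10);
			if (end == opts[i].value || *end != '\0' || val < 0)
				return "invalid timeout value";
			out->timeout_ms = val;
			timeout_seen = true;
		}
		else if (strcmp(opts[i].name, "no_throw") == 0)
		{
			if (no_throw_seen)
				return "conflicting or redundant options";
			out->no_throw = true;
			no_throw_seen = true;
		}
		else
			return "unrecognized option";
	}

	return NULL;
}
```

With this shape, adding a new option means adding one more branch to the loop, and the grammar itself does not need to change.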

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

I will make LSN an unreserved keyword as well.

Here's the updated v11. Many thanks Jian for off-list discussions and review.

v12 removed the unused WaitStmt and WaitStmtParam entries from pgindent/typedefs.list.

Hi, I’ve split the patch into a series of smaller patches for easier review,
per Michael’s advice [1].

[1] /messages/by-id/aOMsv9TszlB1n-W7@paquier.xyz

Patch 2 in v13 is corrupted and patch 3 has an error. Sorry for the
noise. Here's v14.

Best,
Xuneng

Attachments:

v14-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 40b49e1f21ab0af763e2875614a5105bad4fb2f6 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 22:46:31 +0800
Subject: [PATCH v14] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for a specific WAL location
to be replayed.  This is useful when the user makes some data changes on the
primary and needs a guarantee that these changes are visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  24 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 ++++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 293 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 948 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number of
+          milliseconds.  It can also be given as a string literal containing an
+          integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error is thrown in the case of a timeout or when
+          running on the primary.  In this case the result status can be obtained
+          from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				 WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 621f790bbdb..c5d269d6e06 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
  * or replica got promoted before the target LSN replayed.
  */
 WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 {
 	XLogRecPtr	currentLSN;
+	TimestampTz	endtime = 0;
 	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
 
 	/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	/* Should have a valid proc number */
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
+	if (timeout > 0) {
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
 	/*
 	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	for (;;)
 	{
 		int			rc;
+		long		delay_ms = 0;
 		currentLSN = GetXLogReplayRecPtr(NULL);
 
 		/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 				break;
 		}
 
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
-		rc = WaitLatch(MyLatch, wake_events, -1,
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
 					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
 
 		/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	 */
 	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
 	return WAIT_LSN_RESULT_SUCCESS;
 }
 
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first make sure that
+	 * we don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we
+	 * should be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 21caf2d43bf..1d016df1f6b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 441bf475b4d..2e33a1d22d0 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
 	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
 	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
 										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
 } WaitLSNResult;
 
 /*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
 extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
 extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc09d1a3f03..c741099e186 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0
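
The tests above expect WAIT FOR ... WITH (no_throw) to report one of three status strings: "success", "timeout", or "not in recovery". A minimal stand-alone sketch of such a mapping follows. The enum is re-declared here only so the example compiles by itself, and wait_result_status() is a hypothetical helper name, not a function taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-alone copy of the result codes from xlogwait.h, for illustration. */
typedef enum
{
	WAIT_LSN_RESULT_SUCCESS,			/* Target LSN is reached */
	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during wait */
	WAIT_LSN_RESULT_TIMEOUT				/* Timeout occurred */
} WaitLSNResult;

/*
 * Hypothetical helper (an assumption, not part of the patch): map a result
 * code to the status string that the TAP test compares against.
 */
static const char *
wait_result_status(WaitLSNResult result)
{
	switch (result)
	{
		case WAIT_LSN_RESULT_SUCCESS:
			return "success";
		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
			return "not in recovery";
		case WAIT_LSN_RESULT_TIMEOUT:
			return "timeout";
	}
	return NULL;				/* not reached for valid result codes */
}
```

Without no_throw, the same non-success results surface as errors instead, which is what the "timed out while waiting for target LSN" and "recovery is not in progress" checks in the test exercise.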

v14-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v14 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
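
The split between allocating and initializing that this patch introduces can be illustrated outside PostgreSQL with a toy structure. The sketch below is only an illustration under stated assumptions: toy_heap, toy_heap_initialize(), toy_heap_allocate(), toy_shared_state and toy_lsn_cmp() are made-up names, and malloc() stands in for palloc(). It shows the allocating variant delegating to the in-place one, so that a heap embedded in a larger (for example, shared memory) struct can be set up without any allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy comparator type: same shape as a three-argument heap comparator. */
typedef int (*toy_comparator) (const void *a, const void *b, void *arg);

/* Toy heap header; only the fields needed to show initialization. */
typedef struct toy_heap
{
	toy_comparator compare;
	void	   *arg;
	void	   *root;
} toy_heap;

/* Initialize a caller-provided heap in place; no memory is allocated. */
static void
toy_heap_initialize(toy_heap *heap, toy_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate a heap in process-local memory, then reuse the in-place init. */
static toy_heap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	toy_heap   *heap = malloc(sizeof(toy_heap));

	if (heap == NULL)
		return NULL;
	toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* Stand-in for a struct that would live in shared memory. */
typedef struct toy_shared_state
{
	uint64_t	min_waited;
	toy_heap	waiters;		/* embedded, so it needs in-place init */
} toy_shared_state;

/* Inverted comparison: the smaller value sorts as "greater" (heap top). */
static int
toy_lsn_cmp(const void *a, const void *b, void *arg)
{
	uint64_t	x = *(const uint64_t *) a;
	uint64_t	y = *(const uint64_t *) b;

	(void) arg;
	return (x < y) ? 1 : (x > y) ? -1 : 0;
}
```

A caller that owns a toy_shared_state would call toy_heap_initialize(&state.waiters, toy_lsn_cmp, NULL), while code that wants a private heap keeps using toy_heap_allocate().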

v14-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 645e19b2d0d522c16eb731da527baf18f73a7ec2 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 22:12:23 +0800
Subject: [PATCH v14 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 503 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 625 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..621f790bbdb
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into two pairing heaps, replayWaitersHeap and flushWaitersHeap,
+ *		which allow for very fast finding of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSNs are cached as minWaitedReplayLSN
+ *		and minWaitedFlushLSN.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its shared memory information.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast path check minWaitedReplayLSN > replayLSN.  If
+ *		this check is negative, it checks replayWaitersHeap and wakes up the
+ *		backends whose awaited LSNs are reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack.  It should be enough to take a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find an LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already reached.
+		 * Since this is a time-consuming operation, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		currentLSN = GetXLogReplayRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0
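
For readers skimming the patch, the batching in wakeupWaiters() can be sketched outside PostgreSQL roughly as follows. This is only an illustration under stated assumptions: the spinlock, the flat `waiters` array, the linear-scan `first_waiter()` and the `woken` flag are hypothetical stand-ins for WaitLSNLock, the pairing heap, pairingheap_first() and SetLatch(); none of these names come from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BATCH_SIZE 4			/* stands in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define MAX_WAITERS 32			/* hypothetical capacity of this toy queue */

typedef struct Waiter
{
	uint64_t	wait_lsn;		/* LSN this waiter is waiting for */
	bool		queued;			/* still registered in the queue? */
	bool		woken;			/* stand-in for a latch being set */
} Waiter;

atomic_flag queue_lock = ATOMIC_FLAG_INIT;	/* stand-in for WaitLSNLock */
Waiter		waiters[MAX_WAITERS];
int			nwaiters = 0;

void lock_queue(void) { while (atomic_flag_test_and_set(&queue_lock)) {} }
void unlock_queue(void) { atomic_flag_clear(&queue_lock); }

/* Register a waiter; returns its slot, or -1 if the toy queue is full. */
int
add_waiter(uint64_t lsn)
{
	int			slot = -1;

	lock_queue();
	if (nwaiters < MAX_WAITERS)
	{
		slot = nwaiters++;
		waiters[slot] = (Waiter) {.wait_lsn = lsn, .queued = true, .woken = false};
	}
	unlock_queue();
	return slot;
}

/* Linear scan for the queued waiter with the smallest LSN; -1 if none. */
int
first_waiter(void)
{
	int			best = -1;

	for (int i = 0; i < nwaiters; i++)
		if (waiters[i].queued &&
			(best < 0 || waiters[i].wait_lsn < waiters[best].wait_lsn))
			best = i;
	return best;
}

/* Wake all waiters with wait_lsn <= current_lsn; returns how many. */
int
wake_waiters(uint64_t current_lsn)
{
	int			batch[BATCH_SIZE];
	int			n;
	int			total = 0;

	do
	{
		n = 0;
		lock_queue();
		/* Collect at most BATCH_SIZE reached waiters while holding the lock. */
		while (n < BATCH_SIZE)
		{
			int			idx = first_waiter();

			if (idx < 0 || waiters[idx].wait_lsn > current_lsn)
				break;
			waiters[idx].queued = false;
			batch[n++] = idx;
		}
		unlock_queue();

		/* Signal outside the lock; a real implementation would set a latch. */
		assert(n <= BATCH_SIZE);
		for (int i = 0; i < n; i++)
			waiters[batch[i]].woken = true;
		total += n;
	} while (n == BATCH_SIZE);	/* a full batch means there may be more */

	return total;
}
```

The point of the do/while is that the stack array stays small and fixed while the lock is never held across the (potentially slow) signalling step; a full batch simply triggers another pass.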

#29Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#28)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Wed, Oct 15, 2025 at 8:23 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Oct 14, 2025 at 9:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

I'll make LSN an unreserved keyword as well.
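
To make the suggestion above concrete, looping over a list of name/value options and matching each name in C might look roughly like the standalone sketch below. It deliberately does not use PostgreSQL's DefElem or list API: `Option`, `WaitOptions` and `parse_wait_options()` are hypothetical names, and the option spellings "timeout" and "throw" and their value formats are assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parsed option: a name, optionally a value. */
typedef struct Option
{
	const char *name;
	const char *value;			/* NULL when only the name was given */
} Option;

/* Hypothetical result of interpreting the option list. */
typedef struct WaitOptions
{
	long		timeout_ms;		/* 0 means "no timeout" in this sketch */
	bool		throw_on_error;
} WaitOptions;

/*
 * Walk the option list once, rejecting unknown names, duplicates and
 * malformed values.  Returns 0 on success and -1 on any error.
 */
int
parse_wait_options(const Option *opts, size_t nopts, WaitOptions *out)
{
	bool		seen_timeout = false;
	bool		seen_throw = false;

	assert(out != NULL);
	out->timeout_ms = 0;
	out->throw_on_error = true;

	for (size_t i = 0; i < nopts; i++)
	{
		const Option *o = &opts[i];

		if (strcmp(o->name, "timeout") == 0)
		{
			char	   *end;

			if (seen_timeout || o->value == NULL)
				return -1;
			out->timeout_ms = strtol(o->value, &end, 10);
			if (end == o->value || *end != '\0' || out->timeout_ms < 0)
				return -1;
			seen_timeout = true;
		}
		else if (strcmp(o->name, "throw") == 0)
		{
			if (seen_throw)
				return -1;
			/* In this sketch a bare option name is treated as "true". */
			if (o->value == NULL || strcmp(o->value, "true") == 0)
				out->throw_on_error = true;
			else if (strcmp(o->value, "false") == 0)
				out->throw_on_error = false;
			else
				return -1;
			seen_throw = true;
		}
		else
			return -1;			/* unrecognized option */
	}
	return 0;
}
```

Keeping the grammar generic and doing all interpretation in one loop like this is what lets new options be added later without touching the parser.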

Here's the updated v11. Many thanks Jian for off-list discussions and review.

v12 removed unused
+WaitStmt
+WaitStmtParam in pgindent/typedefs.list.

Hi, I've split the patch into multiple patches for easier review,
per Michael's advice [1].

[1] /messages/by-id/aOMsv9TszlB1n-W7@paquier.xyz

Patch 2 in v13 is corrupted and patch 3 has an error. Sorry for the
noise. Here's v14.

Made minor changes to the #include directives of xlogwait.h in patch 2 to calm the CF bot down.

Best,
Xuneng

Attachments:

v15-0001-Add-pairingheap_initialize-for-shared-memory-usag copy.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v15 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
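
As a rough illustration of why an in-place initializer matters, the standalone sketch below embeds two heap headers in a struct that lives in storage the caller already owns. It is an assumption-laden toy, not the pairingheap API: `TinyHeap`, `tinyheap_initialize()`, `tinyheap_allocate()`, `SharedState` and `cmp_int()` are hypothetical names, malloc() stands in for palloc(), and a static variable stands in for a shared memory segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*heap_cmp) (const void *a, const void *b, void *arg);

/* Minimal heap header: a comparator, its argument and a root pointer. */
typedef struct TinyHeap
{
	heap_cmp	compare;
	void	   *arg;
	void	   *root;
} TinyHeap;

/* Initialize caller-provided storage; nothing is allocated here. */
void
tinyheap_initialize(TinyHeap *heap, heap_cmp compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocating variant, now just a thin wrapper around the initializer. */
TinyHeap *
tinyheap_allocate(heap_cmp compare, void *arg)
{
	TinyHeap   *heap = malloc(sizeof(TinyHeap));

	if (heap != NULL)
		tinyheap_initialize(heap, compare, arg);
	return heap;
}

/* Stand-in for a shared-memory struct that embeds its heaps directly. */
typedef struct SharedState
{
	TinyHeap	replay_heap;
	TinyHeap	flush_heap;
} SharedState;

/* Static storage plays the role of a shared memory segment in this toy. */
SharedState shared_state;

/* Three-way integer comparison used as the example comparator. */
int
cmp_int(const void *a, const void *b, void *arg)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	(void) arg;
	return (x > y) - (x < y);
}

/* Set up both embedded heaps without any process-local allocation. */
void
shared_state_init(SharedState *state)
{
	tinyheap_initialize(&state->replay_heap, cmp_int, NULL);
	tinyheap_initialize(&state->flush_heap, cmp_int, NULL);
}
```

Because the heap header sits inside the larger struct, every process that maps the same memory sees the same header, which a process-local allocation could not provide.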

v15-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v15] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 503 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 625 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h
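
The fast-path check described in the commit message can be sketched with C11 atomics roughly as below. This is a simplified model under stated assumptions: `min_waited_lsn`, `publish_min_waited()`, `maybe_wake_waiters()` and `slow_path_calls` are hypothetical names, the slow path is reduced to a counter, and resetting the minimum assumes every waiter was woken, whereas the patch recomputes it from the heap under WaitLSNLock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* UINT64_MAX means "nobody is waiting", like PG_UINT64_MAX in the patch. */
_Atomic uint64_t min_waited_lsn = UINT64_MAX;

/* Counts how often the (simulated) locked slow path had to run. */
int			slow_path_calls = 0;

/* A waiter publishes the smallest LSN anyone is currently waiting for. */
void
publish_min_waited(uint64_t lsn)
{
	atomic_store(&min_waited_lsn, lsn);
}

/*
 * Called after WAL progress reaches current_lsn.  Returns true only when
 * the slow path was taken, i.e. when some waiter may have been satisfied.
 */
bool
maybe_wake_waiters(uint64_t current_lsn)
{
	/* Fast path: one lock-free read, no waiter can be satisfied yet. */
	if (atomic_load(&min_waited_lsn) > current_lsn)
		return false;

	/* Slow path: take the lock, pop reached waiters, set their latches. */
	slow_path_calls++;

	/* This sketch assumes all waiters were woken, so none remain. */
	atomic_store(&min_waited_lsn, UINT64_MAX);
	return true;
}
```

With no waiters registered the cached value stays at its maximum, so the common case costs the waking process a single atomic read per call.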

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-process array holding the awaited LSN of each process.  The
+ *		elements of that array are organized into two pairing heaps,
+ *		replayWaitersHeap and flushWaitersHeap, which allow for very fast
+ *		finding of the least awaited LSN of each kind.
+ *
+ *		In addition, the least awaited LSNs are cached as minWaitedReplayLSN
+ *		and minWaitedFlushLSN.  A waiter publishes information about itself
+ *		in shared memory and waits on its latch until it is woken up by the
+ *		appropriate process, the standby is promoted, or the postmaster dies.
+ *		Then it removes its information from shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedReplayLSN > replayLSN.  If
+ *		this check fails, it scans replayWaitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up, allocated on the stack by
+ * wakeupWaiters().  It should be enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * Setting latches can be time-consuming, so we do this outside of
+		 * WaitLSNLock.  This is fine because procLatch is never freed, so
+		 * at worst we set the latch of the wrong process (or of no
+		 * process).
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		currentLSN = GetXLogReplayRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0
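
For readers following the wake-up path in xlogwait.c above, here is a
minimal standalone sketch of the "collect under lock, wake after unlock"
pattern with a fixed-size batch.  It is only an illustration under stated
assumptions, not the patch's code: Waiter, add_waiter(), wake_up_to(),
BATCH_SIZE and the lock_held flag are all hypothetical stand-ins, and a
sorted array replaces the pairing heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 16
#define BATCH_SIZE	4		/* hypothetical; plays the role of the static array size */

/* Hypothetical waiter; 'woken' stands in for a process latch */
typedef struct Waiter
{
	uint64_t	wait_lsn;
	bool		woken;
} Waiter;

static bool lock_held = false;		/* toy stand-in for an LWLock */
static Waiter *queue[MAX_WAITERS];	/* sorted by wait_lsn, smallest first */
static int	nqueued = 0;

static void lock_acquire(void) { assert(!lock_held); lock_held = true; }
static void lock_release(void) { assert(lock_held); lock_held = false; }

/* Insert a waiter, keeping the queue ordered by LSN */
static void
add_waiter(Waiter *w)
{
	int			pos;

	lock_acquire();
	assert(nqueued < MAX_WAITERS);
	pos = nqueued;
	while (pos > 0 && queue[pos - 1]->wait_lsn > w->wait_lsn)
	{
		queue[pos] = queue[pos - 1];
		pos--;
	}
	queue[pos] = w;
	nqueued++;
	lock_release();
}

/* Wake every waiter with wait_lsn <= current_lsn; returns how many */
static int
wake_up_to(uint64_t current_lsn)
{
	int			total = 0;
	int			n;

	do
	{
		Waiter	   *batch[BATCH_SIZE];
		int			i;

		/* Under the lock: only record whom to wake, and dequeue them */
		n = 0;
		lock_acquire();
		while (n < nqueued && n < BATCH_SIZE &&
			   queue[n]->wait_lsn <= current_lsn)
		{
			batch[n] = queue[n];
			n++;
		}
		for (i = n; i < nqueued; i++)
			queue[i - n] = queue[i];
		nqueued -= n;
		lock_release();

		/* Outside the lock: perform the comparatively slow wake-ups */
		for (i = 0; i < n; i++)
		{
			assert(!lock_held);
			batch[i]->woken = true;
		}
		total += n;

		/* A full batch means more eligible waiters may remain */
	} while (n == BATCH_SIZE);

	return total;
}
```

Calling wake_up_to() with six eligible waiters and BATCH_SIZE = 4 takes
two passes through the loop (four waiters, then two), which mirrors why
the loop repeats only while the batch array was filled completely.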

v15-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 72b1c2063710693b1976268e8be99a74a8533956 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v15] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  24 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 ++++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 293 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 948 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
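
The xlogwait.c hunks in this patch add a timeout to the replay wait
loop: a deadline is computed once, and each iteration re-derives the
remaining time before sleeping.  The sketch below shows that pattern in
isolation with a fake clock so it can be run standalone.  Everything in
it (wait_for_lsn(), fake_sleep(), the fake_* variables, the 10 ms sleep
cap, and passing -1 for "no deadline") is a hypothetical stand-in chosen
for the illustration, not the server's API.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
	WAIT_SUCCESS,
	WAIT_TIMEOUT
} WaitResult;

/* Hypothetical test hooks: a fake clock and a fake replay position */
static int64_t fake_now_ms = 0;
static uint64_t fake_replay_lsn = 0;
static uint64_t replay_step = 0;	/* replay progress per sleep */

static int64_t
now_ms(void)
{
	return fake_now_ms;
}

/*
 * Stand-in for a latch wait: sleeps at most max_ms (-1 means no limit),
 * but never more than 10 ms, imitating spurious or early wake-ups.
 */
static void
fake_sleep(int64_t max_ms)
{
	int64_t		slept = (max_ms < 0 || max_ms > 10) ? 10 : max_ms;

	fake_now_ms += slept;
	fake_replay_lsn += replay_step;
}

/* Wait until target_lsn is replayed; timeout_ms <= 0 means wait forever */
static WaitResult
wait_for_lsn(uint64_t target_lsn, int64_t timeout_ms)
{
	int64_t		deadline = 0;

	/* Compute the deadline once, so early wake-ups don't extend the wait */
	if (timeout_ms > 0)
		deadline = now_ms() + timeout_ms;

	for (;;)
	{
		int64_t		remaining = -1;

		/* Check the condition first: the target may already be reached */
		if (fake_replay_lsn >= target_lsn)
			return WAIT_SUCCESS;

		if (timeout_ms > 0)
		{
			remaining = deadline - now_ms();
			if (remaining <= 0)
				return WAIT_TIMEOUT;
		}

		/* Sleep for at most the remaining time, then re-check */
		fake_sleep(remaining);
	}
}
```

Recomputing the remaining time on every iteration is what keeps the
total wait bounded by the requested timeout even when the sleep returns
early several times in a row.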

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the outcome is reported
+    through the returned status instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal containing
+          either an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error is thrown in the case of a timeout or
+          when running on the primary.  In this case, the result status can
+          be obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes successfully, the value returned by
+    <function>pg_last_wal_replay_lsn</function> is greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful for achieving
+    read-your-writes consistency while using an asynchronous replica for
+    reads and the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    errors out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command
+    returns immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, then walk
+			 * over the shared-memory waiters heap and set latches to notify
+			 * the waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
  * or replica got promoted before the target LSN replayed.
  */
 WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 {
 	XLogRecPtr	currentLSN;
+	TimestampTz	endtime = 0;
 	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
 
 	/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	/* Should have a valid proc number */
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
+	if (timeout > 0) {
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
 	/*
 	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	for (;;)
 	{
 		int			rc;
+		long		delay_ms = 0;
 		currentLSN = GetXLogReplayRecPtr(NULL);
 
 		/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 				break;
 		}
 
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
-		rc = WaitLatch(MyLatch, wake_events, -1,
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
 					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
 
 		/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	 */
 	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
 	return WAIT_LSN_RESULT_SUCCESS;
 }
 
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first make sure that
+	 * we don't hold a snapshot, so that our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records,
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, the
+	 * preparation is done.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..c9e0738724b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
 	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
 	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
 										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
 } WaitLSNResult;
 
 /*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
 extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
 extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# is finished, the log_count() function logs whether at least as many rows
+# are visible as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0

#30Álvaro Herrera
alvherre@kurilemu.de
In reply to: Xuneng Zhou (#28)
Re: Implement waiting for wal lsn replay: reloaded

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

[1]: /messages/by-id/202510101352.vvp4p3p2dblu@alvherre.pgsql

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)

#31Xuneng Zhou
xunengzhou@gmail.com
In reply to: Álvaro Herrera (#30)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree: unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, since our grammar requires '(' immediately after WITH, the
lookahead mechanism will keep it as a regular WITH. Any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
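
To illustrate the lookahead point, here is a toy C sketch of the token
filter idea: WITH is reported as WITH_LA only when the next token is TIME
or ORDINALITY, so "WITH (" always reaches the grammar as a plain WITH. The
enum and the function names (ToyToken, toy_filter_with,
toy_starts_wait_options) are made up for illustration; this is not the
actual code in parser.c.

```c
#include <stdbool.h>

/* Toy token codes; the real ones are generated by bison from gram.y. */
typedef enum ToyToken
{
	TOY_WITH,			/* plain WITH, as used by WITH ( ... ) */
	TOY_WITH_LA,		/* WITH that is followed by TIME or ORDINALITY */
	TOY_TIME,
	TOY_ORDINALITY,
	TOY_LPAREN,
	TOY_OTHER
} ToyToken;

/*
 * Given the token that follows a WITH keyword, return the token code the
 * grammar would see for WITH itself.
 */
static ToyToken
toy_filter_with(ToyToken next)
{
	if (next == TOY_TIME || next == TOY_ORDINALITY)
		return TOY_WITH_LA;
	return TOY_WITH;
}

/* True if "WITH <next>" can start the option list of WAIT FOR LSN. */
static bool
toy_starts_wait_options(ToyToken next)
{
	return toy_filter_with(next) == TOY_WITH && next == TOY_LPAREN;
}
```

With this model, toy_starts_wait_options(TOY_LPAREN) is true, while TIME,
ORDINALITY, or any other following token is rejected, which matches the
"syntax error anyway" outcome described above.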

Best,
Xuneng

Attachments:

v16-0001-Add-pairingheap_initialize-for-shared-memory-usag copy.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v16 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
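
A minimal sketch of the allocate-versus-initialize split, assuming
simplified stand-in types: toy_heap, toy_heap_initialize(),
toy_heap_allocate() and toy_shared_state are hypothetical names, and
malloc() stands in for palloc(). It only shows the shape of the change,
not the real pairingheap API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the types in lib/pairingheap.h. */
typedef struct toy_node toy_node;
typedef int (*toy_comparator) (const toy_node *a, const toy_node *b,
							   void *arg);

typedef struct toy_heap
{
	toy_comparator compare;
	void	   *arg;
	toy_node   *root;
} toy_heap;

/* Initialize caller-provided storage in place; nothing is allocated here. */
static void
toy_heap_initialize(toy_heap *heap, toy_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate process-local memory, then reuse the in-place initializer. */
static toy_heap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	toy_heap   *heap = malloc(sizeof(toy_heap));

	if (heap != NULL)
		toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* Hypothetical shared state that embeds the heap instead of pointing to it. */
typedef struct toy_shared_state
{
	int			num_waiters;
	toy_heap	waiters;
} toy_shared_state;
```

A caller that owns a toy_shared_state (for example, one placed in a shared
memory segment) would call toy_heap_initialize(&state->waiters, cmp, NULL)
and never needs a process-local allocation for the heap header.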

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
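
The allocate/initialize split in the patch above can be sketched in
isolation as follows. This is a simplified stand-in, not the PostgreSQL
code: the demo_* types and functions are hypothetical, malloc() stands in
for palloc(), and demo_shared_state merely imitates a shared-memory struct
that embeds the heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the pairing heap types. */
typedef struct demo_heap_node demo_heap_node;

typedef int (*demo_comparator) (const demo_heap_node *a,
								const demo_heap_node *b,
								void *arg);

typedef struct demo_heap
{
	demo_comparator compare;
	void	   *arg;
	demo_heap_node *root;
} demo_heap;

/* Imitates a control struct living in shared memory with an embedded heap. */
typedef struct demo_shared_state
{
	demo_heap	waiters;
} demo_shared_state;

/* Trivial comparator used only to have something to pass around. */
int
demo_compare_equal(const demo_heap_node *a, const demo_heap_node *b,
				   void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}

/* Initialize a heap in place; the caller owns the memory. */
void
demo_heap_initialize(demo_heap *heap, demo_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
demo_heap *
demo_heap_allocate(demo_comparator compare, void *arg)
{
	demo_heap  *heap = malloc(sizeof(demo_heap));

	if (heap == NULL)
		return NULL;
	demo_heap_initialize(heap, compare, arg);
	return heap;
}
```

A caller that embeds the heap in a larger struct would call
demo_heap_initialize(&state->waiters, ...) directly, while local users keep
calling the allocating wrapper.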

v16-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 38971b2448786de5f58ba9be088d4e7e8fc11987 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v16 3/3] Implement WAIT FOR command

WAIT FOR is intended to be used on a standby and waits for a specific
WAL location to be replayed.  This command is useful when the user makes
data changes on the primary and needs a guarantee that these changes are
visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>

---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  24 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 301 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 956 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the command returns
+    the result status instead of throwing an error.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal containing
+          an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error is thrown in the case of a timeout or
+          when running on the primary.  In this case, the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the
+    last modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up any pending wait for LSN.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
  * or replica got promoted before the target LSN replayed.
  */
 WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 {
 	XLogRecPtr	currentLSN;
+	TimestampTz	endtime = 0;
 	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
 
 	/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	/* Should have a valid proc number */
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
+	if (timeout > 0) {
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
 	/*
 	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	for (;;)
 	{
 		int			rc;
+		long		delay_ms = 0;
 		currentLSN = GetXLogReplayRecPtr(NULL);
 
 		/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 				break;
 		}
 
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
-		rc = WaitLatch(MyLatch, wake_events, -1,
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
 					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
 
 		/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	 */
 	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
 	return WAIT_LSN_RESULT_SUCCESS;
 }
 
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  First, we must make sure we
+	 * don't hold a snapshot and, correspondingly, that MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * To begin with, check that there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Next, invalidate the catalog snapshot, if any.  After that, we should be
+	 * done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
 	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
 	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
 										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
 } WaitLSNResult;
 
 /*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
 extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
 extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..cc709670e09
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,301 @@
+# Checks waiting for the LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far ahead, so that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, the log_count() function logs whether at least as many rows are
+# visible as there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN
+# we have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0

v16-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v16 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 503 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 625 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h
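
The fast-path check described in the commit message above can be sketched in standalone C. This is only an illustration under stated assumptions, not the patch's code: a plain array-based binary min-heap stands in for PostgreSQL's pairing heap, locking under WaitLSNLock and latch setting are omitted, and every name here (WaiterSet, waiter_add, wake_reached, NO_WAITERS) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 64
#define NO_WAITERS UINT64_MAX	/* sentinel: nobody is waiting */

typedef uint64_t Lsn;

/* Hypothetical stand-in for the shared state; not the patch's WaitLSNState. */
typedef struct
{
	Lsn			heap[MAX_WAITERS];	/* binary min-heap of awaited LSNs */
	size_t		count;
	_Atomic Lsn min_waited;		/* cached heap minimum, read without a lock */
} WaiterSet;

static void
waiter_set_init(WaiterSet *ws)
{
	ws->count = 0;
	atomic_init(&ws->min_waited, NO_WAITERS);
}

/* Re-publish the cached minimum after any change to the heap. */
static void
refresh_min(WaiterSet *ws)
{
	atomic_store(&ws->min_waited, ws->count ? ws->heap[0] : NO_WAITERS);
}

/* Register a waiter for the given LSN (sift-up insertion). */
static void
waiter_add(WaiterSet *ws, Lsn lsn)
{
	size_t		i;

	assert(ws->count < MAX_WAITERS);
	i = ws->count++;
	while (i > 0 && ws->heap[(i - 1) / 2] > lsn)
	{
		ws->heap[i] = ws->heap[(i - 1) / 2];
		i = (i - 1) / 2;
	}
	ws->heap[i] = lsn;
	refresh_min(ws);
}

/* Remove and return the smallest awaited LSN (sift-down). */
static Lsn
pop_min(WaiterSet *ws)
{
	Lsn			top = ws->heap[0];
	Lsn			last = ws->heap[--ws->count];
	size_t		i = 0;

	for (;;)
	{
		size_t		child = 2 * i + 1;

		if (child >= ws->count)
			break;
		if (child + 1 < ws->count && ws->heap[child + 1] < ws->heap[child])
			child++;
		if (ws->heap[child] >= last)
			break;
		ws->heap[i] = ws->heap[child];
		i = child;
	}
	if (ws->count > 0)
		ws->heap[i] = last;
	return top;
}

/*
 * Called after WAL has advanced to 'current'.  Returns how many waiters
 * were released.  The first test is the fast path: one atomic read decides
 * whether the heap needs to be touched at all.
 */
static size_t
wake_reached(WaiterSet *ws, Lsn current)
{
	size_t		woken = 0;

	if (atomic_load(&ws->min_waited) > current)
		return 0;				/* nobody is waiting for an LSN this low */

	while (ws->count > 0 && ws->heap[0] <= current)
	{
		(void) pop_min(ws);		/* real code would set the waiter's latch */
		woken++;
	}
	refresh_min(ws);
	return woken;
}
```

With waiters at LSNs 100, 200 and 300, a call with current = 50 returns immediately via the fast path, and a call with current = 200 releases two waiters and leaves 300 as the cached minimum.
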

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into two pairing heaps, replayWaitersHeap and flushWaitersHeap, which
+ *		allow for very fast finding of the least-awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached (minWaitedReplayLSN and
+ *		minWaitedFlushLSN).  The waiter process publishes information about
+ *		itself to the shared memory and waits on its latch until it is woken
+ *		up by the appropriate process, the standby is promoted, or the
+ *		postmaster dies.  Then, it cleans up its information in shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check, minWaitedReplayLSN > replayLSN.  If
+ *		that check fails, it walks replayWaitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack.  It should be enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already reached.
+		 * Since setting latches can be time-consuming, we do this outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we set the wrong process' (or no process')
+		 * latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  It might happen that the
+	 * target LSN gets replayed before we are added.  A check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		currentLSN = GetXLogReplayRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already have been deleted by the startup process.  The 'inReplayHeap'
+	 * flag protects us from double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already have been deleted by the waker process.  The 'inFlushHeap'
+	 * flag protects us from double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0

#32Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#31)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree with this as unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, since we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as a regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
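
To make the lookahead rule above concrete, here is a minimal stand-alone sketch. It is not the actual parser.c code: the enum, the function names, and the idea of passing the next token as a string are all assumptions made for illustration. It only shows the decision itself: WITH becomes the special lookahead token when the next word is TIME or ORDINALITY, and stays a regular WITH otherwise (including before '(').

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes, for illustration only. */
typedef enum DemoToken
{
    DEMO_WITH,      /* ordinary WITH keyword */
    DEMO_WITH_LA    /* WITH followed by TIME or ORDINALITY */
} DemoToken;

/* Case-insensitive equality of two NUL-terminated ASCII strings. */
int
demo_keyword_eq(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == '\0' && *b == '\0';
}

/* Decide how a WITH keyword is reported, given the text of the next token. */
DemoToken
demo_classify_with(const char *next_token)
{
    assert(next_token != NULL);

    if (demo_keyword_eq(next_token, "time") ||
        demo_keyword_eq(next_token, "ordinality"))
        return DEMO_WITH_LA;
    return DEMO_WITH;
}
```

With this rule, `WITH (` is classified as a regular WITH, so the new production matches, while `WITH TIME` never reaches it.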

The filename of patch 1 was incorrect due to a copying mistake. I've just corrected it.

Best,
Xuneng

Attachments:

v16-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v16 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
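
As a rough illustration of the split described above, the following sketch uses simplified stand-in types rather than the real pairingheap definitions; every demo_* name is hypothetical, and malloc() merely stands in for palloc(). The allocating variant delegates to the in-place initializer, so a heap embedded in a larger struct can be set up without any allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the real pairingheap types; illustration only. */
typedef struct demo_node demo_node;
typedef int (*demo_comparator) (const demo_node *a, const demo_node *b,
                                void *arg);

typedef struct demo_heap
{
    demo_comparator compare;
    void       *arg;
    demo_node  *root;
} demo_heap;

/* Initialize caller-provided storage; performs no allocation. */
void
demo_heap_initialize(demo_heap *heap, demo_comparator compare, void *arg)
{
    assert(heap != NULL);
    heap->compare = compare;
    heap->arg = arg;
    heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
demo_heap *
demo_heap_allocate(demo_comparator compare, void *arg)
{
    demo_heap  *heap = malloc(sizeof(demo_heap));

    if (heap != NULL)
        demo_heap_initialize(heap, compare, arg);
    return heap;
}

/* A heap embedded in a larger struct, as one would place in shared memory. */
typedef struct demo_shared_state
{
    int         generation;
    demo_heap   waiters;
} demo_shared_state;
```

A caller that owns a demo_shared_state would call demo_heap_initialize(&state.waiters, ...) once during setup.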

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0

v16-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 38971b2448786de5f58ba9be088d4e7e8fc11987 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v16 3/3] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for
a specific WAL location to be replayed.  This command is useful when
the user makes some data changes on the primary and needs a guarantee
that these changes are visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>

---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  24 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 301 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 956 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the command instead
+    reports the outcome through its return value.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number of
+          milliseconds.  It can also be given as a string literal containing an
+          integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error should be thrown in the case of a timeout or
+          when running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a given <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> right after
+    making the changes.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that, the
+   changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example, using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				 WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
  * or replica got promoted before the target LSN replayed.
  */
 WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 {
 	XLogRecPtr	currentLSN;
+	TimestampTz	endtime = 0;
 	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
 
 	/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	/* Should have a valid proc number */
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
+	if (timeout > 0) {
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
 	/*
 	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	for (;;)
 	{
 		int			rc;
+		long		delay_ms = 0;
 		currentLSN = GetXLogReplayRecPtr(NULL);
 
 		/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 				break;
 		}
 
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
-		rc = WaitLatch(MyLatch, wake_events, -1,
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
 					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
 
 		/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
 	 */
 	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
 	return WAIT_LSN_RESULT_SUCCESS;
 }
 
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  Before that, we must ensure
+	 * that we don't hold a snapshot and, correspondingly, that MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  After that, we should
+	 * be done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
 	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
 	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
 										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
 } WaitLSNResult;
 
 /*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
 extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
 extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..cc709670e09
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,301 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, the log_count() function logs whether as many rows are visible
+# as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0

v16-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v16 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 503 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 625 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on its latch until it is woken up by the appropriate
+ *		process, the standby is promoted, or the postmaster dies.  Then, it
+ *		cleans up the information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up in wakeupWaiters(), allocated
+ * on the stack.  It should be enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have been reached.
+		 * Since setting latches is time-consuming, we do this outside of
+		 * WaitLSNLock.  This is actually fine because procLatch is never
+		 * freed, so at worst we set the wrong process' (or no process')
+		 * latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		currentLSN = GetXLogReplayRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0
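
The fast-path check and the batched wakeup described in the patch above can be sketched in isolation roughly as follows. This is a standalone illustration under stated assumptions, not code from the patch: a fixed-size binary min-heap stands in for the pairing heap, locking and latches are left out (comments mark where WaitLSNLock would be held), and the names WaitState, ws_init, ws_add, ws_wakeup, MAX_WAITERS and WAKE_BATCH are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 64	/* hypothetical capacity, stands in for procInfos[] */
#define WAKE_BATCH	4	/* hypothetical, plays the role of
						 * WAKEUP_PROC_STATIC_ARRAY_SIZE */

typedef struct Waiter
{
	uint64_t	lsn;		/* LSN this waiter needs */
	int			procno;		/* who to wake up */
} Waiter;

typedef struct WaitState
{
	uint64_t	min_waited;	/* cached minimum; UINT64_MAX when empty */
	size_t		n;
	Waiter		heap[MAX_WAITERS];	/* binary min-heap ordered by lsn */
} WaitState;

static void
ws_init(WaitState *ws)
{
	ws->n = 0;
	ws->min_waited = UINT64_MAX;
}

static void
ws_swap(Waiter *a, Waiter *b)
{
	Waiter		tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Refresh the cached minimum from the heap top. */
static void
ws_refresh_min(WaitState *ws)
{
	ws->min_waited = (ws->n > 0) ? ws->heap[0].lsn : UINT64_MAX;
}

/* Register a waiter; returns false if the sketch's fixed array is full. */
static bool
ws_add(WaitState *ws, uint64_t lsn, int procno)
{
	size_t		i;

	if (ws->n == MAX_WAITERS)
		return false;
	i = ws->n++;
	ws->heap[i] = (Waiter) {lsn, procno};
	while (i > 0 && ws->heap[(i - 1) / 2].lsn > ws->heap[i].lsn)
	{
		ws_swap(&ws->heap[(i - 1) / 2], &ws->heap[i]);
		i = (i - 1) / 2;
	}
	ws_refresh_min(ws);
	return true;
}

/* Remove and return the waiter with the smallest LSN. */
static Waiter
ws_pop(WaitState *ws)
{
	Waiter		top = ws->heap[0];
	size_t		i = 0;

	assert(ws->n > 0);
	ws->heap[0] = ws->heap[--ws->n];
	for (;;)
	{
		size_t		l = 2 * i + 1, r = l + 1, m = i;

		if (l < ws->n && ws->heap[l].lsn < ws->heap[m].lsn)
			m = l;
		if (r < ws->n && ws->heap[r].lsn < ws->heap[m].lsn)
			m = r;
		if (m == i)
			break;
		ws_swap(&ws->heap[i], &ws->heap[m]);
		i = m;
	}
	return top;
}

/*
 * Wake every waiter whose LSN is <= current_lsn.  Returns how many were
 * woken; the first woken_cap procnos are stored in woken[].
 */
static size_t
ws_wakeup(WaitState *ws, uint64_t current_lsn, int *woken, size_t woken_cap)
{
	size_t		total = 0;
	size_t		batch_len;

	/* Fast path: nobody waits for an LSN this small. */
	if (ws->min_waited > current_lsn)
		return 0;

	do
	{
		int			batch[WAKE_BATCH];

		batch_len = 0;
		/* The real code would hold the lock from here ... */
		while (ws->n > 0 && ws->heap[0].lsn <= current_lsn &&
			   batch_len < WAKE_BATCH)
			batch[batch_len++] = ws_pop(ws).procno;
		ws_refresh_min(ws);
		/* ... to here, and deliver the wakeups only after releasing it. */
		for (size_t i = 0; i < batch_len; i++)
		{
			if (total < woken_cap)
				woken[total] = batch[i];
			total++;
		}
	} while (batch_len == WAKE_BATCH);

	return total;
}
```

With six waiters registered and WAKE_BATCH set to 4, a single ws_wakeup() call that satisfies five of them takes two passes through the loop. That is the chunking behaviour the comment above wakeupWaiters() describes for the case where more waiters are ready than fit into the stack array.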

#33Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#32)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi!

On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree with this as unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.

The filename of patch 1 was incorrect due to a copying mistake. I have just corrected it.

Thank you for rebasing the patch.

I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please, check.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v17-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From 18a1a51c7f7a1bedb23169bbbe8974a9f803b82a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v17 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.39.5 (Apple Git-154)
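
To illustrate the allocate-versus-initialize split described in the 0001 commit message, here is a standalone sketch. It does not use the real pairingheap code, which needs the PostgreSQL tree to build. The names mini_heap, mini_heap_initialize, mini_heap_allocate, shared_state and cmp_int are hypothetical, malloc() stands in for palloc(), and a plain struct stands in for a shared memory struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for pairingheap: comparator, its argument, root. */
typedef int (*mini_comparator) (const void *a, const void *b, void *arg);

typedef struct mini_heap
{
	mini_comparator compare;
	void	   *arg;
	void	   *root;
} mini_heap;

/* Initialize an already-allocated heap in place; nothing is allocated here. */
void
mini_heap_initialize(mini_heap *heap, mini_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
mini_heap *
mini_heap_allocate(mini_comparator compare, void *arg)
{
	mini_heap  *heap = malloc(sizeof(mini_heap));

	if (heap == NULL)
		return NULL;
	mini_heap_initialize(heap, compare, arg);
	return heap;
}

/* Hypothetical struct embedding the heap, as a shared memory struct could. */
typedef struct shared_state
{
	int			nwaiters;
	mini_heap	waiters;
} shared_state;

/* Example comparator: orders ints ascending. */
int
cmp_int(const void *a, const void *b, void *arg)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	(void) arg;
	return (x > y) - (x < y);
}
```

A caller that owns a shared_state would call mini_heap_initialize(&state->waiters, cmp_int, NULL) once at startup. The allocating variant stays available for process-local heaps and shares the same initialization path, which is the shape the patch gives pairingheap_allocate().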

v17-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From a5db333b5b5b9e0c0c27f6f2bfbad8c4cf327f9b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Oct 2025 12:47:02 +0300
Subject: [PATCH v17 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed.  This is useful when the user makes data
changes on the primary and needs a guarantee that these changes are
visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 302 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   1 +
 21 files changed, 931 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the command reports
+    these conditions through its return value instead.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal containing
+          an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of a timeout or when
+          running on the primary.  In this case, the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes successfully, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the
+    last modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that, the
+   changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..7c3a3541221 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char	   *timeout_str;
+			const char *hintmsg;
+			double		result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round into
+			 * range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+						   defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first take care that
+	 * we don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
+	 * command, not a procedure or function.
+	 *
+	 * First, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  Then we should be
+	 * done with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSN().  Throw an appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9796a36a2f6
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+	"get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+	"WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, the log_count() function logs a message if at least the expected
+# number of rows is visible.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we sent \q via $psql_session->quit, the command could be sent to an
+# already-closed session.  So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38d346a3691..d92cb2e6a71 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3269,6 +3269,7 @@ WaitLSNState
 WaitLSNProcInfo
 WaitLSNResult
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

v17-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 2d3e55c71e69e3cf39be10e42a57ad03ebc28217 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Oct 2025 11:58:17 +0300
Subject: [PATCH v17 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSN(), covering replay and flush waits
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.
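
As a rough illustration of the fast-path idea described above (not the code
in this patch), the sketch below caches the smallest awaited LSN in an atomic
so that the waker can return without looking at the waiter list at all.  It
makes several simplifying assumptions: a linear scan over a fixed array
stands in for the pairing heaps, the WaitLSN lock and latches are omitted,
and WaiterSet, add_waiter() and wake_reached() are hypothetical names that
do not exist in xlogwait.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL types. */
typedef uint64_t Lsn;

#define NO_WAITERS  UINT64_MAX	/* sentinel: nobody is waiting */
#define MAX_WAITERS 8

typedef struct WaiterSet
{
	_Atomic Lsn min_waited;		/* cached smallest awaited LSN */
	Lsn			waited[MAX_WAITERS];
	bool		active[MAX_WAITERS];
} WaiterSet;

static void
waiter_set_init(WaiterSet *ws)
{
	atomic_init(&ws->min_waited, NO_WAITERS);
	for (int i = 0; i < MAX_WAITERS; i++)
	{
		ws->waited[i] = 0;
		ws->active[i] = false;
	}
}

/* Recompute the cached minimum (a heap would give this in O(1)). */
static void
refresh_min(WaiterSet *ws)
{
	Lsn			min = NO_WAITERS;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (ws->active[i] && ws->waited[i] < min)
			min = ws->waited[i];
	}
	atomic_store(&ws->min_waited, min);
}

/* Register a waiter in the given slot and publish the new minimum. */
static void
add_waiter(WaiterSet *ws, int slot, Lsn lsn)
{
	assert(slot >= 0 && slot < MAX_WAITERS);
	ws->waited[slot] = lsn;
	ws->active[slot] = true;
	refresh_min(ws);
}

/* Remove every waiter whose LSN is reached; return how many were removed. */
static int
wake_reached(WaiterSet *ws, Lsn current)
{
	int			woken = 0;

	/* Fast path: the smallest awaited LSN is still ahead of us. */
	if (atomic_load(&ws->min_waited) > current)
		return 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (ws->active[i] && ws->waited[i] <= current)
		{
			ws->active[i] = false;
			woken++;
		}
	}
	refresh_min(ws);
	return woken;
}
```

With no waiters the cached value is the sentinel, so every call takes the
fast path; once a waiter is removed the minimum is refreshed so that later
calls can skip the scan again.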

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 409 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 |  98 +++++
 src/include/storage/lwlocklist.h              |   1 +
 src/tools/pgindent/typedefs.list              |   4 +
 8 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..8276c2f0947
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the appropriate
+ *		process, the standby is promoted, or the postmaster dies.  Then, it
+ *		cleans up the information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedLSN > replayLSN.  If this
+ *		check fails, it walks waitersHeap and wakes up the backends whose
+ *		awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		int			i;
+
+		/* Initialize heaps and tracking */
+		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+		{
+			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+		}
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	int			i = (uintptr_t) arg;
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+		minWaitedLSN = procInfo->waitLSN;
+	}
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	Assert(!procInfo->inHeap[i]);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+	procInfo->inHeap[i] = true;
+	updateMinWaitedLSN(operation);
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (procInfo->inHeap[i])
+	{
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+		procInfo->inHeap[i] = false;
+		updateMinWaitedLSN(operation);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the stack-allocated array of procs to wake up in wakeupWaiters().
+ * It should be enough to need only a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached. Record
+		 * process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+			/* Update appropriate flag */
+			procInfo->inHeap[i] = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * SetLatch() is time-consuming, so we do this outside of
+		 * WaitLSNLock.  That is fine because procLatch is never freed; at
+		 * worst we set the wrong process' (or no process') latch.  Use 'j'
+		 * here: 'i' must remain the heap index for the next iteration.
+		 */
+		for (int j = 0; j < numWakeUpProcs; j++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+		return;
+
+	wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		int			i;
+
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only
+		 * possible to get a false positive.  But that will be eliminated by a
+		 * recheck inside deleteLSNWaiter().
+		 */
+
+		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+		{
+			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+				deleteLSNWaiter((WaitLSNType) i);
+		}
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the timeout expires, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was reached,
+ * WAIT_LSN_RESULT_TIMEOUT on timeout, and WAIT_LSN_RESULT_NOT_IN_RECOVERY if
+ * not in recovery or the replica got promoted before reaching the target LSN.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the waiters heap.  The target LSN might be reached
+	 * before we get added.  The check at the beginning of the loop below
+	 * handles that race condition.
+	 */
+	addLSNWaiter(targetLSN, operation);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = -1;
+
+		if (operation == WAIT_LSN_TYPE_REPLAY)
+			currentLSN = GetXLogReplayRecPtr(NULL);
+		else
+			currentLSN = GetFlushRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * reached.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(operation);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been reached */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory heap.  We might have already
+	 * been deleted by the process that woke us up.  The 'inHeap' flag
+	 * prevents a double deletion.
+	 */
+	deleteLSNWaiter(operation);
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to the timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN waiting routines (replay and flush).
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT		/* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
+	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
+	WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Heap membership flags for LSN types */
+	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+
+	/* Heap nodes for LSN types */
+	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN values some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * Pairing heaps of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+								int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 377a7946585..38d346a3691 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3264,6 +3264,10 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.5 (Apple Git-154)

#34Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#33)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Thu, Oct 23, 2025 at 6:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree with this as unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
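
To make the outcome of the grammar change concrete, here is a toy hand-written recognizer in C. It is only an illustration, not the bison grammar from the patch: the function name toy_accepts_wait_for_lsn() and its helpers are hypothetical, the option list contents are not validated, and the WITH_LA lookahead token is not modeled. It only shows that, once WITH is mandatory before an option list, forms 1 and 3 are accepted and form 2 is rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Skip leading whitespace. */
static const char *
skip_ws(const char *p)
{
	while (isspace((unsigned char) *p))
		p++;
	return p;
}

/*
 * Match an upper-case keyword case-insensitively as a whole word.
 * Returns a pointer just past the keyword, or NULL if it does not match.
 */
static const char *
match_kw(const char *p, const char *kw)
{
	p = skip_ws(p);
	while (*kw)
	{
		if (toupper((unsigned char) *p) != *kw)
			return NULL;
		p++;
		kw++;
	}
	/* The keyword must not run into an identifier character. */
	if (isalnum((unsigned char) *p) || *p == '_')
		return NULL;
	return p;
}

/* Match a single-quoted literal such as '123/456' (contents unchecked). */
static const char *
match_literal(const char *p)
{
	p = skip_ws(p);
	if (*p != '\'')
		return NULL;
	p++;
	while (*p && *p != '\'')
		p++;
	return (*p == '\'') ? p + 1 : NULL;
}

/* Match "( ... )" without nesting; the option list itself is not parsed. */
static const char *
match_paren_list(const char *p)
{
	p = skip_ws(p);
	if (*p != '(')
		return NULL;
	p++;
	while (*p && *p != ')')
		p++;
	return (*p == ')') ? p + 1 : NULL;
}

/* Accepts: WAIT FOR LSN 'lsn' [ WITH ( ... ) ] [ ; ] */
bool
toy_accepts_wait_for_lsn(const char *sql)
{
	const char *p = sql;
	const char *q;

	assert(sql != NULL);

	if ((p = match_kw(p, "WAIT")) == NULL ||
		(p = match_kw(p, "FOR")) == NULL ||
		(p = match_kw(p, "LSN")) == NULL ||
		(p = match_literal(p)) == NULL)
		return false;

	/* Toy version of: WITH '(' utility_option_list ')' | EMPTY */
	q = match_kw(p, "WITH");
	if (q != NULL && (p = match_paren_list(q)) == NULL)
		return false;			/* WITH must be followed by '(' ... ')' */

	p = skip_ws(p);
	if (*p == ';')
		p = skip_ws(p + 1);
	return *p == '\0';
}
```

With this sketch, "WAIT FOR LSN '123/456' WITH (opt = val);" and "WAIT FOR LSN '123/456';" are accepted, while the bare parenthesized form and "WITH TIME" are rejected.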

The filename of patch 1 is incorrect due to copying. Just correct it.

Thank you for rebasing the patch.

I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please check.

Thanks for updating the patch set!
Patch 2 looks more elegant after the revision. I’ll review them soon.

Best,
Xuneng

#35Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#34)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Alexander!

On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Oct 23, 2025 at 6:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree with this as unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.

The filename of patch 1 is incorrect due to copying. Just correct it.

Thank you for rebasing the patch.

I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please check.

Thanks for updating the patch set!
Patch 2 looks more elegant after the revision. I’ll review them soon.

I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.
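
For readers skimming the attachments, the wake-up idea behind patch 2 can be pictured with a much-simplified sketch. It is not the patch code: it is single-process, has no LWLock, atomics or latches, and it swaps the pairing heap for a plain array scan. All names (SketchState, sketch_wakeup(), and so on) are hypothetical. It only shows how a cached minimum waited LSN lets the waker return early when no waiter can be satisfied yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_WAITERS 8	/* hypothetical fixed capacity */

typedef struct SketchWaiter
{
	uint64_t	wait_lsn;		/* LSN this waiter needs */
	bool		waiting;		/* plays the role of the inHeap flag */
} SketchWaiter;

typedef struct SketchState
{
	uint64_t	min_waited_lsn; /* cached minimum; UINT64_MAX = no waiters */
	SketchWaiter waiters[SKETCH_MAX_WAITERS];
} SketchState;

void
sketch_init(SketchState *st)
{
	st->min_waited_lsn = UINT64_MAX;
	for (size_t i = 0; i < SKETCH_MAX_WAITERS; i++)
	{
		st->waiters[i].wait_lsn = 0;
		st->waiters[i].waiting = false;
	}
}

/* Recompute the cached minimum by scanning (the patch reads the heap top). */
void
sketch_update_min(SketchState *st)
{
	uint64_t	min = UINT64_MAX;

	for (size_t i = 0; i < SKETCH_MAX_WAITERS; i++)
	{
		if (st->waiters[i].waiting && st->waiters[i].wait_lsn < min)
			min = st->waiters[i].wait_lsn;
	}
	st->min_waited_lsn = min;
}

/* Register the waiter in the given slot and refresh the cached minimum. */
void
sketch_add_waiter(SketchState *st, size_t slot, uint64_t lsn)
{
	assert(slot < SKETCH_MAX_WAITERS && !st->waiters[slot].waiting);
	st->waiters[slot].wait_lsn = lsn;
	st->waiters[slot].waiting = true;
	sketch_update_min(st);
}

/*
 * Release every waiter whose target is <= current_lsn.
 * Returns how many waiters were released.
 */
int
sketch_wakeup(SketchState *st, uint64_t current_lsn)
{
	int			woken = 0;

	/* Fast path: nobody waits for an LSN this low, so do nothing. */
	if (st->min_waited_lsn > current_lsn)
		return 0;

	for (size_t i = 0; i < SKETCH_MAX_WAITERS; i++)
	{
		if (st->waiters[i].waiting && st->waiters[i].wait_lsn <= current_lsn)
		{
			st->waiters[i].waiting = false; /* the patch sets a latch here */
			woken++;
		}
	}
	sketch_update_min(st);
	return woken;
}
```

With waiters at LSNs 200 and 300, calling sketch_wakeup() with 150 takes the fast path, while calling it with 250 releases only the first waiter and moves the cached minimum to 300.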

Best,
Xuneng

Attachments:

v18-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From b0ee110622dacd2d4769da6915580e9c3220c09f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v18 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0

v18-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From b611a90989aec7695349e47fd1fb89d7dd9b1872 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:59:42 +0800
Subject: [PATCH v18 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSN(), covering replay and flush waits
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 409 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 |  98 +++++
 src/include/storage/lwlocklist.h              |   1 +
 src/tools/pgindent/typedefs.list              |   5 +
 8 files changed, 522 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..1f4b38a5114
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the appropriate
+ *		process, standby is promoted, or the postmaster dies.  Then, it cleans
+ *		information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		int			i;
+
+		/* Initialize heaps and tracking */
+		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+		{
+			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+		}
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	int			i = (uintptr_t) arg;
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+		minWaitedLSN = procInfo->waitLSN;
+	}
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	Assert(!procInfo->inHeap[i]);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+	procInfo->inHeap[i] = true;
+	updateMinWaitedLSN(operation);
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (procInfo->inHeap[i])
+	{
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+		procInfo->inHeap[i] = false;
+		updateMinWaitedLSN(operation);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be enough to need a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached. Record
+		 * process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+			/* Update appropriate flag */
+			procInfo->inHeap[i] = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * Since SetLatch() is a time-consuming operation, we do this outside
+		 * of WaitLSNLock. This is safe because procLatch is never freed, so
+		 * at worst we may set a latch for the wrong process or for no process
+		 * at all, which is harmless.
+		 */
+		for (int j = 0; j < numWakeUpProcs; j++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+		return;
+
+	wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		int			i;
+
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only
+		 * possible to get a false positive.  But that will be eliminated by a
+		 * recheck inside deleteLSNWaiter().
+		 */
+
+		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+		{
+			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+				deleteLSNWaiter((WaitLSNType) i);
+		}
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was reached.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in recovery,
+ * or if the replica got promoted before the target LSN was reached.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the waiters heap.  It might happen that target LSN
+	 * gets reached before we do.  The check at the beginning of the loop
+	 * below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, operation);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = -1;
+
+		if (operation == WAIT_LSN_TYPE_REPLAY)
+			currentLSN = GetXLogReplayRecPtr(NULL);
+		else
+			currentLSN = GetFlushRecPtr(NULL);
+
+		/* Check that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but check if target LSN was already
+			 * reached.
+			 */
+			deleteLSNWaiter(operation);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been reached */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					 errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN"));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory heap.  We might already be
+	 * deleted by the startup process.  The 'inHeap' flag prevents
+	 * double deletion.
+	 */
+	deleteLSNWaiter(operation);
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT		/* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
+	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
+	WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for an LSN.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Heap membership flags for LSN types */
+	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+
+	/* Heap nodes for LSN types */
+	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN values some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * Pairing heaps of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+								int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,7 +3265,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.51.0
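
The fast-path idea behind WaitLSNState (a cached minimum waited LSN plus waiters ordered by target LSN) can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: it is single-threaded, takes no locks, uses a small sorted array where the patch uses a pairing heap, and every name in it (Lsn, Waiter, WaitQueue, queue_init, queue_add, queue_wakeup) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
typedef uint64_t Lsn;

#define MAX_WAITERS 8
#define NO_WAITERS  UINT64_MAX	/* "nobody is waiting" marker */

typedef struct Waiter
{
	Lsn			target;			/* LSN this waiter wants to see replayed */
	bool		woken;			/* set once the target has been reached */
} Waiter;

typedef struct WaitQueue
{
	Lsn			min_waited;		/* smallest queued target, or NO_WAITERS */
	Waiter	   *waiters[MAX_WAITERS];	/* sorted by ascending target */
	size_t		count;
} WaitQueue;

static void
queue_init(WaitQueue *q)
{
	q->count = 0;
	q->min_waited = NO_WAITERS;
}

/* Insert a waiter, keeping the array sorted by ascending target LSN. */
static bool
queue_add(WaitQueue *q, Waiter *w, Lsn target)
{
	size_t		i;

	assert(q != NULL && w != NULL);
	if (q->count == MAX_WAITERS)
		return false;

	w->target = target;
	w->woken = false;

	/* Shift larger targets one slot to the right (insertion sort step). */
	for (i = q->count; i > 0 && q->waiters[i - 1]->target > target; i--)
		q->waiters[i] = q->waiters[i - 1];
	q->waiters[i] = w;
	q->count++;

	q->min_waited = q->waiters[0]->target;
	return true;
}

/*
 * Meant to be called after each replayed record.  Returns the number of
 * waiters whose target is at or below the current LSN.
 */
static size_t
queue_wakeup(WaitQueue *q, Lsn current)
{
	size_t		n = 0;
	size_t		i;

	/* Fast path: nothing to do until the smallest target is reached. */
	if (current < q->min_waited)
		return 0;

	/* Mark every waiter at the front whose target has been reached. */
	while (n < q->count && q->waiters[n]->target <= current)
		q->waiters[n++]->woken = true;

	/* Drop the woken waiters and refresh the cached minimum. */
	for (i = n; i < q->count; i++)
		q->waiters[i - n] = q->waiters[i];
	q->count -= n;
	q->min_waited = (q->count > 0) ? q->waiters[0]->target : NO_WAITERS;
	return n;
}
```

A caller would initialize the queue once with queue_init(), register each waiting process with queue_add(), and call queue_wakeup() with the latest replayed LSN after every record. The point of the cached minimum is that the common case, where no waiter's target has been reached yet, costs a single comparison.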

v18-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From f009c6a8bd305b50889366877cc7a8581fb40157 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 12:03:13 +0800
Subject: [PATCH v18 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed.  This is useful when the user makes some data
changes on the primary and needs a guarantee that these changes are
visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 302 ++++++++++++++++++++++
 20 files changed, 930 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the outcome is instead
+    reported in the returned status.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number of
+          milliseconds.  It can also be given as a string literal containing
+          either an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error is thrown on timeout or when
+          running on the primary.  In this case the result status can be obtained
+          from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the
+    last modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <parameter>lsn</parameter>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example with the <literal>NO_THROW</literal> option returns a
+   status instead of raising an error.
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for a target LSN to have
+ *	  been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char	   *timeout_str;
+			const char *hintmsg;
+			double		result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round into
+			 * range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+						   defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first make sure that
+	 * we don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why WAIT
+	 * FOR is a command, not a procedure or function.
+	 *
+	 * First, we make sure there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  That completes the
+	 * preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSN().  Throw an appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+	"get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+	"WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
-- 
2.51.0

#36Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#35)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Sun, Nov 2, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi, Alexander!

On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Oct 23, 2025 at 6:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Thank you for the grammar review and the clear recommendation.

On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:

1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';

and I don't see why you need two ways to specify an option list.

I agree with this as unnecessary choices are confusing.

So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.

So why not do just

+opt_wait_with_clause:
+           WITH '(' utility_option_list ')'        { $$ = $3; }
+           | /*EMPTY*/                             { $$ = NIL; }
+           ;

which keeps options 1 and 3 of the list above.

Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.

Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.

Yeah, since we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as a regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
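
To make the accepted forms concrete, here is a toy C sketch (not the bison grammar itself, and not part of any patch) that checks only the overall shape of the text following the LSN literal against the opt_wait_with_clause production quoted above. The function name toy_wait_tail_ok is hypothetical, and the option list is not actually parsed; the check merely requires something non-empty between the parentheses.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Skip leading whitespace. */
static const char *
skip_ws(const char *p)
{
    while (*p && isspace((unsigned char) *p))
        p++;
    return p;
}

/*
 * Toy shape check for the text after the LSN literal, mirroring
 *   opt_wait_with_clause: WITH '(' utility_option_list ')' | EMPTY
 * Hypothetical helper for illustration only.
 */
bool
toy_wait_tail_ok(const char *tail)
{
    static const char kw[] = "WITH";
    const char *p = skip_ws(tail);
    const char *close;

    /* EMPTY alternative: nothing but an optional terminating semicolon. */
    if (*p == '\0' || *p == ';')
        return true;

    /* Otherwise the WITH keyword is mandatory (case-insensitive). */
    for (int k = 0; k < 4; k++)
        if (toupper((unsigned char) p[k]) != kw[k])
            return false;
    p = skip_ws(p + 4);

    /* '(' must follow WITH, so "WITH TIME" and similar are rejected. */
    if (*p != '(')
        return false;

    /* The option list must not be empty and must be closed by ')'. */
    p = skip_ws(p + 1);
    if (*p == ')' || *p == '\0')
        return false;
    close = strchr(p, ')');
    if (close == NULL)
        return false;

    /* Only an optional semicolon may follow the closing parenthesis. */
    p = skip_ws(close + 1);
    return *p == '\0' || *p == ';';
}
```

With this shape, forms 1 and 3 from the list above pass, while form 2 (a parenthesized list without WITH) is rejected, which matches what the regression test expects as a syntax error.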

The filename of patch 1 was incorrect due to a copying mistake. I've just corrected it.

Thank you for rebasing the patch.

I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. I also ran pgindent and pgperltidy
and made other small improvements.
Please check.

Thanks for updating the patch set!
Patch 2 looks more elegant after the revision. I’ll review them soon.

I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.

Fixed a minor issue in v18: WaitStmt was mistakenly added to
pgindent/typedefs.list in patch 2, but it belongs in patch 3.

Best,
Xuneng

Attachments:

v19-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 65d6c1d497389925961738207422cd2bc69c95bd Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:59:42 +0800
Subject: [PATCH v19 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.
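
As a rough illustration of the fast-path idea described above (not the patch's actual code), the sketch below caches the smallest awaited LSN in a C11 atomic so that the waking side can return immediately when no waiter can be satisfied yet. The names ToyWaitState, toy_init, toy_add_waiter and toy_wakeup are hypothetical; a plain array with a linear scan stands in for the pairing heap, and the WaitLSN lock and latches are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_MAX_WAITERS 8
#define TOY_NO_WAITER UINT64_MAX    /* sentinel: nobody is waiting */

/* Hypothetical stand-in for the shared state; not the patch's WaitLSNState. */
typedef struct ToyWaitState
{
    _Atomic uint64_t min_waited;    /* cached smallest awaited LSN */
    uint64_t    waiting[TOY_MAX_WAITERS];
    int         nwaiting;
} ToyWaitState;

void
toy_init(ToyWaitState *st)
{
    atomic_init(&st->min_waited, TOY_NO_WAITER);
    st->nwaiting = 0;
}

/* Recompute the cached minimum from the current set of waiters. */
void
toy_update_min(ToyWaitState *st)
{
    uint64_t    min = TOY_NO_WAITER;

    for (int k = 0; k < st->nwaiting; k++)
        if (st->waiting[k] < min)
            min = st->waiting[k];
    atomic_store(&st->min_waited, min);
}

void
toy_add_waiter(ToyWaitState *st, uint64_t lsn)
{
    assert(st->nwaiting < TOY_MAX_WAITERS);
    st->waiting[st->nwaiting++] = lsn;
    toy_update_min(st);
}

/*
 * Release every waiter whose LSN is <= current_lsn and return how many were
 * released.  Returns -1 when the fast path applies and nothing was scanned.
 */
int
toy_wakeup(ToyWaitState *st, uint64_t current_lsn)
{
    int         released = 0;
    int         kept = 0;

    /* Fast path: even the smallest awaited LSN is still ahead of us. */
    if (atomic_load(&st->min_waited) > current_lsn)
        return -1;

    for (int k = 0; k < st->nwaiting; k++)
    {
        if (st->waiting[k] <= current_lsn)
            released++;
        else
            st->waiting[kept++] = st->waiting[k];
    }
    st->nwaiting = kept;
    toy_update_min(st);
    return released;
}
```

In this toy model, calling toy_wakeup() after every replayed or flushed record costs a single atomic read in the common case where no waiter's target has been reached, which is the property the cached minimum is meant to provide.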

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 409 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 |  98 +++++
 src/include/storage/lwlocklist.h              |   1 +
 src/tools/pgindent/typedefs.list              |   4 +
 8 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..1f4b38a5114
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the appropriate
+ *		process, standby is promoted, or the postmaster dies.  Then, it cleans
+ *		up the information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		int			i;
+
+		/* Initialize heaps and tracking */
+		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+		{
+			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+		}
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	int			i = (uintptr_t) arg;
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+		minWaitedLSN = procInfo->waitLSN;
+	}
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	Assert(!procInfo->inHeap[i]);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+	procInfo->inHeap[i] = true;
+	updateMinWaitedLSN(operation);
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (procInfo->inHeap[i])
+	{
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+		procInfo->inHeap[i] = false;
+		updateMinWaitedLSN(operation);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached. Record
+		 * process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+			/* Update appropriate flag */
+			procInfo->inHeap[i] = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * Since SetLatch() is a time-consuming operation, we do this outside
+		 * of WaitLSNLock. This is safe because procLatch is never freed, so
+		 * at worst we may set a latch for the wrong process or for no process
+		 * at all, which is harmless.
+		 */
+		for (int j = 0; j < numWakeUpProcs; j++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+	int			i = (int) operation;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+		return;
+
+	wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		int			i;
+
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only
+		 * possible to get a false positive.  But that will be eliminated by a
+		 * recheck inside deleteLSNWaiter().
+		 */
+
+		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+		{
+			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+				deleteLSNWaiter((WaitLSNType) i);
+		}
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was reached,
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout expired first, or
+ * WAIT_LSN_RESULT_NOT_IN_RECOVERY if recovery is not, or no longer, running.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the waiters heap.  It might happen that target LSN
+	 * gets reached before we do.  The check at the beginning of the loop
+	 * below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, operation);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = -1;
+
+		if (operation == WAIT_LSN_TYPE_REPLAY)
+			currentLSN = GetXLogReplayRecPtr(NULL);
+		else
+			currentLSN = GetFlushRecPtr(NULL);
+
+		/* Check that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but check whether the target LSN was
+			 * already reached.
+			 */
+			deleteLSNWaiter(operation);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been reached */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					 errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN"));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory heap.  We might already be
+	 * deleted by the startup process.  The 'inHeap' flag prevents double
+	 * deletion.
+	 */
+	deleteLSNWaiter(operation);
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT		/* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
+	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
+	WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Heap membership flags for LSN types */
+	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+
+	/* Heap nodes for LSN types */
+	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN values some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * Pairing heaps of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+								int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..237d33c538c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,6 +3265,10 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.51.0
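
For readers following the wakeup logic in the patch above, here is a minimal standalone sketch of the "collect under lock, wake outside lock, repeat while the batch is full" pattern. It is only an illustration under stated assumptions: ToyQueue, toy_wake(), toy_wakeup_waiters() and toy_demo() are hypothetical names, a sorted array stands in for the pairing heap, and the lock and SetLatch() are replaced by comments and a recording stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4			/* stand-in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define MAX_WAITERS 16

/* Toy waiter queue sorted by ascending LSN (stand-in for the pairing heap). */
typedef struct ToyQueue
{
	uint64_t	lsn[MAX_WAITERS];
	int			id[MAX_WAITERS];
	int			head;			/* first waiter not yet removed */
	int			count;			/* total number of entries */
} ToyQueue;

static int	woken[MAX_WAITERS]; /* ids "woken" so far, in wake order */
static int	num_woken = 0;

/* Hypothetical stand-in for SetLatch(): just records the id. */
static void
toy_wake(int id)
{
	woken[num_woken++] = id;
}

/* Wake every waiter whose LSN is <= current_lsn, BATCH_SIZE at a time. */
static void
toy_wakeup_waiters(ToyQueue *q, uint64_t current_lsn)
{
	int			batch[BATCH_SIZE];
	int			nbatch;

	do
	{
		nbatch = 0;

		/* The real code would acquire the lock here. */
		while (q->head < q->count && q->lsn[q->head] <= current_lsn)
		{
			batch[nbatch++] = q->id[q->head++];
			if (nbatch == BATCH_SIZE)
				break;
		}
		/* The real code would release the lock here. */

		/* Wake outside the lock, using a loop variable of its own. */
		for (int j = 0; j < nbatch; j++)
			toy_wake(batch[j]);

		/* A full batch means more waiters may remain, so go again. */
	} while (nbatch == BATCH_SIZE);
}

/* Build n waiters with LSNs 10, 20, 30, ... and wake those <= current_lsn. */
static int
toy_demo(int n, uint64_t current_lsn)
{
	ToyQueue	q = {0};

	num_woken = 0;
	for (int k = 0; k < n; k++)
	{
		q.lsn[k] = (uint64_t) (k + 1) * 10;
		q.id[k] = k;
	}
	q.count = n;
	toy_wakeup_waiters(&q, current_lsn);
	return num_woken;
}
```

Calling toy_demo(10, 70) drains the seven eligible waiters in two passes (four, then three). The fixed-size batch keeps stack usage bounded no matter how many waiters are queued, at the cost of re-taking the lock once per full batch.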

v19-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From b0ee110622dacd2d4769da6915580e9c3220c09f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v19 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
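
To illustrate why an in-place initializer helps, here is a small standalone sketch of the allocate-versus-initialize split. It assumes nothing about PostgreSQL internals: ToyHeap, toy_heap_initialize(), toy_heap_allocate() and ToySharedState are hypothetical names, malloc() stands in for palloc(), and an ordinary struct stands in for a shared memory segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*toy_comparator) (const void *a, const void *b, void *arg);

/* Toy heap header: comparator, its argument, and an (initially empty) root. */
typedef struct ToyHeap
{
	toy_comparator compare;
	void	   *arg;
	void	   *root;
} ToyHeap;

/* Initialize a heap header that the caller has already placed somewhere. */
static void
toy_heap_initialize(ToyHeap *heap, toy_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
static ToyHeap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	ToyHeap    *heap = malloc(sizeof(ToyHeap));

	if (heap == NULL)
		return NULL;
	toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* Stand-in for a struct that would live in shared memory. */
typedef struct ToySharedState
{
	ToyHeap		heaps[2];
} ToySharedState;

/* Example comparator on ints; arg is unused. */
static int
toy_cmp_int(const void *a, const void *b, void *arg)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	(void) arg;
	return (x > y) - (x < y);
}

/* Embedded heaps cannot be allocated separately, so initialize them in place. */
static void
toy_shared_state_init(ToySharedState *state)
{
	for (int i = 0; i < 2; i++)
		toy_heap_initialize(&state->heaps[i], toy_cmp_int, NULL);
}
```

The allocating variant becomes a thin wrapper around the initializer, which is the same shape as the refactoring in the patch: one code path sets the fields, and the caller decides where the structure lives.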

v19-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From b686e6126ac9eb5b54a2782ac8be3539454da49a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 3 Nov 2025 09:57:30 +0800
Subject: [PATCH v19 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 302 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   1 +
 21 files changed, 931 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <literal>NO_THROW</literal> is specified, the outcome is instead
+    reported through the returned status.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds, or as a string literal containing either an
+          integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of timeout or
+          running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on standby.
+    That is, after this command execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency, while using async replica for reads and
+    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on standby.
+    If a user runs <command>WAIT FOR</command> on primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that, the
+   changes made on primary are guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery ended before the target LSN was reached.
+	 */
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char	   *timeout_str;
+			const char *hintmsg;
+			double		result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round into
+			 * range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+						   defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
+	 * command, not a procedure or function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT  column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up waiting for LSN, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far ahead, so that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+	"get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+	"WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs a message if at least as many rows are
+# visible as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 237d33c538c..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3270,6 +3270,7 @@ WaitLSNState
 WaitLSNProcInfo
 WaitLSNResult
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.51.0

#37Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#36)
3 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hello, Xuneng!

On Mon, Nov 3, 2025 at 4:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sun, Nov 2, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.

Fix a minor issue in v18: WaitStmt was mistakenly added to
pgindent/typedefs.list in patch 2, but it should belong to patch 3.

Thank you. I also made some minor changes to 0002 renaming
"operation" => "lsnType".

I'd like to give this subject another chance for pg19. I'm going to
push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase
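
For readers skimming the thread, the fast-path check described in the 0002 commit message below can be sketched roughly as follows. This is only an illustration under simplifying assumptions: a fixed array with a linear scan stands in for the pairing heap, there is no locking or latch handling, and every name here (WaitState, add_waiter, wake_reached, NO_WAITERS) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

#define MAX_WAITERS 8
#define NO_WAITERS UINT64_MAX	/* cached value when nobody is waiting */

/* Hypothetical, simplified shared state; the patch uses a pairing heap. */
typedef struct
{
	_Atomic Lsn min_waited;		/* cached least awaited LSN */
	Lsn			target[MAX_WAITERS];
	bool		waiting[MAX_WAITERS];
	bool		woken[MAX_WAITERS];
} WaitState;

static void
wait_state_init(WaitState *st)
{
	for (size_t i = 0; i < MAX_WAITERS; i++)
	{
		st->target[i] = 0;
		st->waiting[i] = false;
		st->woken[i] = false;
	}
	atomic_init(&st->min_waited, NO_WAITERS);
}

/* Recompute the cached minimum from the remaining waiters. */
static void
recompute_min(WaitState *st)
{
	Lsn			min = NO_WAITERS;

	for (size_t i = 0; i < MAX_WAITERS; i++)
		if (st->waiting[i] && st->target[i] < min)
			min = st->target[i];
	atomic_store(&st->min_waited, min);
}

/* A waiter publishes the LSN it needs. */
static void
add_waiter(WaitState *st, size_t slot, Lsn target)
{
	assert(slot < MAX_WAITERS);
	st->target[slot] = target;
	st->waiting[slot] = true;
	st->woken[slot] = false;
	recompute_min(st);
}

/* Called after each replayed record; returns the number of waiters woken. */
static int
wake_reached(WaitState *st, Lsn replayed)
{
	int			n = 0;

	/* Fast path: one atomic read, no scan, when no target is reached. */
	if (replayed < atomic_load(&st->min_waited))
		return 0;

	for (size_t i = 0; i < MAX_WAITERS; i++)
	{
		if (st->waiting[i] && st->target[i] <= replayed)
		{
			st->waiting[i] = false;
			st->woken[i] = true;	/* the patch sets the waiter's latch */
			n++;
		}
	}
	recompute_min(st);
	return n;
}
```

The point of caching the minimum is that the replay loop pays only for a single atomic read per record in the common case where no waiter's target has been reached yet.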

Attachments:

v20-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 27d57234c169c6612e432bb5ff19acac2c5982d9 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 3 Nov 2025 13:31:13 +0200
Subject: [PATCH v20 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 409 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 |  98 +++++
 src/include/storage/lwlocklist.h              |   1 +
 src/tools/pgindent/typedefs.list              |   4 +
 8 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..e04567cfd67
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the appropriate
+ *		process, standby is promoted, or the postmaster dies.  Then, it cleans
+ *		information about itself in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedLSN > replayLSN.  If this
+ *		check fails, it scans waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		int			i;
+
+		/* Initialize heaps and tracking */
+		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+		{
+			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+		}
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	int			i = (uintptr_t) arg;
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified LSN type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType lsnType)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	int			i = (int) lsnType;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+		minWaitedLSN = procInfo->waitLSN;
+	}
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on LSN type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) lsnType;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	Assert(!procInfo->inHeap[i]);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+	procInfo->inHeap[i] = true;
+	updateMinWaitedLSN(lsnType);
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on LSN type
+ */
+static void
+deleteLSNWaiter(WaitLSNType lsnType)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+	int			i = (int) lsnType;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (procInfo->inHeap[i])
+	{
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+		procInfo->inHeap[i] = false;
+		updateMinWaitedLSN(lsnType);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack.  It should be enough to take single iteration for most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
+{
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+	int			i = (int) lsnType;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached. Record
+		 * process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+			/* Update appropriate flag */
+			procInfo->inHeap[i] = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(lsnType);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have been reached.
+		 * Since SetLatch() is a time-consuming operation, we do this outside
+		 * of WaitLSNLock. This is safe because procLatch is never freed, so
+		 * at worst we may set a latch for the wrong process or for no process
+		 * at all, which is harmless.
+		 */
+		for (int j = 0; j < numWakeUpProcs; j++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN)
+{
+	int			i = (int) lsnType;
+
+	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+		return;
+
+	wakeupWaiters(lsnType, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		int			i;
+
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only
+		 * possible to get a false positive.  But that will be eliminated by a
+		 * recheck inside deleteLSNWaiter().
+		 */
+
+		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+		{
+			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+				deleteLSNWaiter((WaitLSNType) i);
+		}
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was reached.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in recovery, or if the
+ * replica was promoted before the target LSN was reached.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the waiters heap.  It might happen that target LSN
+	 * gets reached before we do.  The check at the beginning of the loop
+	 * below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, lsnType);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = -1;
+
+		if (lsnType == WAIT_LSN_TYPE_REPLAY)
+			currentLSN = GetXLogReplayRecPtr(NULL);
+		else
+			currentLSN = GetFlushRecPtr(NULL);
+
+		/* Check that recovery is still in-progress (replay waits only) */
+		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but check if target LSN was already
+			 * reached.
+			 */
+			deleteLSNWaiter(lsnType);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been reached */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					errcode(ERRCODE_ADMIN_SHUTDOWN),
+					errmsg("terminating connection due to unexpected postmaster exit"),
+					errcontext("while waiting for LSN"));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory heap.  We might already have
+	 * been deleted by the process that woke us.  The 'inHeap' flag prevents
+	 * double deletion.
+	 */
+	deleteLSNWaiter(lsnType);
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..4dc328b1b07
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+	WAIT_LSN_RESULT_TIMEOUT		/* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
+	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
+	WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Heap membership flags for LSN types */
+	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+
+	/* Heap nodes for LSN types */
+	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN values some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * Pairing heaps of waiting processes ordered by LSN values (least LSN
+	 * is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
+								int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..237d33c538c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,6 +3265,10 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.5 (Apple Git-154)
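
For readers skimming the patch, the fast-path idea in xlogwait.c can be sketched outside PostgreSQL roughly as follows. This is only an illustration under loud assumptions: every Toy*/toy_* name is hypothetical, an unsorted array stands in for the pairing heap, C11 atomics stand in for pg_atomic_uint64, and there is no WaitLSNLock or latch signalling at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8			/* hypothetical fixed number of slots */
#define NO_WAITERS UINT64_MAX	/* same sentinel idea as PG_UINT64_MAX */

/* Toy stand-in for the shared state (not the patch's WaitLSNState). */
typedef struct ToyWaitState
{
	_Atomic uint64_t min_waited;	/* cached minimum awaited LSN */
	uint64_t	waiting[MAX_WAITERS];	/* awaited LSN per slot */
	int			in_use[MAX_WAITERS];	/* 1 if the slot holds a waiter */
	int			scans;			/* how often the slow path ran */
} ToyWaitState;

static void
toy_init(ToyWaitState *s)
{
	atomic_init(&s->min_waited, NO_WAITERS);
	for (int i = 0; i < MAX_WAITERS; i++)
		s->in_use[i] = 0;
	s->scans = 0;
}

/* Recompute the cached minimum from the current set of waiters. */
static void
toy_update_min(ToyWaitState *s)
{
	uint64_t	min = NO_WAITERS;

	for (int i = 0; i < MAX_WAITERS; i++)
		if (s->in_use[i] && s->waiting[i] < min)
			min = s->waiting[i];
	atomic_store(&s->min_waited, min);
}

/* Register a waiter in the given slot and refresh the cached minimum. */
static void
toy_add(ToyWaitState *s, int slot, uint64_t lsn)
{
	assert(slot >= 0 && slot < MAX_WAITERS && !s->in_use[slot]);
	s->waiting[slot] = lsn;
	s->in_use[slot] = 1;
	toy_update_min(s);
}

/*
 * Remove every waiter whose LSN is <= current and return how many were
 * removed.  The cached minimum lets the caller skip the scan entirely
 * when no waiter can be satisfied yet.
 */
static int
toy_wakeup(ToyWaitState *s, uint64_t current)
{
	int			woken = 0;

	/* Fast path: nobody waits for an LSN this small. */
	if (atomic_load(&s->min_waited) > current)
		return 0;

	s->scans++;
	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (s->in_use[i] && s->waiting[i] <= current)
		{
			s->in_use[i] = 0;
			woken++;
		}
	}
	toy_update_min(s);
	return woken;
}
```

The point of the cached minimum is that the process advancing the LSN normally pays only one atomic read per call; the scan (a heap walk under the lock in the real patch) happens only when at least one waiter can actually be released.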

v20-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From 697d5aa28add566198bd1bccce5625bc35e1ea5a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v20 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.39.5 (Apple Git-154)
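
The allocate-versus-initialize split in this patch follows a common C pattern, which can be sketched in isolation as below. This is a hedged illustration, not the PostgreSQL code: the ToyHeap type and the toy_* functions are hypothetical names, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*toy_comparator) (const void *a, const void *b, void *arg);

/* Hypothetical heap header, loosely mirroring the fields set by the patch. */
typedef struct ToyHeap
{
	toy_comparator compare;
	void	   *arg;
	void	   *root;
} ToyHeap;

/* Initialize caller-provided storage; performs no allocation. */
static void
toy_heap_initialize(ToyHeap *heap, toy_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
static ToyHeap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	ToyHeap    *heap = malloc(sizeof(ToyHeap));

	if (heap != NULL)
		toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* A struct that embeds the heap, as a shared-memory struct would. */
typedef struct ToyShared
{
	int			nwaiters;
	ToyHeap		heap;
} ToyShared;

/* Example comparator over ints; ignores the extra argument. */
static int
toy_int_cmp(const void *a, const void *b, void *arg)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	(void) arg;
	return (x > y) - (x < y);
}
```

With this split, a caller that owns the storage (for example a struct placed in shared memory) calls only the initializer, while local users keep the allocating convenience wrapper.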

v20-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 180a09f5d264b6b8ebd0db034ea89413751b23cd Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 3 Nov 2025 13:32:47 +0200
Subject: [PATCH v20 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 doc/src/sgml/high-availability.sgml       |  54 ++++
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +++++++++++++++++
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212 +++++++++++++++
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 302 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   1 +
 21 files changed, 931 insertions(+), 11 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds.  It can also be given as a string literal containing
+          either an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies that no error is thrown in the case of a timeout or
+          when running on the primary.  In this case, the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+    to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an async replica for reads and
+    the primary for writes.  In that case, the <acronym>LSN</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command
+    returns immediately.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then the application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char	   *timeout_str;
+			const char *hintmsg;
+			double		result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round into
+			 * range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+						   defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
+	 * command, not a procedure or function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  With that, the
+	 * preparation is done.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSN().  Throw an appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			WITH '(' utility_option_list ')'		{ $$ = $3; }
+			| /*EMPTY*/								{ $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on the standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+	"get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+	stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+	"get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+	"WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs whether at least as many rows are visible
+# as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 237d33c538c..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3270,6 +3270,7 @@ WaitLSNState
 WaitLSNProcInfo
 WaitLSNResult
 WaitPMResult
+WaitStmt
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.39.5 (Apple Git-154)

#38Álvaro Herrera
alvherre@kurilemu.de
In reply to: Alexander Korotkov (#37)
Re: Implement waiting for wal lsn replay: reloaded

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#39Andres Freund
andres@anarazel.de
In reply to: Álvaro Herrera (#38)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

Greetings,

Andres Freund

#40Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#39)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0001-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patchapplication/octet-stream; name=v1-0001-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patchDownload
From bb12721dc3efbd213416adb8c3563bb0c11c023b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 5 Nov 2025 11:10:04 +0200
Subject: [PATCH v1] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 27 ++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..85b5d9863cd 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,8 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,8 +3342,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
 		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
@@ -3360,6 +3360,29 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when appropriate
+	if (($mode eq 'replay') && ($isrecovery eq 't'))
+	{
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' WITH (timeout '${PostgreSQL::Test::Utils::timeout_default}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+	${output});
+			diag qq(Last pg_stat_replication contents:
+	${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.39.5 (Apple Git-154)

#41Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#40)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Thanks for pushing them!

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.

Best,
Xuneng

#42Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#41)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, Nov 5, 2025 at 4:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Thanks for pushing them!

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.

I think we might end up extending the WaitLSNType enum. However, I
dislike the inHeap and heapNode arrays in WaitLSNProcInfo growing, as
they are allocated per process. I found that we could optimize the
WaitLSNProcInfo struct by turning them into simple variables, because a
single process can wait for only a single LSN at a time. Please check
the attached patch.
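
To make the per-process saving concrete, here is a rough, self-contained sketch that compares the two layouts. The three-pointer pairingheap_node stand-in, the int-based ProcNumber and WaitLSNType, and a WAIT_LSN_TYPE_COUNT of 4 are assumptions chosen for illustration, not values taken from the tree; the names ProcInfoArrays, ProcInfoScalar and proc_info_bytes_saved are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL types; assumed shapes, for illustration only. */
typedef uint64_t XLogRecPtr;
typedef int ProcNumber;
typedef int WaitLSNType;

typedef struct pairingheap_node
{
	struct pairingheap_node *first_child;
	struct pairingheap_node *next_sibling;
	struct pairingheap_node *prev_or_parent;
} pairingheap_node;

/* Hypothetical number of LSN types, e.g. after the enum has been extended. */
#define WAIT_LSN_TYPE_COUNT 4

/* Layout before the patch: one flag and one heap node per LSN type. */
typedef struct ProcInfoArrays
{
	XLogRecPtr	waitLSN;
	ProcNumber	procno;
	bool		inHeap[WAIT_LSN_TYPE_COUNT];
	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
} ProcInfoArrays;

/* Layout after the patch: a single flag and node, plus the LSN type. */
typedef struct ProcInfoScalar
{
	XLogRecPtr	waitLSN;
	WaitLSNType lsnType;
	ProcNumber	procno;
	bool		inHeap;
	pairingheap_node heapNode;
} ProcInfoScalar;

/* Bytes saved per process slot by switching to the scalar layout. */
size_t
proc_info_bytes_saved(void)
{
	return sizeof(ProcInfoArrays) - sizeof(ProcInfoScalar);
}
```

Under these assumptions the saving grows with every LSN type added to the enum, since each extra type costs one more heap node per process slot in the array layout and nothing in the scalar one.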

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0001-Optimize-shared-memory-usage-for-WaitLSNProcInfo.patchapplication/octet-stream; name=v1-0001-Optimize-shared-memory-usage-for-WaitLSNProcInfo.patchDownload
From 89fde94dd74810d2bf349af33b7ca9585080c0f6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 7 Nov 2025 23:49:47 +0200
Subject: [PATCH v1] Optimize shared memory usage for WaitLSNProcInfo

We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time.  However, one process
can wait only for one type of LSN at a time.  So, there is no need for the
inHeap and heapNode fields to be arrays.
---
 src/backend/access/transam/xlogwait.c | 40 ++++++++++++---------------
 src/include/access/xlogwait.h         |  7 +++--
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 34fa41ed9b2..e1eb21be125 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -90,7 +90,7 @@ WaitLSNShmemInit(void)
 		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
 		{
 			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
-			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, NULL);
 		}
 
 		/* Initialize process info array */
@@ -106,9 +106,8 @@ WaitLSNShmemInit(void)
 static int
 waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	int			i = (uintptr_t) arg;
-	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
-	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, b);
 
 	if (aproc->waitLSN < bproc->waitLSN)
 		return 1;
@@ -132,7 +131,7 @@ updateMinWaitedLSN(WaitLSNType lsnType)
 	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
 	{
 		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
-		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
 
 		minWaitedLSN = procInfo->waitLSN;
 	}
@@ -154,10 +153,11 @@ addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
 
 	procInfo->procno = MyProcNumber;
 	procInfo->waitLSN = lsn;
+	procInfo->lsnType = lsnType;
 
-	Assert(!procInfo->inHeap[i]);
-	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
-	procInfo->inHeap[i] = true;
+	Assert(!procInfo->inHeap);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+	procInfo->inHeap = true;
 	updateMinWaitedLSN(lsnType);
 
 	LWLockRelease(WaitLSNLock);
@@ -176,10 +176,10 @@ deleteLSNWaiter(WaitLSNType lsnType)
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	if (procInfo->inHeap[i])
+	if (procInfo->inHeap)
 	{
-		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
-		procInfo->inHeap[i] = false;
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+		procInfo->inHeap = false;
 		updateMinWaitedLSN(lsnType);
 	}
 
@@ -228,7 +228,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
 			WaitLSNProcInfo *procInfo;
 
 			/* Get procInfo using appropriate heap node */
-			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
 
 			if (XLogRecPtrIsValid(currentLSN) && procInfo->waitLSN > currentLSN)
 				break;
@@ -238,7 +238,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
 			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
 
 			/* Update appropriate flag */
-			procInfo->inHeap[i] = false;
+			procInfo->inHeap = false;
 
 			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
 				break;
@@ -285,20 +285,14 @@ WaitLSNCleanup(void)
 {
 	if (waitLSNState)
 	{
-		int			i;
-
 		/*
-		 * We do a fast-path check of the heap flags without the lock.  These
-		 * flags are set to true only by the process itself.  So, it's only
+		 * We do a fast-path check of the inHeap flag without the lock.  This
+		 * flag is set to true only by the process itself.  So, it's only
 		 * possible to get a false positive.  But that will be eliminated by a
 		 * recheck inside deleteLSNWaiter().
 		 */
-
-		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
-		{
-			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
-				deleteLSNWaiter((WaitLSNType) i);
-		}
+		if (waitLSNState->procInfos[MyProcNumber].inHeap)
+			deleteLSNWaiter(waitLSNState->procInfos[MyProcNumber].lsnType);
 	}
 }
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 4dc328b1b07..46bac74988b 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -50,14 +50,17 @@ typedef struct WaitLSNProcInfo
 	/* LSN, which this process is waiting for */
 	XLogRecPtr	waitLSN;
 
+	/* The type of LSN to wait */
+	WaitLSNType lsnType;
+
 	/* Process to wake up once the waitLSN is reached */
 	ProcNumber	procno;
 
 	/* Heap membership flags for LSN types */
-	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+	bool		inHeap;
 
 	/* Heap nodes for LSN types */
-	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+	pairingheap_node heapNode;
 } WaitLSNProcInfo;
 
 /*
-- 
2.39.5 (Apple Git-154)

#43Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#40)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

------
Regards,
Alexander Korotkov
Supabase

I tested the patch using make check-world, and it worked well. I also
made a few adjustments:

- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other modes) to complete in such cases? If not, and the
TAP test is the only intended use, this approach might be a bit of
overkill.
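
As an aside on the chomp() adjustments, the sketch below shows why normalizing the output before a string comparison matters. It is written in C rather than Perl, and it assumes, purely hypothetically, that the command output may carry a trailing newline; chomp_newline and output_matches are illustrative names, not functions from the tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Remove one trailing newline in place, similar to Perl's chomp on "\n". */
void
chomp_newline(char *s)
{
	size_t		len = strlen(s);

	if (len > 0 && s[len - 1] == '\n')
		s[len - 1] = '\0';
}

/* Hypothetical helper: compare raw command output with an expected token. */
bool
output_matches(const char *raw, const char *expected)
{
	char		buf[64];

	/* Work on a copy so the caller's string is untouched; long input is truncated. */
	strncpy(buf, raw, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';
	chomp_newline(buf);
	return strcmp(buf, expected) == 0;
}
```

Without the normalization step, a value such as "t\n" would not compare equal to "t", and the WAIT FOR LSN branch would be silently skipped in favor of polling.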

--
Best,
Xuneng

Attachments:

v2-0001-Use-WAIT-FOR-LSN-in.patchapplication/octet-stream; name=v2-0001-Use-WAIT-FOR-LSN-in.patchDownload
From e24a00603080d476087b8e327284d849f72d86a8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 12 Nov 2025 13:32:05 +0800
Subject: [PATCH v2] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 32 +++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..41784553d4b 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,32 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when appropriate
+	if (($mode eq 'replay') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for other modes or when WAIT FOR LSN is not applicable
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#44Tomas Vondra
tomas@vondra.me
In reply to: Alexander Korotkov (#40)
Re: Implement waiting for wal lsn replay: reloaded

On 11/5/25 10:51, Alexander Korotkov wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Hi,

The new TAP test 049_wait_for_lsn.pl introduced by this commit takes a
long time - about 65 seconds on my laptop. That's about 25% of the
whole src/test/recovery, more than any other test.

And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:

2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060

2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress

So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.

While at it, I noticed a couple of comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.

regards

--
Tomas Vondra

#45Xuneng Zhou
xunengzhou@gmail.com
In reply to: Tomas Vondra (#44)
2 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi Tomas,

On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/5/25 10:51, Alexander Korotkov wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Hi,

The new TAP test 049_wait_for_lsn.pl introduced by this commit takes a
long time - about 65 seconds on my laptop. That's about 25% of the
whole src/test/recovery, more than any other test.

And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:

2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060

2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress

So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.

Thanks for looking into this!

I did a quick analysis of this prolonged wait:

In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:

    /* Fast path check */
    if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
        return; // Issue: Returns early when currentLSN = 0

When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone

When it happened:
During standby promotion, xlog.c:6246 calls:

    WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.

If the above analysis is sound, the fix could look like this:

Proposed fix:
Add a validity check before the comparison:

    /*
     * Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
     * "wake all waiters" (e.g., during promotion when recovery ends).
     */
    if (XLogRecPtrIsValid(currentLSN) &&
        pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
        return;

Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)
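
The difference between the two checks can be reproduced in isolation. The following is a minimal sketch, not the code from xlogwait.c: XLogRecPtr is modeled as a plain uint64_t, the atomic read is replaced by an ordinary parameter, and the function names fast_path_skips_old and fast_path_skips_fixed are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, for illustration only. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Old fast path: returns true when the wakeup is skipped.  With
 * currentLSN == InvalidXLogRecPtr (0) any nonzero minWaitedLSN is
 * greater, so the "wake all waiters" request is skipped as well.
 */
bool
fast_path_skips_old(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return minWaitedLSN > currentLSN;
}

/*
 * Fixed fast path: an invalid currentLSN never takes the early return,
 * so the caller always proceeds to wake the waiters.
 */
bool
fast_path_skips_fixed(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return currentLSN != InvalidXLogRecPtr && minWaitedLSN > currentLSN;
}
```

With the example value from the analysis (minWaitedLSN = 0x570CC048), the old check skips the promotion-time wakeup while the fixed one does not; for a valid currentLSN both behave the same.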

While at it, I noticed a couple of comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.

Please check the attached patch for replacing them.

--
Best,
Xuneng

Attachments:

v1-0001-Fix-incorrect-function-name-in-comments.patchapplication/octet-stream; name=v1-0001-Fix-incorrect-function-name-in-comments.patchDownload
From ce6227035eab97b6a67d97fd58e88dc1392a47c7 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 14 Nov 2025 09:39:31 +0800
Subject: [PATCH v1] Fix incorrect function name in comments

Update comments to reference WaitForLSN() instead of the outdated
WaitForLSNReplay() function name.
---
 src/backend/commands/wait.c   | 2 +-
 src/include/access/xlogwait.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 67068a92dbf..9c4764cf896 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -143,7 +143,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
 
 	/*
-	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * Process the result of WaitForLSN().  Throw appropriate error if
 	 * needed.
 	 */
 	switch (waitLSNResult)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 4dc328b1b07..f43e481c3b9 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -20,7 +20,7 @@
 #include "tcop/dest.h"
 
 /*
- * Result statuses for WaitForLSNReplay().
+ * Result statuses for WaitForLSN().
  */
 typedef enum
 {
-- 
2.51.0

v1-0001-Fix-WaitLSNWakeup-fast-path-check-for-InvalidXLog.patchapplication/octet-stream; name=v1-0001-Fix-WaitLSNWakeup-fast-path-check-for-InvalidXLog.patchDownload
From de673ec025074cd95ad4a4e53e2c26fcc14d5a4a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 14 Nov 2025 09:34:03 +0800
Subject: [PATCH v1] Fix WaitLSNWakeup() fast-path check for InvalidXLogRecPtr

WaitLSNWakeup() incorrectly returned early when called with
InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path
check compared minWaitedLSN > 0 without validating currentLSN first.
This caused WAIT FOR LSN commands to wait indefinitely during standby
promotion until random signals woke them.

Add XLogRecPtrIsValid() check before the comparison so InvalidXLogRecPtr
bypasses the fast-path and wakes all waiters immediately.
---
 src/backend/access/transam/xlogwait.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 34fa41ed9b2..78de93db47f 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -270,8 +270,12 @@ WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN)
 
 	Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
 
-	/* Fast path check */
-	if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+	/*
+	 * Fast path check.  Skip if currentLSN is InvalidXLogRecPtr, which means
+	 * "wake all waiters" (e.g., during promotion when recovery ends).
+	 */
+	if (XLogRecPtrIsValid(currentLSN) &&
+		pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
 		return;
 
 	wakeupWaiters(lsnType, currentLSN);
-- 
2.51.0

#46Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#45)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Fri, Nov 14, 2025 at 3:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/5/25 10:51, Alexander Korotkov wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Hi,

The new TAP test 049_wait_for_lsn.pl introduced by this commit takes a
long time - about 65 seconds on my laptop. That's about 25% of the
whole src/test/recovery, more than any other test.

And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:

2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060

2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress

So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.

Thanks for looking into this!

I did a quick analysis for this prolonged waiting:

In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:
/* Fast path check */
if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return; // Issue: Returns early when currentLSN = 0

When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone

When It Happened
During standby promotion, xlog.c:6246 calls:

WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.

If the above analysis is sound, the fix could look like this:

Proposed fix:
Added a validity check before the comparison:
/*
* Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
* "wake all waiters" (e.g., during promotion when recovery ends).
*/
if (XLogRecPtrIsValid(currentLSN) &&
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return;

Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)
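
For illustration, the two versions of the fast-path check can be compared
in isolation. This is only a sketch: the typedef and macros are stand-ins
for the PostgreSQL definitions, and skip_wakeup_old(), skip_wakeup_fixed()
and check_fast_path() are hypothetical names that do not exist in
xlogwait.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsValid(r) ((r) != InvalidXLogRecPtr)

/* Old fast path: true means "return early, wake nobody". */
bool
skip_wakeup_old(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return minWaitedLSN > currentLSN;
}

/* Fixed fast path: an invalid currentLSN never takes the early return. */
bool
skip_wakeup_fixed(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return XLogRecPtrIsValid(currentLSN) && minWaitedLSN > currentLSN;
}

/* Self-check using the waiter LSN from the analysis above. */
void
check_fast_path(void)
{
	XLogRecPtr	minWaited = 0x570CC048;

	/* Promotion passes InvalidXLogRecPtr: old code skips, fixed code wakes. */
	assert(skip_wakeup_old(minWaited, InvalidXLogRecPtr));
	assert(!skip_wakeup_fixed(minWaited, InvalidXLogRecPtr));

	/* For a valid currentLSN both versions agree. */
	assert(skip_wakeup_old(minWaited, 0x10) == skip_wakeup_fixed(minWaited, 0x10));
	assert(!skip_wakeup_fixed(minWaited, 0x60000000));
}
```

Calling check_fast_path() from a small main() exercises both cases; only
the InvalidXLogRecPtr case differs between the two versions.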

While at it, I noticed a couple of comments refer to WaitForLSNReplay, but
I think that got renamed simply to WaitForLSN.

Please check the attached patch for replacing them.

Thank you so much for your patches!
Pushed with minor corrections.

------
Regards,
Alexander Korotkov
Supabase

#47Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#42)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

On Sat, Nov 8, 2025 at 12:02 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Nov 5, 2025 at 4:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Thanks for pushing them!

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.

I think we might end up extending WaitLSNType enum. However, I hate
inHeap and heapNode arrays growing in WaitLSNProcInfo as they are
allocated per process. I found that we could optimize WaitLSNProcInfo
struct turning them into simple variables because a single process can
wait only for a single LSN at a time. Please, check the attached
patch.

Here is the updated patch integrating minor corrections provided by
Xuneng Zhou off-list. I'm going to push this if no objections.
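
To give a rough feel for the per-process saving, here is a standalone
sketch of the two layouts. All types are simplified stand-ins rather than
the real headers, DEMO_WAIT_LSN_TYPE_COUNT = 3 is an assumed value picked
only for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL types; simplified, not the real headers. */
typedef uint64_t XLogRecPtr;
typedef int ProcNumber;
typedef int WaitLSNType;

typedef struct pairingheap_node
{
	struct pairingheap_node *first_child;
	struct pairingheap_node *next_sibling;
	struct pairingheap_node *prev_or_parent;
} pairingheap_node;

/* Assumed number of wait types, chosen only for illustration. */
#define DEMO_WAIT_LSN_TYPE_COUNT 3

/* Layout before the patch: one flag and one heap node per wait type. */
typedef struct OldProcInfo
{
	XLogRecPtr	waitLSN;
	ProcNumber	procno;
	bool		inHeap[DEMO_WAIT_LSN_TYPE_COUNT];
	pairingheap_node heapNode[DEMO_WAIT_LSN_TYPE_COUNT];
} OldProcInfo;

/* Layout after the patch: a single flag and node, plus the wait type. */
typedef struct NewProcInfo
{
	XLogRecPtr	waitLSN;
	WaitLSNType lsnType;
	ProcNumber	procno;
	bool		inHeap;
	pairingheap_node heapNode;
} NewProcInfo;

size_t
old_proc_info_size(void)
{
	return sizeof(OldProcInfo);
}

size_t
new_proc_info_size(void)
{
	return sizeof(NewProcInfo);
}

/* Shared memory saved for an array of nprocs entries. */
size_t
shmem_saved(size_t nprocs)
{
	assert(sizeof(NewProcInfo) < sizeof(OldProcInfo));
	return nprocs * (sizeof(OldProcInfo) - sizeof(NewProcInfo));
}
```

The old layout grows with every added wait type, while the new one stays
constant; that is the point of doing this before extending WaitLSNType.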

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v3-0001-Optimize-shared-memory-usage-for-WaitLSNProcInfo.patch (application/octet-stream)
From 09c5f97fac0b14d82d2108d4b31777c7a639608e Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sun, 16 Nov 2025 14:06:50 +0200
Subject: [PATCH v3] Optimize shared memory usage for WaitLSNProcInfo

We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time.  However, one process
can wait only for one type of LSN at a time.  So, no need for inHeap
and heapNode fields to be arrays.

Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/access/transam/xlogwait.c | 42 ++++++++++++---------------
 src/include/access/xlogwait.h         | 14 ++++++---
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 78de93db47f..98aa5f1e4a2 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -90,7 +90,7 @@ WaitLSNShmemInit(void)
 		for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
 		{
 			pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
-			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+			pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, NULL);
 		}
 
 		/* Initialize process info array */
@@ -106,9 +106,8 @@ WaitLSNShmemInit(void)
 static int
 waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	int			i = (uintptr_t) arg;
-	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
-	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, b);
 
 	if (aproc->waitLSN < bproc->waitLSN)
 		return 1;
@@ -132,7 +131,7 @@ updateMinWaitedLSN(WaitLSNType lsnType)
 	if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
 	{
 		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
-		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
 
 		minWaitedLSN = procInfo->waitLSN;
 	}
@@ -154,10 +153,11 @@ addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
 
 	procInfo->procno = MyProcNumber;
 	procInfo->waitLSN = lsn;
+	procInfo->lsnType = lsnType;
 
-	Assert(!procInfo->inHeap[i]);
-	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
-	procInfo->inHeap[i] = true;
+	Assert(!procInfo->inHeap);
+	pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+	procInfo->inHeap = true;
 	updateMinWaitedLSN(lsnType);
 
 	LWLockRelease(WaitLSNLock);
@@ -176,10 +176,12 @@ deleteLSNWaiter(WaitLSNType lsnType)
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	if (procInfo->inHeap[i])
+	Assert(procInfo->lsnType == lsnType);
+
+	if (procInfo->inHeap)
 	{
-		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
-		procInfo->inHeap[i] = false;
+		pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+		procInfo->inHeap = false;
 		updateMinWaitedLSN(lsnType);
 	}
 
@@ -228,7 +230,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
 			WaitLSNProcInfo *procInfo;
 
 			/* Get procInfo using appropriate heap node */
-			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+			procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
 
 			if (XLogRecPtrIsValid(currentLSN) && procInfo->waitLSN > currentLSN)
 				break;
@@ -238,7 +240,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
 			(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
 
 			/* Update appropriate flag */
-			procInfo->inHeap[i] = false;
+			procInfo->inHeap = false;
 
 			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
 				break;
@@ -289,20 +291,14 @@ WaitLSNCleanup(void)
 {
 	if (waitLSNState)
 	{
-		int			i;
-
 		/*
-		 * We do a fast-path check of the heap flags without the lock.  These
-		 * flags are set to true only by the process itself.  So, it's only
+		 * We do a fast-path check of the inHeap flag without the lock.  This
+		 * flag is set to true only by the process itself.  So, it's only
 		 * possible to get a false positive.  But that will be eliminated by a
 		 * recheck inside deleteLSNWaiter().
 		 */
-
-		for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
-		{
-			if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
-				deleteLSNWaiter((WaitLSNType) i);
-		}
+		if (waitLSNState->procInfos[MyProcNumber].inHeap)
+			deleteLSNWaiter(waitLSNState->procInfos[MyProcNumber].lsnType);
 	}
 }
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index f43e481c3b9..e607441d618 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -50,14 +50,20 @@ typedef struct WaitLSNProcInfo
 	/* LSN, which this process is waiting for */
 	XLogRecPtr	waitLSN;
 
+	/* The type of LSN to wait */
+	WaitLSNType lsnType;
+
 	/* Process to wake up once the waitLSN is reached */
 	ProcNumber	procno;
 
-	/* Heap membership flags for LSN types */
-	bool		inHeap[WAIT_LSN_TYPE_COUNT];
+	/*
+	 * Heap membership flag.  A process can wait for only one LSN type at a
+	 * time, so a single flag suffices (tracked by the lsnType field).
+	 */
+	bool		inHeap;
 
-	/* Heap nodes for LSN types */
-	pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+	/* Pairing heap node for the waiters' heap (one per process) */
+	pairingheap_node heapNode;
 } WaitLSNProcInfo;
 
 /*
-- 
2.39.5 (Apple Git-154)

#48Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#43)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, Nov 12, 2025 at 9:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

I tested the patch using make check-world, and it worked well. I also
made a few adjustments:

- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush(or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

#49Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#46)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

On Sat, Nov 15, 2025 at 6:29 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Fri, Nov 14, 2025 at 3:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:

On 11/5/25 10:51, Alexander Korotkov wrote:

Hi!

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Hi,

The new TAP test 049_wait_for_lsn.pl introduced by this commit takes a
long time - about 65 seconds on my laptop. That's about 25%
of the whole src/test/recovery, more than any other test.

And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:

2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060

2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress

So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.

Thanks for looking into this!

I did a quick analysis for this prolonged waiting:

In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:
/* Fast path check */
if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return; // Issue: Returns early when currentLSN = 0

When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone

When It Happened
During standby promotion, xlog.c:6246 calls:

WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.

If the above analysis is sound, the fix could look like this:

Proposed fix:
Added a validity check before the comparison:
/*
* Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
* "wake all waiters" (e.g., during promotion when recovery ends).
*/
if (XLogRecPtrIsValid(currentLSN) &&
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return;

Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)

While at it, I noticed a couple of comments refer to WaitForLSNReplay, but
I think that got renamed simply to WaitForLSN.

Please check the attached patch for replacing them.

Thank you so much for your patches!
Pushed with minor corrections.

Thanks for pushing! It appears I should be running pgindent more regularly :).

--
Best,
Xuneng

#50Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#47)
Re: Implement waiting for wal lsn replay: reloaded

Hi!

On Sun, Nov 16, 2025 at 8:09 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Nov 8, 2025 at 12:02 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Nov 5, 2025 at 4:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

Thanks for pushing them!

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.

I think we might end up extending WaitLSNType enum. However, I hate
inHeap and heapNode arrays growing in WaitLSNProcInfo as they are
allocated per process. I found that we could optimize WaitLSNProcInfo
struct turning them into simple variables because a single process can
wait only for a single LSN at a time. Please, check the attached
patch.

Here is the updated patch integrating minor corrections provided by
Xuneng Zhou off-list. I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Supabase

LGTM. Thanks.

--
Best,
Xuneng

#51Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#48)
Re: Implement waiting for wal lsn replay: reloaded

Hi!

On Sun, Nov 16, 2025 at 8:37 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Nov 12, 2025 at 9:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

I tested the patch using make check-world, and it worked well. I also
made a few adjustments:

- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush(or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

#52Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#49)
Re: Implement waiting for wal lsn replay: reloaded

On Sun, Nov 16, 2025 at 3:25 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sat, Nov 15, 2025 at 6:29 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Thank you so much for your patches!
Pushed with minor corrections.

Thanks for pushing! It appears I should be running pgindent more regularly :).

Thank you. pgindent is not a problem for me, because I run it anyway
every time before pushing a patch. But yes, if you make it a habit to
run pgindent every time before publishing a patch, your patches would be
cleaner.

------
Regards,
Alexander Korotkov
Supabase

#53Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#51)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander, Hackers,

On Sun, Nov 16, 2025 at 10:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

On Sun, Nov 16, 2025 at 8:37 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Nov 12, 2025 at 9:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:

On 2025-Nov-03, Alexander Korotkov wrote:

I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.

Sure. I don't understand why patches 0002 and 0003 are separate though.

FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.

Thank you for the feedback, pushed.

One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...

The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.

I tested the patch using make check-world, and it worked well. I also
made a few adjustments:

- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush(or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
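
Whichever syntax is chosen, the option value eventually has to be mapped
to a wait type, with replay as the default when no mode is given. A
minimal sketch of that mapping follows; DemoWaitMode and
demo_parse_wait_mode() are hypothetical names, not part of any patch in
this thread, and the comparison is case-sensitive only to keep the
example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum for illustration; not the real WaitLSNType. */
typedef enum DemoWaitMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoWaitMode;

/*
 * Map a mode string to DemoWaitMode.  A NULL value means the option was
 * omitted, in which case replay is used as the default.  Returns false
 * for an unrecognized value and leaves *mode untouched.
 */
bool
demo_parse_wait_mode(const char *value, DemoWaitMode *mode)
{
	assert(mode != NULL);

	if (value == NULL || strcmp(value, "replay") == 0)
		*mode = DEMO_MODE_REPLAY;
	else if (strcmp(value, "write") == 0)
		*mode = DEMO_MODE_WRITE;
	else if (strcmp(value, "flush") == 0)
		*mode = DEMO_MODE_FLUSH;
	else
		return false;

	return true;
}
```

With the keyword variants of Option B the grammar would produce the enum
value directly, so a string lookup like this would only be needed for the
WITH clause or the MODE 'string' forms.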

--
Best,
Xuneng

#54Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#53)
5 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush(or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

Two modes added: flush and write

Design decisions:

1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.

Feedback welcome.

--
Best,
Xuneng

Attachments:

v1-0002-Add-pg_last_wal_write_lsn-SQL-function.patch (application/octet-stream)
From 7227bca84a9233fb2d7c130294511d48d8458e2f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v1 2/5] Add pg_last_wal_write_lsn() SQL function

Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
 doc/src/sgml/func/func-admin.sgml      | 19 +++++++++++++++++++
 src/backend/access/transam/xlogfuncs.c | 19 +++++++++++++++++++
 src/include/catalog/pg_proc.dat        |  4 ++++
 3 files changed, 42 insertions(+)

diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..ed4e77d12ba 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,25 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_last_wal_write_lsn</primary>
+        </indexterm>
+        <function>pg_last_wal_write_lsn</function> ()
+        <returnvalue>pg_lsn</returnvalue>
+       </para>
+       <para>
+        Returns the last write-ahead log location that has been received and
+        written to disk by streaming replication, but not necessarily synced.
+        While streaming replication is in progress this will increase
+        monotonically. If recovery has completed then this will remain static
+        at the location of the last WAL record written during recovery. If
+        streaming replication is disabled, or if it has not yet started, the
+        function returns <literal>NULL</literal>.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..46cd4a7ce2f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,25 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * written to disk by walreceiver, but not necessarily synced/flushed.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	recptr;
+
+	recptr = GetWalRcvWriteRecPtr();
+
+	if (!XLogRecPtrIsValid(recptr))
+		PG_RETURN_NULL();
+
+	PG_RETURN_LSN(recptr);
+}
+
 /*
  * Report the last WAL replay location (same format as pg_backup_start etc)
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..fcb674c05b3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
   prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'current wal write location',
+  proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => '',
+  prosrc => 'pg_last_wal_write_lsn' },
 { oid => '3821', descr => 'last wal replay location',
   proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.51.0

v1-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From ca10a52bd7a835b2873268236a4553fc911e2de3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v1 1/5] Extend xlogwait infrastructure with write and flush 
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlogwait.c         | 79 ++++++++++++++-----
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 |  7 +-
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..86709e0df63 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,43 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_FLUSH_STANDBY:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+			return GetFlushRecPtr(NULL);
+
+		default:
+			elog(ERROR, "invalid LSN wait type: %d", lsnType);
+			return InvalidXLogRecPtr;	/* keep compiler quiet */
+	}
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +384,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +417,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..64a2fb02eac 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -36,8 +36,10 @@ typedef enum
 typedef enum WaitLSNType
 {
 	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
-	WAIT_LSN_TYPE_COUNT = 2
+	WAIT_LSN_TYPE_FLUSH_STANDBY = 1,	/* Waiting for flush on standby */
+	WAIT_LSN_TYPE_WRITE = 2,	/* Waiting for write on standby */
+	WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,	/* Waiting for flush on primary */
+	WAIT_LSN_TYPE_COUNT = 4
 } WaitLSNType;
 
 /*
@@ -96,6 +98,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
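The header comment in xlogwait.c above describes a per-type cached minimum (minWaitedLSN) that lets the waking process skip the heap entirely when nobody's target has been reached. A rough standalone sketch of that idea is below. Everything prefixed sk_ or SK_ is made up for illustration, the UINT64_MAX "no waiter" sentinel is an assumption, and a linear scan over a small array stands in for the pairing heap the patch actually uses; there are no latches, locks, or atomics here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr. */
typedef uint64_t XLogRecPtr;

/* Mirrors the shape of WaitLSNType in the patch; names are illustrative. */
typedef enum SketchWaitType
{
	SK_REPLAY = 0,
	SK_FLUSH_STANDBY,
	SK_WRITE,
	SK_FLUSH_PRIMARY,
	SK_TYPE_COUNT
} SketchWaitType;

#define SK_MAX_WAITERS 8
#define SK_NO_WAITER UINT64_MAX	/* assumed sentinel: nobody is waiting */

typedef struct SketchWaiter
{
	XLogRecPtr	target;
	bool		in_use;
} SketchWaiter;

static SketchWaiter sk_waiters[SK_TYPE_COUNT][SK_MAX_WAITERS];
static XLogRecPtr sk_min_waited[SK_TYPE_COUNT] = {
	SK_NO_WAITER, SK_NO_WAITER, SK_NO_WAITER, SK_NO_WAITER
};

/* Recompute the cached minimum for one type by scanning its waiters. */
static void
sk_update_min(SketchWaitType type)
{
	XLogRecPtr	min = SK_NO_WAITER;

	for (int i = 0; i < SK_MAX_WAITERS; i++)
		if (sk_waiters[type][i].in_use && sk_waiters[type][i].target < min)
			min = sk_waiters[type][i].target;
	sk_min_waited[type] = min;
}

/* Register a waiter for the given type; returns false if no slot is free. */
static bool
sk_add_waiter(SketchWaitType type, XLogRecPtr target)
{
	for (int i = 0; i < SK_MAX_WAITERS; i++)
	{
		if (!sk_waiters[type][i].in_use)
		{
			sk_waiters[type][i].target = target;
			sk_waiters[type][i].in_use = true;
			sk_update_min(type);
			return true;
		}
	}
	return false;
}

/* Release waiters whose target is reached; returns how many were released. */
static int
sk_wakeup(SketchWaitType type, XLogRecPtr current)
{
	int			woken = 0;

	/* Fast path: even the smallest awaited LSN is still ahead of us. */
	if (sk_min_waited[type] > current)
		return 0;

	for (int i = 0; i < SK_MAX_WAITERS; i++)
	{
		if (sk_waiters[type][i].in_use &&
			sk_waiters[type][i].target <= current)
		{
			sk_waiters[type][i].in_use = false; /* real code sets a latch */
			woken++;
		}
	}
	sk_update_min(type);
	return woken;
}
```

Because each wait type has its own minimum, a WRITE wakeup never has to look at REPLAY waiters, which is the point of keeping one heap per WaitLSNType.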

Attachment: v1-0005-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 6229917d4802a82bb63ac41ec32a7ca357701c67 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v1 5/5] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with the WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication because WAIT FOR uses a latch-based wakeup
mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
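The branch added to wait_for_catchup() above boils down to one decision: use WAIT FOR LSN only when the node is in recovery and the mode is replay, write, or flush, otherwise keep polling. The sketch below restates that decision and the query text in C, purely as an illustration. sk_build_wait_query is a hypothetical helper, not part of the Perl test framework, and unlike the Perl code it simply returns false for an unknown mode instead of croaking.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Fill buf with a WAIT FOR LSN query and return true when the fast path
 * applies.  Return false when the caller should fall back to polling
 * ('sent' mode, not in recovery, unknown mode) or when buf is too small.
 */
static bool
sk_build_wait_query(char *buf, size_t buflen, const char *mode,
					bool in_recovery, const char *target_lsn, int timeout_s)
{
	char		upper[16];
	size_t		i;
	int			n;

	if (!in_recovery)
		return false;
	if (strcmp(mode, "replay") != 0 &&
		strcmp(mode, "write") != 0 &&
		strcmp(mode, "flush") != 0)
		return false;			/* covers 'sent' and anything unknown */

	/* The patch upper-cases the mode name to get the MODE keyword. */
	for (i = 0; mode[i] != '\0' && i < sizeof(upper) - 1; i++)
		upper[i] = (char) toupper((unsigned char) mode[i]);
	upper[i] = '\0';

	n = snprintf(buf, buflen,
				 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
				 target_lsn, upper, timeout_s);
	return n >= 0 && (size_t) n < buflen;
}
```

With no_throw in the option list, the caller is expected to compare the returned status against 'success', as the Perl code does, rather than rely on an error being raised.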

Attachment: v1-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 071f67d1fae98e397c071dce0b9993b3be0c0e9f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v1 3/5] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 184 ++++++++++++++++------
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  59 +++++--
 src/backend/parser/gram.y               |  21 ++-
 src/backend/replication/walreceiver.c   |  19 +++
 src/include/nodes/parsenodes.h          |  16 ++
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 201 +++++++++++++++++++++---
 8 files changed, 422 insertions(+), 86 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..efd851149c0 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,53 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and flushed to durable storage on the replica. This
+         provides a durability guarantee without waiting for the WAL
+         to be applied.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and written to the operating system on the replica.
+         This is faster than <literal>FLUSH</literal> but provides weaker
+         durability guarantees since the data may still be in OS buffers.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +185,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +201,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +236,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +259,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +272,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..a4c7a7c2b38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..73876ca5c7c 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,29 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_REPLAY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+	[WAIT_LSN_TYPE_WRITE] = {"write", "written"},
+};
+
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -41,6 +59,16 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
 
+	/*
+	 * Convert parse-time WaitLSNMode to runtime WaitLSNType. Values are
+	 * designed to match, so a simple cast is safe.
+	 */
+	lsnType = (WaitLSNType) stmt->mode;
+
+	/* Validate mode value (should never fail if grammar is correct) */
+	Assert(lsnType >= WAIT_LSN_TYPE_REPLAY &&
+		   lsnType < WAIT_LSN_TYPE_FLUSH_PRIMARY);
+
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
@@ -107,8 +135,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  First, make sure that we don't hold
+	 * a snapshot and, correspondingly, that our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +168,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +182,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +201,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16489,15 +16490,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..818049599ed 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..68dc49dc2da 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ *
+ * These values are defined to match WaitLSNType in access/xlogwait.h
+ * for efficient conversion without overhead. The values must be kept
+ * in sync with WaitLSNType.
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY = 0,	/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_FLUSH = 1,	/* Wait for LSN flush to disk on standby */
+	WAIT_LSN_MODE_WRITE = 2		/* Wait for LSN write to WAL buffers on
+								 * standby */
+} WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..e579b98f019 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -62,7 +62,40 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');");
+
+# Verify via pg_stat_replication that standby reported the write
+my $standby_write_lsn = $node_primary->safe_psql(
+	'postgres', qq[
+	SELECT write_lsn FROM pg_stat_replication
+	WHERE application_name = 'standby';
+]);
+
+ok( $node_primary->safe_psql('postgres',
+		"SELECT '${standby_write_lsn}'::pg_lsn >= '${lsn_write}'::pg_lsn") eq
+	  't',
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +121,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +158,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +241,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -239,7 +272,7 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
@@ -251,23 +284,138 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+my @write_sessions;
+my @write_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+	]);
+}
+
+# Wait for all WAIT FOR LSN commands to complete
+for (my $i = 0; $i < 3; $i++)
+{
+	$write_sessions[$i]->{run}->finish;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[2]'::pg_lsn);");
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+my @flush_sessions;
+my @flush_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+	]);
+}
+
+# Wait for all WAIT FOR LSN commands to complete
+for (my $i = 0; $i < 3; $i++)
+{
+	$flush_sessions[$i]->{run}->finish;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[2]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  When all complete, we verify that the replay LSN
+# (the slowest to advance due to recovery_min_apply_delay) has reached the
+# target.  Since REPLAY waiters block until replay completes, and WRITE/FLUSH
+# complete earlier, successful completion of all sessions proves proper
+# coordination.
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+	]);
+}
+
+# Resume replay so REPLAY waiters can complete
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+
+# Wait for all sessions to complete - this blocks until WAIT FOR LSN returns
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i]->{run}->finish;
+}
+
+# Verify: if all waiters completed, then the slowest (REPLAY) must have
+# reached the target LSN, which implies WRITE and FLUSH also succeeded
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"mixed mode waiters: all modes completed, replay reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +425,23 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
+
+# Wait for at least one "recovery is not in progress" error to appear
 $node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Verify all three sessions got the error by checking the log contains
+# the error message at least three times (from the promotion point)
+my $log_contents = slurp_file($node_standby->logfile, $log_offset);
+my $error_count = () = $log_contents =~ /recovery is not in progress/g;
+ok($error_count >= 3, 'promotion interrupted all wait modes');
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +449,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

v1-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 1bb41bdd83b37f9ef7237095a368ea21e589d262 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v1 4/5] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the above test is correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

#55Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#54)
5 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi hackers,

On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be overkill.

I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine a clustering
solution using it to wait for a certain transaction to be flushed at
the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch set that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

Two new modes are added: FLUSH and WRITE.

Design decisions:

1. MODE as a separate keyword (not in the WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`.
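
To make decisions 2 and 3 concrete, here is a minimal standalone sketch of how an optional bare keyword could map to a mode, with REPLAY as the fallback. It is not the grammar code from the patch (that lives in gram.y); the `Sketch*` and `sketch_*` names are hypothetical, and the function assumes the lexer has already produced either a bare keyword or nothing.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three wait modes discussed above. */
typedef enum SketchMode
{
	SKETCH_MODE_REPLAY,
	SKETCH_MODE_FLUSH,
	SKETCH_MODE_WRITE
} SketchMode;

/* Case-insensitive comparison, since SQL keywords ignore case. */
static bool
sketch_keyword_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;			/* true only if both strings ended */
}

/*
 * Map an optional mode keyword to a mode.  A NULL keyword models the
 * omitted MODE clause and yields REPLAY.  Returns false for an
 * unrecognized keyword, leaving *out untouched.
 */
static bool
sketch_parse_mode(const char *keyword, SketchMode *out)
{
	assert(out != NULL);

	if (keyword == NULL)
	{
		*out = SKETCH_MODE_REPLAY;
		return true;
	}
	if (sketch_keyword_equals(keyword, "replay"))
		*out = SKETCH_MODE_REPLAY;
	else if (sketch_keyword_equals(keyword, "flush"))
		*out = SKETCH_MODE_FLUSH;
	else if (sketch_keyword_equals(keyword, "write"))
		*out = SKETCH_MODE_WRITE;
	else
		return false;
	return true;
}
```

Because the default lives in the "keyword omitted" branch, existing commands without MODE keep their current replay semantics.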

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
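
The wakeup calls in walreceiver follow a cheap "is anyone waiting for this?" check: the LSN just reached is compared with the smallest LSN any waiter of that type has registered, and the waiter list is walked only when the comparison succeeds. Below is a simplified standalone sketch of that check using C11 atomics. The `Sketch*` and `sketch_*` names are hypothetical, and the use of UINT64_MAX to mean "no waiters" is an assumption of this sketch rather than something stated in the patch description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SketchLSN;		/* stand-in for XLogRecPtr */

/* Hypothetical wait types: one slot per kind of LSN a backend can wait on. */
typedef enum SketchWaitType
{
	SKETCH_WAIT_REPLAY = 0,
	SKETCH_WAIT_WRITE,
	SKETCH_WAIT_FLUSH,
	SKETCH_WAIT_TYPE_COUNT
} SketchWaitType;

/* Assumption: this sentinel means "nobody is waiting on this type". */
#define SKETCH_NO_WAITERS UINT64_MAX

/* Smallest LSN currently waited for, tracked separately per wait type. */
typedef struct SketchWaitState
{
	_Atomic uint64_t min_waited_lsn[SKETCH_WAIT_TYPE_COUNT];
} SketchWaitState;

static void
sketch_wait_state_init(SketchWaitState *state)
{
	for (int i = 0; i < SKETCH_WAIT_TYPE_COUNT; i++)
		atomic_init(&state->min_waited_lsn[i], SKETCH_NO_WAITERS);
}

/*
 * Fast-path check run by the process that advances an LSN.  Returns true
 * when at least one waiter of the given type may now be satisfied, so the
 * caller should walk the waiter heap and set latches.
 */
static bool
sketch_needs_wakeup(SketchWaitState *state, SketchWaitType type,
					SketchLSN reached)
{
	assert(type < SKETCH_WAIT_TYPE_COUNT);

	/* Shared state may not be set up yet; nothing to wake in that case. */
	if (state == NULL)
		return false;

	return reached >= atomic_load(&state->min_waited_lsn[type]);
}
```

Keeping one minimum per wait type means a WRITE advance never has to look at FLUSH or REPLAY waiters.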

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
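
The decision described for 0005 can be sketched as a small helper: use WAIT FOR LSN only when the node is in recovery and the mode is not 'sent', otherwise fall back to polling. The actual change is Perl inside Cluster.pm; the C function below only mirrors that logic, and `sketch_build_wait_query` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the WAIT FOR LSN command for a catchup wait.  Returns false when
 * the caller should keep polling instead: the node is not in recovery,
 * the mode is 'sent', or the command does not fit into buf.
 */
static bool
sketch_build_wait_query(char *buf, size_t buflen, const char *mode,
						bool in_recovery, const char *target_lsn,
						int timeout_secs)
{
	char		upper[16];
	size_t		len;
	int			n;

	assert(buf != NULL && mode != NULL && target_lsn != NULL);

	/* WAIT FOR LSN has no counterpart for 'sent' and needs recovery. */
	if (!in_recovery || strcmp(mode, "sent") == 0)
		return false;

	len = strlen(mode);
	if (len >= sizeof(upper))
		return false;

	/* The MODE keyword is written in upper case; copy the terminator too. */
	for (size_t i = 0; i <= len; i++)
		upper[i] = (char) toupper((unsigned char) mode[i]);

	n = snprintf(buf, buflen,
				 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
				 target_lsn, upper, timeout_secs);

	return n >= 0 && (size_t) n < buflen;
}
```

With no_throw, the command reports a status string instead of raising an error, so the caller can compare the result with 'success' and print diagnostics on any other status.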

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.

Feedback welcome.

Here is the updated v2 patch set. Most of the updates are in patch 3.

Changes from v1:

Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)

Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description

Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to report their errors after promotion
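
The switch-based conversion mentioned for patch 3 can be sketched as follows. This is a standalone illustration and not the code from the patch: the `Sketch*` enums are hypothetical stand-ins for WaitLSNMode and WaitLSNType, and their numeric values are deliberately ordered differently to show the kind of mismatch a direct cast would silently get wrong.

```c
#include <assert.h>

/* Hypothetical parser-level mode, as written by the user after MODE. */
typedef enum SketchWaitLSNMode
{
	SKETCH_MODE_REPLAY = 0,
	SKETCH_MODE_FLUSH = 1,
	SKETCH_MODE_WRITE = 2
} SketchWaitLSNMode;

/* Hypothetical internal wait type; note WRITE and FLUSH are swapped. */
typedef enum SketchWaitLSNType
{
	SKETCH_TYPE_REPLAY_STANDBY = 0,
	SKETCH_TYPE_WRITE_STANDBY = 1,
	SKETCH_TYPE_FLUSH_STANDBY = 2
} SketchWaitLSNType;

/*
 * Convert by name rather than by numeric value, so the two enums can be
 * reordered or extended independently.  Omitting a default label lets
 * compilers that check enum switches warn about an unhandled mode.
 */
static SketchWaitLSNType
sketch_mode_to_type(SketchWaitLSNMode mode)
{
	switch (mode)
	{
		case SKETCH_MODE_REPLAY:
			return SKETCH_TYPE_REPLAY_STANDBY;
		case SKETCH_MODE_FLUSH:
			return SKETCH_TYPE_FLUSH_STANDBY;
		case SKETCH_MODE_WRITE:
			return SKETCH_TYPE_WRITE_STANDBY;
	}

	/* Reached only if mode holds a value outside the enum. */
	assert(!"unrecognized wait mode");
	return SKETCH_TYPE_REPLAY_STANDBY;
}
```

The cost is a few extra lines compared with a cast, in exchange for not having to keep two enum definitions in sync.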

--
Best,
Xuneng

Attachments:

v2-0005-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 02b633402db35770fd70ace6c1e6301f3dd6741b Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v2 5/5] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v2-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 7fcaab3d495ccc42c3f9731d1de9a15c33c01ee8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v2 4/5] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the above test is correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v2-0002-Add-pg_last_wal_write_lsn-SQL-function.patch (application/octet-stream)
From 9d22e09d378e8f6c52aa95bc4a0e1650f4621a39 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v2 2/5] Add pg_last_wal_write_lsn() SQL function

Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
 doc/src/sgml/func/func-admin.sgml      | 22 ++++++++++++++++++++++
 src/backend/access/transam/xlogfuncs.c | 20 ++++++++++++++++++++
 src/include/catalog/pg_proc.dat        |  4 ++++
 3 files changed, 46 insertions(+)

diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..9ff196c4be4 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,28 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_last_wal_write_lsn</primary>
+        </indexterm>
+        <function>pg_last_wal_write_lsn</function> ()
+        <returnvalue>pg_lsn</returnvalue>
+       </para>
+       <para>
+        Returns the last write-ahead log location that has been received and
+        passed to the operating system by streaming replication, but not
+        necessarily synced to durable storage.  This location typically
+        advances sooner than <function>pg_last_wal_receive_lsn</function>, but
+        is less durable, since the data may still be in OS buffers.
+        While streaming replication is in progress this will increase
+        monotonically. If recovery has completed then this will remain static
+        at the location of the last WAL record written during recovery. If
+        streaming replication is disabled, or if it has not yet started, the
+        function returns <literal>NULL</literal>.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..2797b2bf158 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,26 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * passed to the operating system by walreceiver.  Unlike pg_last_wal_receive_lsn,
+ * this data may still be in OS buffers and not yet synced to durable storage.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	recptr;
+
+	recptr = GetWalRcvWriteRecPtr();
+
+	if (!XLogRecPtrIsValid(recptr))
+		PG_RETURN_NULL();
+
+	PG_RETURN_LSN(recptr);
+}
+
 /*
  * Report the last WAL replay location (same format as pg_backup_start etc)
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..478e0a8139f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
   prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'last wal write location on standby',
+  proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => '',
+  prosrc => 'pg_last_wal_write_lsn' },
 { oid => '3821', descr => 'last wal replay location',
   proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.51.0
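
For readers less familiar with LSN handling, here is a minimal standalone sketch of the two conventions the function above relies on: an invalid (zero) LSN is reported as "no value", and a valid one is shown as two 32-bit halves. It uses the %X/%08X form that appears in the error messages of the WAIT FOR patch later in this series. DemoLSN, DEMO_INVALID_LSN and demo_format_lsn are hypothetical names for illustration, not PostgreSQL APIs.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for XLogRecPtr and InvalidXLogRecPtr. */
typedef uint64_t DemoLSN;
#define DEMO_INVALID_LSN ((DemoLSN) 0)

/*
 * Format lsn into buf as "high/low" hex.  Returns false, leaving buf
 * untouched, when the LSN is invalid; a SQL-level function would return
 * NULL in that case.
 */
static bool
demo_format_lsn(DemoLSN lsn, char *buf, size_t buflen)
{
	if (lsn == DEMO_INVALID_LSN)
		return false;

	/* Upper and lower 32 bits, the lower half zero-padded to 8 digits. */
	snprintf(buf, buflen, "%" PRIX32 "/%08" PRIX32,
			 (uint32_t) (lsn >> 32), (uint32_t) lsn);
	return true;
}
```

Note that with this format the LSN written as 0/306EE20 in the documentation examples would be printed as 0/0306EE20.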

v2-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v2 1/5] Extend xlogwait infrastructure with write and flush 
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Rename enum values for consistency (WAIT_LSN_TYPE_REPLAY → WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 84 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 13 ++-
 6 files changed, 81 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, the standby is promoted, or the
+ *		postmaster dies.  It then removes its entry from shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  false, it scans waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs have been reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_REPLAY_STANDBY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_WRITE_STANDBY:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_FLUSH_STANDBY:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+			return GetFlushRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_COUNT:
+			break;
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
-	WAIT_LSN_TYPE_COUNT = 2
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+	WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+	WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+	WAIT_LSN_TYPE_COUNT = 4
 } WaitLSNType;
 
 /*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
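
The wakeup scheme described in the xlogwait.c header comment above (one set of waiters per wait type, plus a cached minimum used as a fast-path check) can be sketched in standalone C roughly as follows. This is a simplified illustration, not the patch's code: a fixed-size array with a linear scan replaces the pairing heap, there is no shared memory, locking, or latch, and all Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_WAITERS 4
#define DEMO_NO_WAITERS UINT64_MAX	/* cached minimum when nobody waits */

typedef enum
{
	DEMO_REPLAY, DEMO_WRITE, DEMO_FLUSH, DEMO_TYPE_COUNT
} DemoType;

typedef struct
{
	uint64_t	min_waited[DEMO_TYPE_COUNT];	/* cached least awaited LSN */
	uint64_t	target[DEMO_TYPE_COUNT][DEMO_MAX_WAITERS];
	bool		waiting[DEMO_TYPE_COUNT][DEMO_MAX_WAITERS];
} DemoWaitState;

/* Recompute the cached minimum for one wait type (linear scan, no heap). */
static void
demo_recompute_min(DemoWaitState *st, DemoType t)
{
	uint64_t	min = DEMO_NO_WAITERS;

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
		if (st->waiting[t][i] && st->target[t][i] < min)
			min = st->target[t][i];
	st->min_waited[t] = min;
}

static void
demo_init(DemoWaitState *st)
{
	for (int t = 0; t < DEMO_TYPE_COUNT; t++)
	{
		for (int i = 0; i < DEMO_MAX_WAITERS; i++)
		{
			st->waiting[t][i] = false;
			st->target[t][i] = 0;
		}
		st->min_waited[t] = DEMO_NO_WAITERS;
	}
}

/* A waiter publishes the LSN it needs for the given wait type. */
static void
demo_add_waiter(DemoWaitState *st, DemoType t, int slot, uint64_t target)
{
	st->target[t][slot] = target;
	st->waiting[t][slot] = true;
	demo_recompute_min(st, t);
}

/* Called by the process that advanced the LSN; returns waiters released. */
static int
demo_wakeup(DemoWaitState *st, DemoType t, uint64_t current)
{
	int			woken = 0;

	/* Fast path: the least awaited LSN is still ahead of us. */
	if (st->min_waited[t] > current)
		return 0;

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (st->waiting[t][i] && st->target[t][i] <= current)
		{
			st->waiting[t][i] = false;	/* the real code sets a latch */
			woken++;
		}
	}
	demo_recompute_min(st, t);
	return woken;
}
```

Keeping one minimum per wait type means the process advancing replay never has to look at write or flush waiters, and the common "nobody is waiting for this yet" case costs a single comparison.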

v2-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 1367c1f3322b93190fcd4ca70ab309efd8556c77 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v2] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed up to the specified LSN
- WRITE: Wait for WAL to be written (received) up to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk up to the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 188 +++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  64 ++++-
 src/backend/parser/gram.y               |  21 +-
 src/backend/replication/walreceiver.c   |  19 ++
 src/include/nodes/parsenodes.h          |  11 +
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 299 ++++++++++++++++++++++--
 8 files changed, 523 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..a5e7f6c6fe9 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,57 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and synced to durable storage via <function>fsync()</function>.
+         This provides a durability guarantee without waiting for the WAL
+         to be applied. After successful completion,
+         <function>pg_last_wal_receive_lsn()</function> will return a value
+         greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and passed to the operating system via <function>write()</function>.
+         This is faster than <literal>FLUSH</literal> but provides weaker
+         durability guarantees since the data may still be in OS buffers.
+         After successful completion, <function>pg_last_wal_write_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>LSN</acronym> of
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>LSN</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +263,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +276,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,28 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
 
+	/* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+	switch (stmt->mode)
+	{
+		case WAIT_LSN_MODE_REPLAY:
+			lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+			break;
+		case WAIT_LSN_MODE_WRITE:
+			lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+			break;
+		case WAIT_LSN_MODE_FLUSH:
+			lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+			break;
+		default:
+			elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+	}
+
 	foreach_node(DefElem, defel, stmt->options)
 	{
 		if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  First make sure that we don't hold a
+	 * snapshot and, correspondingly, that our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16489,15 +16490,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY,		/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_WRITE,		/* Wait for LSN write on standby */
+	WAIT_LSN_MODE_FLUSH			/* Wait for LSN flush on standby */
+}			WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..6c9a463775b 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -62,7 +62,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +115,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +152,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +235,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -239,7 +266,7 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
@@ -251,23 +278,239 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+my $orig_conninfo = $node_standby->safe_psql('postgres',
+	"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 3 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[2]'::pg_lsn);");
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 3 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 3; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[2]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+my $orig_conninfo_7d = $node_standby->safe_psql('postgres',
+	"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+$node_standby->safe_psql(
+	'postgres', qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo_7d';
+	SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+	"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +520,24 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +545,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

#56Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#55)
5 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi hackers,

On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine a clustering
solution using it to wait for a certain transaction to be flushed at
the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

——
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
——

Two modes are added: flush and write.

Design decisions:

1. MODE as a separate keyword (not in the WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`.

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to the WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements the existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
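
To illustrate the wakeup calls described for 0001, here is a minimal
standalone sketch of the check the walreceiver performs after advancing
its write or flush position. It is not the patch code: it uses C11
atomics in place of pg_atomic_*, and WaitLSNStateSketch,
wait_state_init() and needs_wakeup() are hypothetical names made up for
this example. Initializing the per-type minimum to UINT64_MAX when
nobody is waiting is also an assumption, not something stated in this
message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Wait types named as in the patch; the numeric order here is illustrative. */
typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_REPLAY_STANDBY,
	WAIT_LSN_TYPE_WRITE_STANDBY,
	WAIT_LSN_TYPE_FLUSH_STANDBY,
	WAIT_LSN_TYPE_COUNT
} WaitLSNType;

/* Hypothetical stand-in for the shared-memory wait state. */
typedef struct WaitLSNStateSketch
{
	/* Smallest LSN any backend is waiting for, tracked per wait type. */
	_Atomic uint64_t minWaitedLSN[WAIT_LSN_TYPE_COUNT];
} WaitLSNStateSketch;

/* Assumption: "no waiters" is encoded as the maximum LSN value. */
void
wait_state_init(WaitLSNStateSketch *state)
{
	assert(state != NULL);
	for (int i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
		atomic_init(&state->minWaitedLSN[i], UINT64_MAX);
}

/*
 * Cheap fast-path test: only walk the waiter list when the position just
 * reached is at or beyond the smallest LSN somebody is waiting for.
 */
bool
needs_wakeup(WaitLSNStateSketch *state, WaitLSNType type, XLogRecPtr reached)
{
	if (state == NULL)
		return false;		/* shared state not set up yet */
	assert(type < WAIT_LSN_TYPE_COUNT);
	return reached >= atomic_load(&state->minWaitedLSN[type]);
}
```

The point of the design is that the common case (no waiter, or all
waiters far ahead) costs one atomic read per write/flush cycle, and the
more expensive wakeup walk runs only when needs_wakeup() returns true.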

Feedback welcome.

Here is the updated v2 patch set. Most of the updates are in patch 3.

Changes from v1:

Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)

Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description

Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to report their errors after promotion.
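
For readers following the "explicit switch statement" item under Patch 3,
the idea can be sketched roughly as below. This is a standalone
illustration, not an excerpt from the patch: the WaitLSNMode values
mirror the enum added to parsenodes.h, the WaitLSNType names follow the
v2 rename, the numeric order of WaitLSNType is made up, and
wait_lsn_mode_to_type() is a hypothetical helper name.

```c
#include <assert.h>

/* Parser-level mode, mirroring the enum added to parsenodes.h. */
typedef enum WaitLSNMode
{
	WAIT_LSN_MODE_REPLAY,
	WAIT_LSN_MODE_WRITE,
	WAIT_LSN_MODE_FLUSH
} WaitLSNMode;

/*
 * Wait-infrastructure type.  The order is deliberately different from
 * WaitLSNMode to show why a direct cast would be fragile.
 */
typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_REPLAY_STANDBY,
	WAIT_LSN_TYPE_FLUSH_STANDBY,
	WAIT_LSN_TYPE_WRITE_STANDBY
} WaitLSNType;

/* Hypothetical helper: map each mode by name instead of by numeric value. */
WaitLSNType
wait_lsn_mode_to_type(WaitLSNMode mode)
{
	switch (mode)
	{
		case WAIT_LSN_MODE_REPLAY:
			return WAIT_LSN_TYPE_REPLAY_STANDBY;
		case WAIT_LSN_MODE_WRITE:
			return WAIT_LSN_TYPE_WRITE_STANDBY;
		case WAIT_LSN_MODE_FLUSH:
			return WAIT_LSN_TYPE_FLUSH_STANDBY;
	}

	/* Not reached for valid input; keeps the compiler satisfied. */
	assert(0 && "unrecognized wait mode");
	return WAIT_LSN_TYPE_REPLAY_STANDBY;
}
```

With a cast, reordering or extending either enum would silently change
which LSN a statement waits for; with the switch, a compiler that warns
about unhandled enum values can flag a missing case.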

--
Best,
Xuneng

Here is the updated v3. The changes are made to patch 3:

- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.

--
Best,
Xuneng

Attachments:

v3-0005-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 48f072498a128eb47f616e8c7e2621eb1ff2d831 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v3 5/5] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v3-0002-Add-pg_last_wal_write_lsn-SQL-function.patch (application/octet-stream)
From 9d22e09d378e8f6c52aa95bc4a0e1650f4621a39 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v3 2/5] Add pg_last_wal_write_lsn() SQL function

Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
 doc/src/sgml/func/func-admin.sgml      | 22 ++++++++++++++++++++++
 src/backend/access/transam/xlogfuncs.c | 20 ++++++++++++++++++++
 src/include/catalog/pg_proc.dat        |  4 ++++
 3 files changed, 46 insertions(+)

diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..9ff196c4be4 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,28 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_last_wal_write_lsn</primary>
+        </indexterm>
+        <function>pg_last_wal_write_lsn</function> ()
+        <returnvalue>pg_lsn</returnvalue>
+       </para>
+       <para>
+        Returns the last write-ahead log location that has been received and
+        passed to the operating system by streaming replication, but not
+        necessarily synced to durable storage.  It can be ahead of
+        <function>pg_last_wal_receive_lsn</function> but provides weaker
+        durability guarantees since the data may still be in OS buffers.
+        While streaming replication is in progress this will increase
+        monotonically. If recovery has completed then this will remain static
+        at the location of the last WAL record written during recovery. If
+        streaming replication is disabled, or if it has not yet started, the
+        function returns <literal>NULL</literal>.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..2797b2bf158 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,26 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * passed to the operating system by walreceiver.  Unlike pg_last_wal_receive_lsn,
+ * this data may still be in OS buffers and not yet synced to durable storage.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	recptr;
+
+	recptr = GetWalRcvWriteRecPtr();
+
+	if (!XLogRecPtrIsValid(recptr))
+		PG_RETURN_NULL();
+
+	PG_RETURN_LSN(recptr);
+}
+
 /*
  * Report the last WAL replay location (same format as pg_backup_start etc)
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..478e0a8139f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
   prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'last wal write location on standby',
+  proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => '',
+  prosrc => 'pg_last_wal_write_lsn' },
 { oid => '3821', descr => 'last wal replay location',
   proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.51.0

v3-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From d51394bdfdf16e0d569a0e5843288c1a36b671a5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v3 3/5] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 188 +++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  64 +++++-
 src/backend/parser/gram.y               |  21 +-
 src/backend/replication/walreceiver.c   |  19 ++
 src/include/nodes/parsenodes.h          |  11 +
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 294 ++++++++++++++++++++++--
 8 files changed, 518 insertions(+), 87 deletions(-)
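
Note on the error-message handling in wait.c: the patch replaces the
hard-coded "replayed"/"replay" wording with a small noun/verb table
indexed by wait type. The following is only a standalone sketch of that
idea under stated assumptions, not the patch's code: DemoWaitMode,
DemoModeDesc and demo_format_timeout() are hypothetical names, and
snprintf() stands in for ereport()/errmsg(). The two unsigned casts
mimic what LSN_FORMAT_ARGS() expands to.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the patch's wait types. */
typedef enum DemoWaitMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoWaitMode;

/* Noun/verb pair used to build mode-specific messages. */
typedef struct DemoModeDesc
{
	const char *noun;			/* "replay", "write", "flush" */
	const char *verb;			/* "replayed", "written", "flushed" */
} DemoModeDesc;

/* Designated initializers keep the table in sync with the enum order. */
static const DemoModeDesc demo_mode_descs[] = {
	[DEMO_MODE_REPLAY] = {"replay", "replayed"},
	[DEMO_MODE_WRITE] = {"write", "written"},
	[DEMO_MODE_FLUSH] = {"flush", "flushed"},
};

/*
 * Build the timeout message for the given mode.  Returns snprintf()'s
 * result, i.e. the length the full message needs.
 */
int
demo_format_timeout(char *buf, size_t buflen, DemoWaitMode mode,
					uint64_t target, uint64_t current)
{
	const DemoModeDesc *desc = &demo_mode_descs[mode];

	return snprintf(buf, buflen,
					"timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					desc->verb, desc->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

One thing the sketch makes visible: with the %X/%08X format the low
half of the LSN is zero-padded to eight hex digits, so the resulting
text differs from the unpadded LSN shown in the existing documentation
example.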

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..a5e7f6c6fe9 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <literal>NO_THROW</literal> is specified, the command instead
+   reports the outcome through its return value.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,57 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and synced to durable storage via <function>fsync()</function>.
+         This provides a durability guarantee without waiting for the WAL
+         to be applied. After successful completion,
+         <function>pg_last_wal_receive_lsn()</function> will return a value
+         greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and passed to the operating system via <function>write()</function>.
+         This is faster than <literal>FLUSH</literal> but provides weaker
+         durability guarantees since the data may still be in OS buffers.
+         After successful completion, <function>pg_last_wal_write_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>LSN</acronym> of
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>LSN</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +263,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +276,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,28 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
 
+	/* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+	switch (stmt->mode)
+	{
+		case WAIT_LSN_MODE_REPLAY:
+			lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+			break;
+		case WAIT_LSN_MODE_WRITE:
+			lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+			break;
+		case WAIT_LSN_MODE_FLUSH:
+			lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+			break;
+		default:
+			elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+	}
+
 	foreach_node(DefElem, defel, stmt->options)
 	{
 		if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first make sure that we
+	 * don't hold a snapshot and, correspondingly, that MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16489,15 +16490,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY,		/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_WRITE,		/* Wait for LSN write on standby */
+	WAIT_LSN_MODE_FLUSH			/* Wait for LSN flush on standby */
+} WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..ee3f2bf30d6 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql('postgres',
+		"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +184,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +267,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,199 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[4]'::pg_lsn);");
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +515,24 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +540,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

v3-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 51624191461fe702522c315d9da7a68da48a4b13 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v3 4/5] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the test below is correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v3-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v3 1/5] Extend xlogwait infrastructure with write and flush 
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 84 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 13 ++-
 6 files changed, 81 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_REPLAY_STANDBY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_WRITE_STANDBY:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_FLUSH_STANDBY:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+			return GetFlushRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_COUNT:
+			break;
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
-	WAIT_LSN_TYPE_COUNT = 2
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+	WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+	WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+	WAIT_LSN_TYPE_COUNT = 4
 } WaitLSNType;
 
 /*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0

#57Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#56)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi hackers,

On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.

I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

——
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
——

Two new modes are added: FLUSH and WRITE.

Design decisions:

1. MODE as a separate keyword (not in the WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
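
To make the per-type bookkeeping described for 0001 easier to picture, here is a minimal, self-contained sketch. It is not the patch's code: every name prefixed with Sketch/sketch_ is hypothetical, a fixed array with a linear scan stands in for the pairing heaps, and a boolean flag stands in for waking the backend. Only the WaitLSNType values are taken from the patch (using the v2 and later names).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Wait types, using the v2+ names from patch 0001. */
typedef enum WaitLSNType
{
    WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
    WAIT_LSN_TYPE_WRITE_STANDBY = 1,
    WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
    WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
    WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;

/* Everything named Sketch* or sketch_* is hypothetical. */
#define SKETCH_MAX_WAITERS 8
#define SKETCH_NO_WAITER UINT64_MAX

typedef struct SketchWaiter
{
    bool        in_use;
    bool        woken;
    WaitLSNType type;
    XLogRecPtr  target;
} SketchWaiter;

SketchWaiter sketch_waiters[SKETCH_MAX_WAITERS];

/* Cached least-awaited LSN per wait type; stands in for minWaitedLSN[]. */
XLogRecPtr  sketch_min_waited[WAIT_LSN_TYPE_COUNT] = {
    SKETCH_NO_WAITER, SKETCH_NO_WAITER, SKETCH_NO_WAITER, SKETCH_NO_WAITER
};

/* Linear scan in place of the per-type pairing heap. */
void
sketch_recompute_min(WaitLSNType type)
{
    XLogRecPtr  min = SKETCH_NO_WAITER;

    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
        if (sketch_waiters[i].in_use && sketch_waiters[i].type == type &&
            sketch_waiters[i].target < min)
            min = sketch_waiters[i].target;
    sketch_min_waited[type] = min;
}

/* Register a waiter; returns its slot, or -1 if the table is full. */
int
sketch_add_waiter(WaitLSNType type, XLogRecPtr target)
{
    assert(type < WAIT_LSN_TYPE_COUNT);
    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
    {
        if (!sketch_waiters[i].in_use)
        {
            sketch_waiters[i] = (SketchWaiter) {true, false, type, target};
            sketch_recompute_min(type);
            return i;
        }
    }
    return -1;
}

/* Wake waiters of one type whose target is reached; returns how many. */
int
sketch_wakeup(WaitLSNType type, XLogRecPtr current)
{
    int         nwoken = 0;

    /* Fast path: no waiter of this type has a target this low. */
    if (sketch_min_waited[type] > current)
        return 0;

    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
    {
        SketchWaiter *w = &sketch_waiters[i];

        if (w->in_use && w->type == type && w->target <= current)
        {
            w->in_use = false;
            w->woken = true;    /* stands in for waking the backend */
            nwoken++;
        }
    }
    sketch_recompute_min(type);
    return nwoken;
}
```

In this sketch a WRITE wakeup never touches FLUSH or REPLAY waiters, which is the property the mixed-mode TAP test is meant to exercise. The cached minimum keeps the common no-waiter case down to a single comparison.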

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
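
The completion flow described for 0004 can be read as a small decision table. The sketch below is only an illustration of that table: sketch_wait_completion is a hypothetical helper, not psql's Matches()/COMPLETE_WITH() machinery, and it compares words case-sensitively for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the words typed so far, return a
 * space-separated list of candidate completions, or NULL when nothing
 * is offered.  words[3] is the LSN literal and is accepted as-is.
 */
const char *
sketch_wait_completion(const char *words[], int nwords)
{
    if (nwords < 4 ||
        strcmp(words[0], "WAIT") != 0 ||
        strcmp(words[1], "FOR") != 0 ||
        strcmp(words[2], "LSN") != 0)
        return NULL;

    /* WAIT FOR LSN '<lsn>': MODE is optional, so WITH is offered too. */
    if (nwords == 4)
        return "MODE WITH";

    if (strcmp(words[4], "MODE") == 0)
    {
        if (nwords == 5)
            return "REPLAY FLUSH WRITE";
        if (nwords == 6)
            return "WITH";
        if (nwords == 7 && strcmp(words[6], "WITH") == 0)
            return "(";
    }
    else if (nwords == 5 && strcmp(words[4], "WITH") == 0)
        return "(";             /* WITH directly after the LSN */

    return NULL;
}
```

For example, the words WAIT FOR LSN '0/12345' MODE yield "REPLAY FLUSH WRITE", while WAIT FOR LSN '0/12345' WITH yields "(" because the default REPLAY mode applies.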

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
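
As a rough illustration of the decision described for 0005: the real helper is Perl in PostgreSQL::Test::Cluster, so the C function below is only a sketch of the branching and of the query shape, with a hypothetical name and signature. It returns 0 when the caller should fall back to polling.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper.  Returns the length of the query written to buf,
 * 0 if the polling path should be used instead ('sent' mode, or the
 * node is not in recovery), or -1 on an unknown mode or a short buffer.
 */
int
sketch_build_wait_query(char *buf, size_t buflen, const char *mode,
                        bool in_recovery, const char *target_lsn,
                        int timeout_s)
{
    char        upper[16];
    size_t      i;
    int         n;

    assert(buf != NULL && mode != NULL && target_lsn != NULL);

    /* WAIT FOR LSN has no 'sent' mode and is only usable in recovery. */
    if (!in_recovery || strcmp(mode, "sent") == 0)
        return 0;

    if (strcmp(mode, "replay") != 0 &&
        strcmp(mode, "write") != 0 &&
        strcmp(mode, "flush") != 0)
        return -1;

    /* Map the mode name to the upper-case MODE keyword. */
    for (i = 0; mode[i] != '\0'; i++)
        upper[i] = (char) toupper((unsigned char) mode[i]);
    upper[i] = '\0';

    n = snprintf(buf, buflen,
                 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
                 target_lsn, upper, timeout_s);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

With no_throw the command reports a status string rather than raising an error, so a caller would compare the result against "success" and treat anything else as a failed catchup.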

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.

Feedback welcome.

Here is the updated v2 patch set. Most of the updates are in patch 3.

Changes from v1:

Patch 1 (Extend wait types in xlogwait infra)
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)

Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description

Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to report their errors after promotion.
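
For readers wondering why an explicit switch is preferable to a cast here, the following is a minimal sketch. The mode enum and the function name are hypothetical (the actual WaitLSNMode definition is in patch 3, which is not quoted in this message); the WaitLSNType values follow patch 1.

```c
#include <assert.h>

/* Hypothetical parser-level mode enum; names are illustrative only. */
typedef enum SketchWaitMode
{
    SKETCH_MODE_REPLAY,
    SKETCH_MODE_WRITE,
    SKETCH_MODE_FLUSH
} SketchWaitMode;

/* Wait types as defined in patch 1. */
typedef enum WaitLSNType
{
    WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
    WAIT_LSN_TYPE_WRITE_STANDBY = 1,
    WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
    WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
    WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;

/*
 * Explicit mapping: stays correct even if either enum is reordered or
 * gains members, which a direct cast would silently get wrong.
 */
WaitLSNType
sketch_mode_to_type(SketchWaitMode mode)
{
    switch (mode)
    {
        case SKETCH_MODE_REPLAY:
            return WAIT_LSN_TYPE_REPLAY_STANDBY;
        case SKETCH_MODE_WRITE:
            return WAIT_LSN_TYPE_WRITE_STANDBY;
        case SKETCH_MODE_FLUSH:
            return WAIT_LSN_TYPE_FLUSH_STANDBY;
    }

    /* Not reached for valid input. */
    assert(0 && "unexpected wait mode");
    return WAIT_LSN_TYPE_REPLAY_STANDBY;
}
```

The mapping can never produce WAIT_LSN_TYPE_FLUSH_PRIMARY, so a user-visible mode cannot select the primary-only wait type by accident.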

--
Best,
Xuneng

Here is the updated v3. The changes are made to patch 3:

- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.

--
Best,
Xuneng

Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.

--
Best,
Xuneng

Attachments:

v4-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 56044afa03fe5732460c8de28039915133137602 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v4 4/4] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v4-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From c0748a75838fe9281a15f56976f3059596943fd3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v4 2/4] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 192 ++++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  64 +++++-
 src/backend/parser/gram.y               |  21 +-
 src/backend/replication/walreceiver.c   |  19 ++
 src/include/nodes/parsenodes.h          |  11 +
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 294 ++++++++++++++++++++++--
 8 files changed, 522 insertions(+), 87 deletions(-)
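
The wait.c part of this patch maps the parse-time mode to a runtime wait type and then uses a small descriptor table to fill in the noun and verb of the error messages. The sketch below illustrates that pattern in standalone C. It is an approximation under stated assumptions: the Sketch* types, mode_to_type and format_timeout_msg are hypothetical stand-ins, and the real code uses ereport() and LSN_FORMAT_ARGS() rather than snprintf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the patch's enums; names and values are illustrative. */
typedef enum { MODE_REPLAY, MODE_WRITE, MODE_FLUSH } SketchMode;
typedef enum { TYPE_REPLAY, TYPE_WRITE, TYPE_FLUSH, TYPE_COUNT } SketchType;

/* Noun and verb used to assemble messages for each wait type. */
typedef struct
{
	const char *noun;			/* "replay", "write", "flush" */
	const char *verb;			/* "replayed", "written", "flushed" */
} SketchDesc;

static const SketchDesc sketch_descs[TYPE_COUNT] = {
	[TYPE_REPLAY] = {"replay", "replayed"},
	[TYPE_WRITE] = {"write", "written"},
	[TYPE_FLUSH] = {"flush", "flushed"},
};

/* Convert the parse-time mode to the runtime wait type. */
static SketchType
mode_to_type(SketchMode mode)
{
	switch (mode)
	{
		case MODE_REPLAY:
			return TYPE_REPLAY;
		case MODE_WRITE:
			return TYPE_WRITE;
		case MODE_FLUSH:
			return TYPE_FLUSH;
	}
	abort();					/* unrecognized mode */
}

/* Format the timeout message, printing each LSN as %X/%08X. */
static int
format_timeout_msg(char *buf, size_t buflen, SketchType type,
				   uint64_t target, uint64_t current)
{
	const SketchDesc *d;

	assert(buf != NULL && type < TYPE_COUNT);
	d = &sketch_descs[type];

	return snprintf(buf, buflen,
					"timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					d->verb, d->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

Keeping the wording in one table means each message template exists once, at the cost of composing sentences from fragments, which may matter for translation.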

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and flushed to disk. This provides a durability guarantee
+         without waiting for the WAL to be applied. After successful
+         completion, <function>pg_last_wal_receive_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+         This value is also available as the <structfield>flushed_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and written to disk, but not yet flushed. This is faster
+         than <literal>FLUSH</literal> but provides weaker durability
+         guarantees since the data may still be in operating system buffers.
+         After successful completion, the <structfield>written_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link> will show
+         a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,28 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
 
+	/* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+	switch (stmt->mode)
+	{
+		case WAIT_LSN_MODE_REPLAY:
+			lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+			break;
+		case WAIT_LSN_MODE_WRITE:
+			lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+			break;
+		case WAIT_LSN_MODE_FLUSH:
+			lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+			break;
+		default:
+			elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+	}
+
 	foreach_node(DefElem, defel, stmt->options)
 	{
 		if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16489,15 +16490,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY,		/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_WRITE,		/* Wait for LSN write on standby */
+	WAIT_LSN_MODE_FLUSH			/* Wait for LSN flush on standby */
+}			WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..df7b563cfbb 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql('postgres',
+		"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +184,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +267,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,199 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);");
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +515,24 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +540,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

v4-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From af5b59d0e065ecb2f7b68c0eec8e55b892a5a435 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v4 3/4] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the test below is correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v4-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v4] Extend xlogwait infrastructure with write and flush  wait
 types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 84 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 13 ++-
 6 files changed, 81 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_REPLAY_STANDBY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_WRITE_STANDBY:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_FLUSH_STANDBY:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+			return GetFlushRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_COUNT:
+			break;
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY = 0,	/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH = 1,	/* Waiting for flush on primary */
-	WAIT_LSN_TYPE_COUNT = 2
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+	WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+	WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+	WAIT_LSN_TYPE_COUNT = 4
 } WaitLSNType;
 
 /*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0

#58Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#57)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 2, 2025 at 6:10 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi hackers,

On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of
overkill.

I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine that some
clustering solution could use it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

——
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
——

Two modes are added: flush and write.

Design decisions:

1. MODE as a separate keyword (not in the WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`.
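
To make decisions 2 and 3 concrete, here is a minimal standalone sketch of
how an optional mode keyword could be resolved with REPLAY as the fallback.
It is not the patch's parser code: in the patch the keywords are handled by
the grammar, and every name below (DemoWaitMode, demo_keyword_eq,
demo_resolve_mode) is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the identifiers used by the patch. */
typedef enum DemoWaitMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoWaitMode;

/* Case-insensitive comparison of two NUL-terminated ASCII strings. */
bool
demo_keyword_eq(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Resolve the optional MODE keyword.  A NULL keyword stands for an omitted
 * MODE clause, in which case REPLAY is chosen.  Returns false (and leaves
 * *mode untouched) for an unrecognized keyword.
 */
bool
demo_resolve_mode(const char *keyword, DemoWaitMode *mode)
{
	assert(mode != NULL);

	if (keyword == NULL || demo_keyword_eq(keyword, "replay"))
		*mode = DEMO_MODE_REPLAY;
	else if (demo_keyword_eq(keyword, "write"))
		*mode = DEMO_MODE_WRITE;
	else if (demo_keyword_eq(keyword, "flush"))
		*mode = DEMO_MODE_FLUSH;
	else
		return false;

	return true;
}
```

With this shape, a statement without MODE and one with MODE REPLAY take the
same path, which is why existing callers would not need to change.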

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
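
For readers who want a feel for the per-type bookkeeping, below is a rough
standalone sketch of the idea: each wait type keeps its own set of waiters
plus a cached minimum awaited LSN, and the waking process checks that cached
value before scanning anything. This is only a model under simplifying
assumptions. The real code uses pairing heaps in shared memory and latches;
here a small fixed array stands in for the heap, "waking" is just a counter,
and all Demo* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLSN;

/* Sentinel meaning "nobody is waiting", so the fast path always skips. */
#define DEMO_NO_WAITER UINT64_MAX
#define DEMO_MAX_WAITERS 8

typedef enum DemoWaitType
{
	DEMO_WAIT_REPLAY,
	DEMO_WAIT_WRITE,
	DEMO_WAIT_FLUSH,
	DEMO_WAIT_COUNT
} DemoWaitType;

/* A fixed array per type stands in for the per-type pairing heap. */
typedef struct DemoWaitState
{
	DemoLSN		targets[DEMO_WAIT_COUNT][DEMO_MAX_WAITERS];
	int			in_use[DEMO_WAIT_COUNT][DEMO_MAX_WAITERS];
	DemoLSN		min_waited[DEMO_WAIT_COUNT];	/* cached minimum per type */
} DemoWaitState;

void
demo_init(DemoWaitState *s)
{
	assert(s != NULL);

	for (int t = 0; t < DEMO_WAIT_COUNT; t++)
	{
		for (int i = 0; i < DEMO_MAX_WAITERS; i++)
			s->in_use[t][i] = 0;
		s->min_waited[t] = DEMO_NO_WAITER;
	}
}

/* Register a waiter; returns its slot, or -1 if the demo array is full. */
int
demo_add_waiter(DemoWaitState *s, DemoWaitType type, DemoLSN target)
{
	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (!s->in_use[type][i])
		{
			s->in_use[type][i] = 1;
			s->targets[type][i] = target;
			if (target < s->min_waited[type])
				s->min_waited[type] = target;
			return i;
		}
	}
	return -1;
}

/* Release waiters of one type; returns how many were released. */
int
demo_wakeup(DemoWaitState *s, DemoWaitType type, DemoLSN current)
{
	int			woken = 0;
	DemoLSN		new_min = DEMO_NO_WAITER;

	/* Fast path: the smallest awaited LSN is still ahead of us. */
	if (s->min_waited[type] > current)
		return 0;

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (!s->in_use[type][i])
			continue;
		if (s->targets[type][i] <= current)
		{
			s->in_use[type][i] = 0;
			woken++;
		}
		else if (s->targets[type][i] < new_min)
			new_min = s->targets[type][i];
	}

	/* Refresh the cache from the waiters that remain. */
	s->min_waited[type] = new_min;
	return woken;
}
```

Because each type has its own minimum, progress of the write position does
not disturb flush or replay waiters, and the common "no one is waiting" case
costs a single comparison.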

Feedback welcome.

Here is the updated v2 patch set. Most of the updates are in patch 3.

Changes from v1:

Patch 1 (Extend wait types in xlogwait infra)
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)

Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description

Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to get their errors after promotion.
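
To illustrate why the explicit switch is preferable to a direct cast, here
is a small standalone sketch. The type values mirror the ordering of the
WaitLSNType enum in patch 1; the mode enum and all Sketch* names are
hypothetical, and the mode ordering is chosen on purpose so that it does not
line up with the type ordering.

```c
#include <assert.h>

/* User-facing modes; ordering here is an assumption for the example. */
typedef enum SketchMode
{
	SKETCH_MODE_REPLAY,
	SKETCH_MODE_FLUSH,
	SKETCH_MODE_WRITE
} SketchMode;

/* Internal wait types, numbered like WaitLSNType in patch 1. */
typedef enum SketchType
{
	SKETCH_TYPE_REPLAY_STANDBY = 0,
	SKETCH_TYPE_WRITE_STANDBY = 1,
	SKETCH_TYPE_FLUSH_STANDBY = 2,
	SKETCH_TYPE_FLUSH_PRIMARY = 3
} SketchType;

/*
 * Map a mode to a wait type by name rather than by numeric value, so that
 * reordering or extending either enum cannot silently change the mapping.
 */
SketchType
sketch_mode_to_type(SketchMode mode)
{
	switch (mode)
	{
		case SKETCH_MODE_REPLAY:
			return SKETCH_TYPE_REPLAY_STANDBY;
		case SKETCH_MODE_FLUSH:
			return SKETCH_TYPE_FLUSH_STANDBY;
		case SKETCH_MODE_WRITE:
			return SKETCH_TYPE_WRITE_STANDBY;
	}

	/* Not reached for valid input; fail loudly in debug builds. */
	assert(!"unrecognized wait mode");
	return SKETCH_TYPE_REPLAY_STANDBY;
}
```

In this sketch a cast of SKETCH_MODE_FLUSH would yield the numeric value of
the WRITE type, which is exactly the kind of mismatch the switch rules out.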

--
Best,
Xuneng

Here is the updated v3. The changes are made to patch 3:

- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.

--
Best,
Xuneng

Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.

Just a rebase with minor changes to the wait-LSN types.

--
Best,
Xuneng

Attachments:

v5-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 9b5e818ed2807a7c2eb3ac743cbf4dfe8103ea6d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v5 3/4] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..8f269b5cb13 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5327,10 +5327,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5339,25 +5340,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the above test is correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0
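
For readers who want to see the completion order the tab-completion patch
above is aiming for without reading psql's Matches() macros, here is a
minimal standalone sketch. It is not psql code: wait_for_next_keywords() is
a hypothetical helper, it takes only the words typed after WAIT FOR LSN, it
compares keywords case-sensitively (psql's matching is case-insensitive),
and it does not model the parenthesized option list.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only, not part of psql).
 *
 * "tail" holds the words typed after WAIT FOR LSN, so tail[0] is the LSN
 * literal.  Returns a space-separated list of suggested next keywords, or
 * "" when nothing is suggested.
 */
const char *
wait_for_next_keywords(const char *const *tail, size_t ntail)
{
    /* An explicit MODE keyword directly follows the LSN value. */
    int has_mode = (ntail >= 2 && strcmp(tail[1], "MODE") == 0);

    if (ntail == 0)
        return "";                      /* LSN value: user must type it */
    if (ntail == 1)
        return "MODE WITH";             /* MODE is optional */
    if (ntail == 2 && has_mode)
        return "REPLAY FLUSH WRITE";
    if (ntail == 3 && has_mode)
        return "WITH";                  /* after MODE <value> */
    if (ntail == 2 && strcmp(tail[1], "WITH") == 0)
        return "(";                     /* WITH right after the LSN */
    if (ntail == 4 && has_mode && strcmp(tail[3], "WITH") == 0)
        return "(";                     /* WITH after MODE <value> */
    return "";
}
```

The two "(" branches correspond to the two forms of the command, with and
without an explicit MODE, which is why the patch needs two nearly identical
rules for WITH and for the option list.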

v5-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From dd82542b2a4961fd050eab70ea66a1c152edefdc Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v5 4/4] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..eec8233b515 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
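
To make the branch taken by wait_for_catchup() in the patch above easier to
follow, here is a rough C rendering of the two decisions it makes: whether
WAIT FOR LSN can be used at all, and what the generated command looks like.
This is only a sketch. The real code is Perl in Cluster.pm, and the function
names use_wait_for_lsn() and build_wait_query() are made up for this
illustration.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: WAIT FOR LSN is used only on a node in recovery and
 * only for the replay, write and flush modes; 'sent' keeps polling.
 */
bool
use_wait_for_lsn(const char *mode, bool in_recovery)
{
    return in_recovery && strcmp(mode, "sent") != 0;
}

/*
 * Hypothetical helper: format the command the test helper sends.  The mode
 * name is upper-cased, as the Perl code does with uc().  Returns what
 * snprintf() returns, so the caller can detect truncation.
 */
int
build_wait_query(char *buf, size_t buflen, const char *target_lsn,
                 const char *mode, int timeout_secs)
{
    char   upper[16];
    size_t i;

    for (i = 0; mode[i] != '\0' && i < sizeof(upper) - 1; i++)
        upper[i] = (char) toupper((unsigned char) mode[i]);
    upper[i] = '\0';

    return snprintf(buf, buflen,
                    "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
                    target_lsn, upper, timeout_secs);
}
```

Because no_throw is always passed, a timeout comes back as a status string
rather than an error, which is what lets the Perl code print diagnostics
before it croaks.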

v5-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From 66c509e07bcbaa4580b32266326e34487a16d683 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v5 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 84 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 12 ++-
 6 files changed, 80 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..726a4a14084 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_COUNT:
+			break;
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..3b2f34b8698 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
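
The WaitLSNWaitEvents table in the patch above pairs an enum-indexed array
with a compile-time length check. For anyone unfamiliar with the idiom, here
is a small standalone C11 sketch of the same pattern. All Demo* and demo_*
names are invented for the example, the strings are just the event names
from wait_event_names.txt, and plain _Static_assert stands in for
PostgreSQL's StaticAssertDecl.

```c
#include <stddef.h>

/* Stand-in for WaitLSNType; the names are hypothetical. */
typedef enum DemoWaitType
{
    DEMO_WAIT_STANDBY_REPLAY,
    DEMO_WAIT_STANDBY_WRITE,
    DEMO_WAIT_STANDBY_FLUSH,
    DEMO_WAIT_PRIMARY_FLUSH,
} DemoWaitType;

#define DEMO_WAIT_TYPE_COUNT (DEMO_WAIT_PRIMARY_FLUSH + 1)
#define demo_lengthof(a) (sizeof(a) / sizeof((a)[0]))

/* One entry per wait type; both flush types share one event name. */
static const char *const demo_wait_events[] = {
    [DEMO_WAIT_STANDBY_REPLAY] = "WAIT_FOR_WAL_REPLAY",
    [DEMO_WAIT_STANDBY_WRITE] = "WAIT_FOR_WAL_WRITE",
    [DEMO_WAIT_STANDBY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
    [DEMO_WAIT_PRIMARY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
};

/* Fails to compile if a type is appended to the enum but not the table. */
_Static_assert(demo_lengthof(demo_wait_events) == DEMO_WAIT_TYPE_COUNT,
               "demo_wait_events must match DemoWaitType enum");

/* Table lookup replaces the old two-way ternary; NULL if out of range. */
const char *
demo_wait_event_name(DemoWaitType type)
{
    if ((size_t) type >= demo_lengthof(demo_wait_events))
        return NULL;
    return demo_wait_events[type];
}
```

One caveat with this idiom: the length check only catches a missing entry
at the end of the table. With designated initializers, a missing entry in
the middle still produces an array of the right length with a zeroed slot.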

v5-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From ac01547201b1098c31e9bb46594896b677207bd8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v5 2/4] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 192 +++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  64 ++++-
 src/backend/parser/gram.y               |  21 +-
 src/backend/replication/walreceiver.c   |  19 ++
 src/include/nodes/parsenodes.h          |  11 +
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 295 ++++++++++++++++++++++--
 8 files changed, 523 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and flushed to disk. This provides a durability guarantee
+         without waiting for the WAL to be applied. After successful
+         completion, <function>pg_last_wal_receive_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+         This value is also available as the <structfield>flushed_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and written to disk, but not yet flushed. This is faster
+         than <literal>FLUSH</literal> but provides weaker durability
+         guarantees since the data may still be in operating system buffers.
+         After successful completion, the <structfield>written_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link> will show
+         a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..60cf3ee1c9a 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,28 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
 
+	/* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+	switch (stmt->mode)
+	{
+		case WAIT_LSN_MODE_REPLAY:
+			lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			break;
+		case WAIT_LSN_MODE_WRITE:
+			lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			break;
+		case WAIT_LSN_MODE_FLUSH:
+			lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			break;
+		default:
+			elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+	}
+
 	foreach_node(DefElem, defel, stmt->options)
 	{
 		if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 28f4e11e30f..94a9e874699 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -641,6 +641,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -732,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -773,7 +774,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16541,15 +16542,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17989,6 +17998,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18124,6 +18134,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18578,6 +18589,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18761,6 +18773,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index bc7adba4a0f..c4d9f03a6a5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4413,10 +4413,21 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY,		/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_WRITE,		/* Wait for LSN write on standby */
+	WAIT_LSN_MODE_FLUSH			/* Wait for LSN flush on standby */
+}			WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 9fde58f541c..04008805e46 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -379,6 +380,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..98060a5c79f 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql('postgres',
+		"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +184,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +267,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,200 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +516,24 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +541,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

#59Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#58)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 16, 2025 at 11:28 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Dec 2, 2025 at 6:10 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi hackers,

On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi!

At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of
overkill.

I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution using it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).

------
Regards,
Alexander Korotkov
Supabase

Makes sense. I'll play with it and try to prepare a follow-up patch.

--
Best,
Xuneng

In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.

Syntax Option A: Mode in the WITH Clause

WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');

With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.

Syntax Option B: Mode as Part of the Main Command

WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';

Or a more concise variant using keywords:

WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;

This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.

I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.

I've implemented a patch that adds MODE support to WAIT FOR LSN.

The new grammar looks like:

——
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
——

Two modes are added: FLUSH and WRITE.

Design decisions:

1. MODE as a separate keyword (not in the WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.

2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.

3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`.
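
As a rough illustration of decisions 2 and 3, here is a minimal,
self-contained C sketch of how an optional mode keyword could resolve
to an enum value with REPLAY as the default. It is not code from the
patch: there the grammar does this work, and resolve_wait_mode() and
keyword_equals() are hypothetical names used only for this example.
Only the WaitLSNMode values mirror the patch.

```c
#include <ctype.h>
#include <stddef.h>

/* Stand-in for the WaitLSNMode enum that the patch adds to parsenodes.h. */
typedef enum WaitLSNMode
{
	WAIT_LSN_MODE_REPLAY,
	WAIT_LSN_MODE_WRITE,
	WAIT_LSN_MODE_FLUSH
} WaitLSNMode;

/* Hypothetical helper: case-insensitive equality of two ASCII strings. */
int
keyword_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper: resolve an optional mode keyword.  A missing MODE
 * clause (keyword == NULL) yields REPLAY.  Returns 0 on success and -1
 * when the keyword is not one of the three known modes.
 */
int
resolve_wait_mode(const char *keyword, WaitLSNMode *mode)
{
	if (keyword == NULL || keyword_equals(keyword, "replay"))
		*mode = WAIT_LSN_MODE_REPLAY;
	else if (keyword_equals(keyword, "write"))
		*mode = WAIT_LSN_MODE_WRITE;
	else if (keyword_equals(keyword, "flush"))
		*mode = WAIT_LSN_MODE_FLUSH;
	else
		return -1;
	return 0;
}
```

With real keywords the parser rejects anything other than REPLAY, WRITE
or FLUSH at parse time, which is what the -1 return stands in for here.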

The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types

Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
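
To make the wakeup path more concrete, here is a simplified,
self-contained C model of the idea: each wait type tracks the smallest
awaited LSN, and the producer only scans the waiters once its position
reaches that minimum. This is a sketch under assumptions, not the patch
code: it uses a flat array where the patch uses pairing heaps, it has
no locking or latches, and Lsn, WaitKind, WaitState and the
wait_state_*() functions are hypothetical names. The UINT64_MAX
"nobody is waiting" sentinel is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */
#define NO_WAITERS UINT64_MAX	/* assumed sentinel: nobody is waiting */
#define MAX_WAITERS 8

typedef enum WaitKind
{
	WAIT_KIND_REPLAY,
	WAIT_KIND_WRITE,
	WAIT_KIND_FLUSH,
	WAIT_KIND_COUNT
} WaitKind;

typedef struct Waiter
{
	bool		in_use;
	bool		woken;
	Lsn			target;
} Waiter;

typedef struct WaitState
{
	Lsn			min_waited[WAIT_KIND_COUNT];
	Waiter		waiters[WAIT_KIND_COUNT][MAX_WAITERS];
} WaitState;

void
wait_state_init(WaitState *st)
{
	memset(st, 0, sizeof(*st));
	for (int k = 0; k < WAIT_KIND_COUNT; k++)
		st->min_waited[k] = NO_WAITERS;
}

/* Register a waiter; returns its slot, or -1 if the table is full. */
int
wait_state_add(WaitState *st, WaitKind kind, Lsn target)
{
	for (int i = 0; i < MAX_WAITERS; i++)
	{
		Waiter	   *w = &st->waiters[kind][i];

		if (!w->in_use)
		{
			w->in_use = true;
			w->woken = false;
			w->target = target;
			if (target < st->min_waited[kind])
				st->min_waited[kind] = target;
			return i;
		}
	}
	return -1;
}

/* Cheap check the producer runs after advancing its position. */
bool
wait_state_needs_wakeup(const WaitState *st, WaitKind kind, Lsn current)
{
	return current >= st->min_waited[kind];
}

/* Wake every waiter of this kind whose target is reached; returns the count. */
int
wait_state_wakeup(WaitState *st, WaitKind kind, Lsn current)
{
	int			woken = 0;
	Lsn			new_min = NO_WAITERS;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		Waiter	   *w = &st->waiters[kind][i];

		if (!w->in_use)
			continue;
		if (w->target <= current)
		{
			w->in_use = false;
			w->woken = true;
			woken++;
		}
		else if (w->target < new_min)
			new_min = w->target;
	}
	st->min_waited[kind] = new_min;
	return woken;
}
```

Keeping a separate minimum per wait type means that a write by the
walreceiver never has to look at replay or flush waiters.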

-------
0002 - Add pg_last_wal_write_lsn() SQL function

Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.

-------
0003 - Add MODE parameter to WAIT FOR LSN command

Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.

-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter

Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.

-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()

Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.

The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.

Feedback welcome.

Here is the updated v2 patch set. Most of the updates are in patch 3.

Changes from v1:

Patch 1 (Extend wait types in xlogwait infra)
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)

Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description

Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed a timing issue in TAP test 8 when waiting for all three
sessions to report their errors after promotion.
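
For illustration, the reason an explicit switch is safer than a direct
cast can be sketched in standalone C. Everything here is hypothetical:
DemoMode and DemoType stand in for the parse-time and runtime enums,
and their differing order is an assumption made only to show the
hazard, not the layout used in the patch. The descriptor table follows
the noun/verb idea used for the error messages.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parse-time enum (stands in for WaitLSNMode). */
typedef enum DemoMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoMode;

/*
 * Hypothetical runtime enum (stands in for WaitLSNType).  The order is
 * deliberately different from DemoMode; this is an assumption made only
 * to show why a plain cast would be fragile.
 */
typedef enum DemoType
{
	DEMO_TYPE_FLUSH,
	DEMO_TYPE_REPLAY,
	DEMO_TYPE_WRITE
} DemoType;

typedef struct DemoTypeDesc
{
	const char *noun;			/* "replay", "flush", "write" */
	const char *verb;			/* "replayed", "flushed", "written" */
} DemoTypeDesc;

static const DemoTypeDesc demo_type_descs[] = {
	[DEMO_TYPE_FLUSH] = {"flush", "flushed"},
	[DEMO_TYPE_REPLAY] = {"replay", "replayed"},
	[DEMO_TYPE_WRITE] = {"write", "written"},
};

/* Explicit conversion; returns 0 on success, -1 for an unknown mode. */
int
demo_mode_to_type(DemoMode mode, DemoType *type)
{
	switch (mode)
	{
		case DEMO_MODE_REPLAY:
			*type = DEMO_TYPE_REPLAY;
			return 0;
		case DEMO_MODE_WRITE:
			*type = DEMO_TYPE_WRITE;
			return 0;
		case DEMO_MODE_FLUSH:
			*type = DEMO_TYPE_FLUSH;
			return 0;
		default:
			return -1;
	}
}

/* Build the timeout message from the descriptor of the given type. */
int
demo_format_timeout(char *buf, size_t len, DemoType type,
					unsigned long long target, unsigned long long current)
{
	const DemoTypeDesc *desc = &demo_type_descs[type];

	return snprintf(buf, len,
					"timed out while waiting for target LSN %X/%08X to be %s; "
					"current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					desc->verb, desc->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

If either enum is later reordered or extended, the switch keeps the
mapping correct, whereas a cast would silently pick the wrong type.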

--
Best,
Xuneng

Here is the updated v3. The changes are made to patch 3:

- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.

--
Best,
Xuneng

Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.

Just a rebase with minor changes to the wait-LSN types.

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

--
Best,
Xuneng

Attachments:

v6-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From 7292901a0119dca75c349cd6f5a460f5cb0e4139 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v6 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with a MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 81 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 12 ++-
 6 files changed, 77 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..d54b2fd7ae4 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,45 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +386,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +419,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..3b2f34b8698 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
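
The per-type fast path described in the xlogwait.c header comment above can be modelled in a few lines. This is only a sketch under stated assumptions: the real module keeps waiters in per-type pairing heaps in shared memory and reads minWaitedLSN atomically, while this model uses a plain array scan in a single process. All demo_* names, the table sizes, and the use of UINT64_MAX as the "nobody waiting" value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TYPE_COUNT 4			/* replay, write, flush, primary flush */
#define DEMO_MAX_WAITERS 8			/* arbitrary size for the sketch */
#define DEMO_NO_WAITER UINT64_MAX	/* assumed "nobody is waiting" value */

typedef struct DemoWaiter
{
	bool		in_use;
	int			type;
	uint64_t	target;
} DemoWaiter;

static DemoWaiter demo_waiters[DEMO_MAX_WAITERS];
static uint64_t demo_min_waited[DEMO_TYPE_COUNT];

/* Recompute the cached minimum for one type by scanning its waiters. */
static void
demo_update_min(int type)
{
	uint64_t	min = DEMO_NO_WAITER;

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (demo_waiters[i].in_use && demo_waiters[i].type == type &&
			demo_waiters[i].target < min)
			min = demo_waiters[i].target;
	}
	demo_min_waited[type] = min;
}

/* Reset all state: no waiters, every cached minimum at the sentinel. */
void
demo_init(void)
{
	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
		demo_waiters[i].in_use = false;
	for (int t = 0; t < DEMO_TYPE_COUNT; t++)
		demo_min_waited[t] = DEMO_NO_WAITER;
}

/* Publish a waiter for (type, target); returns false if the table is full. */
bool
demo_add_waiter(int type, uint64_t target)
{
	assert(type >= 0 && type < DEMO_TYPE_COUNT);

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (!demo_waiters[i].in_use)
		{
			demo_waiters[i].in_use = true;
			demo_waiters[i].type = type;
			demo_waiters[i].target = target;
			if (target < demo_min_waited[type])
				demo_min_waited[type] = target;
			return true;
		}
	}
	return false;
}

/*
 * Called by whichever process advanced the LSN of the given type.
 * Returns the number of waiters released.
 */
int
demo_wakeup(int type, uint64_t current)
{
	int			woken = 0;

	assert(type >= 0 && type < DEMO_TYPE_COUNT);

	/* Fast path: the lowest awaited LSN of this type is still ahead. */
	if (current < demo_min_waited[type])
		return 0;

	for (int i = 0; i < DEMO_MAX_WAITERS; i++)
	{
		if (demo_waiters[i].in_use && demo_waiters[i].type == type &&
			demo_waiters[i].target <= current)
		{
			demo_waiters[i].in_use = false;	/* the real code sets a latch */
			woken++;
		}
	}
	demo_update_min(type);
	return woken;
}
```

The point of the cached minimum is that the common case, where no waiter of that type has been reached yet, costs one comparison and never touches the waiter table. Keeping one minimum per type also means that write, flush and replay waiters do not disturb each other's fast path.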

v6-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 0df07ec61ec10096782262d7fcb996e879cf2367 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v6 2/4] Add MODE parameter to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]

- REPLAY (default): Wait for WAL to be replayed up to the specified LSN
- WRITE: Wait for WAL to be written (received) up to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk up to the specified LSN

The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by the LOCK command,
where the mode parameter is optional with a sensible default.

The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and refactoring
  of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 192 +++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  64 ++++-
 src/backend/parser/gram.y               |  21 +-
 src/backend/replication/walreceiver.c   |  19 ++
 src/include/nodes/parsenodes.h          |  11 +
 src/include/parser/kwlist.h             |   2 +
 src/test/recovery/t/049_wait_for_lsn.pl | 295 ++++++++++++++++++++++--
 8 files changed, 523 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>MODE</literal></term>
+    <listitem>
+     <para>
+      Specifies the type of LSN processing to wait for. If not specified,
+      the default is <literal>REPLAY</literal>. The valid modes are:
+     </para>
+
+     <variablelist>
+      <varlistentry>
+       <term><literal>REPLAY</literal></term>
+       <listitem>
+        <para>
+         Wait for the LSN to be replayed (applied to the database).
+         After successful completion, <function>pg_last_wal_replay_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>FLUSH</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and flushed to disk. This provides a durability guarantee
+         without waiting for the WAL to be applied. After successful
+         completion, <function>pg_last_wal_receive_lsn()</function>
+         will return a value greater than or equal to the target LSN.
+         This value is also available as the <structfield>flushed_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link>.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term><literal>WRITE</literal></term>
+       <listitem>
+        <para>
+         Wait for the WAL containing the LSN to be received from the
+         primary and written to disk, but not yet flushed. This is faster
+         than <literal>FLUSH</literal> but provides weaker durability
+         guarantees since the data may still be in operating system buffers.
+         After successful completion, the <structfield>written_lsn</structfield>
+         column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+         <structname>pg_stat_wal_receiver</structname></link> will show
+         a value greater than or equal to the target LSN.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
     <listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> of
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>lsn</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
    changes made on primary should be guaranteed to be visible on replica.
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
 
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
 ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
 </programlisting>
   </para>
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..60cf3ee1c9a 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,12 +29,28 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType;
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
 
+	/* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+	switch (stmt->mode)
+	{
+		case WAIT_LSN_MODE_REPLAY:
+			lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			break;
+		case WAIT_LSN_MODE_WRITE:
+			lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			break;
+		case WAIT_LSN_MODE_FLUSH:
+			lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			break;
+		default:
+			elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+	}
+
 	foreach_node(DefElem, defel, stmt->options)
 	{
 		if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 28f4e11e30f..94a9e874699 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -641,6 +641,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <windef>	window_definition over_clause window_specification
 				opt_frame_clause frame_extent frame_bound
 %type <ival>	null_treatment opt_window_exclusion_clause
+%type <ival>	opt_wait_lsn_mode
 %type <str>		opt_existing_window_name
 %type <boolean> opt_if_not_exists
 %type <boolean> opt_unique_null_treatment
@@ -732,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
 	EXPRESSION EXTENSION EXTERNAL EXTRACT
 
-	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+	FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
 	FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
 
 	GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -773,7 +774,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -16541,15 +16542,23 @@ xml_passing_mech:
  *****************************************************************************/
 
 WaitStmt:
-			WAIT FOR LSN_P Sconst opt_wait_with_clause
+			WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
 				{
 					WaitStmt *n = makeNode(WaitStmt);
 					n->lsn_literal = $4;
-					n->options = $5;
+					n->mode = $5;
+					n->options = $6;
 					$$ = (Node *) n;
 				}
 			;
 
+opt_wait_lsn_mode:
+			MODE REPLAY			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			| MODE FLUSH		{ $$ = WAIT_LSN_MODE_FLUSH; }
+			| MODE WRITE		{ $$ = WAIT_LSN_MODE_WRITE; }
+			| /*EMPTY*/			{ $$ = WAIT_LSN_MODE_REPLAY; }
+			;
+
 opt_wait_with_clause:
 			WITH '(' utility_option_list ')'		{ $$ = $3; }
 			| /*EMPTY*/								{ $$ = NIL; }
@@ -17989,6 +17998,7 @@ unreserved_keyword:
 			| FILTER
 			| FINALIZE
 			| FIRST_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FORMAT
@@ -18124,6 +18134,7 @@ unreserved_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESPECT_P
@@ -18578,6 +18589,7 @@ bare_label_keyword:
 			| FINALIZE
 			| FIRST_P
 			| FLOAT_P
+			| FLUSH
 			| FOLLOWING
 			| FORCE
 			| FOREIGN
@@ -18761,6 +18773,7 @@ bare_label_keyword:
 			| RENAME
 			| REPEATABLE
 			| REPLACE
+			| REPLAY
 			| REPLICA
 			| RESET
 			| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index bc7adba4a0f..c4d9f03a6a5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4413,10 +4413,21 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+	WAIT_LSN_MODE_REPLAY,		/* Wait for LSN replay on standby */
+	WAIT_LSN_MODE_WRITE,		/* Wait for LSN write on standby */
+	WAIT_LSN_MODE_FLUSH			/* Wait for LSN flush on standby */
+}			WaitLSNMode;
+
 typedef struct WaitStmt
 {
 	NodeTag		type;
 	char	   *lsn_literal;	/* LSN string from grammar */
+	WaitLSNMode mode;			/* Wait mode: REPLAY/FLUSH/WRITE */
 	List	   *options;		/* List of DefElem nodes */
 } WaitStmt;
 
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 9fde58f541c..04008805e46 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -379,6 +380,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..98060a5c79f 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql('postgres',
+		"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +184,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +267,7 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
-		WAIT FOR LSN '${lsn}';
+		WAIT FOR LSN '${lsn}' MODE REPLAY;
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,200 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +516,24 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
 
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
 
 ok(1, 'wait for already replayed LSN exits immediately even after promotion');
 
 $output = $node_standby->safe_psql(
 	'postgres', qq[
-	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+	WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
 ok($output eq "not in recovery",
 	"WAIT FOR returns correct status after standby promotion");
 
@@ -295,8 +541,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0
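
For readers following the walreceiver.c hunks in the patch above: the added code compares the freshly written or flushed LSN against minWaitedLSN before calling WaitLSNWakeup(), so the common case where nobody is waiting costs a single atomic read. Below is a minimal, single-process sketch of that fast-path idea in plain C11. It is not the xlogwait.c implementation: the ToyWaitState type, the toy_* functions, the fixed-size array, and the UINT64_MAX sentinel are all hypothetical, and locking and latches are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8           /* hypothetical fixed capacity */
#define NO_WAITERS  UINT64_MAX  /* sentinel: nobody is waiting */

typedef struct ToyWaiter
{
    uint64_t    target_lsn;     /* LSN this waiter wants to be reached */
    bool        in_use;         /* slot is occupied */
    bool        woken;          /* waiter has already been notified */
} ToyWaiter;

typedef struct ToyWaitState
{
    _Atomic uint64_t min_waited_lsn;    /* smallest LSN still waited for */
    ToyWaiter   waiters[MAX_WAITERS];
} ToyWaitState;

/* Recompute the smallest target LSN among waiters not yet woken. */
static void
toy_recompute_min(ToyWaitState *st)
{
    uint64_t    min = NO_WAITERS;

    for (size_t i = 0; i < MAX_WAITERS; i++)
    {
        const ToyWaiter *w = &st->waiters[i];

        if (w->in_use && !w->woken && w->target_lsn < min)
            min = w->target_lsn;
    }
    atomic_store(&st->min_waited_lsn, min);
}

void
toy_init(ToyWaitState *st)
{
    assert(st != NULL);
    for (size_t i = 0; i < MAX_WAITERS; i++)
        st->waiters[i] = (ToyWaiter) {0, false, false};
    atomic_init(&st->min_waited_lsn, NO_WAITERS);
}

/* Register a waiter; returns its slot index, or -1 if the array is full. */
int
toy_add_waiter(ToyWaitState *st, uint64_t target_lsn)
{
    assert(st != NULL);
    for (int i = 0; i < MAX_WAITERS; i++)
    {
        if (!st->waiters[i].in_use)
        {
            st->waiters[i] = (ToyWaiter) {target_lsn, true, false};
            toy_recompute_min(st);
            return i;
        }
    }
    return -1;
}

/*
 * Called after the LSN position advanced to current_lsn.  Returns the
 * number of waiters marked as woken.  The first test is the fast path:
 * if current_lsn is below every waited LSN, the array is not walked.
 */
int
toy_notify(ToyWaitState *st, uint64_t current_lsn)
{
    int         woken = 0;

    assert(st != NULL);
    if (current_lsn < atomic_load(&st->min_waited_lsn))
        return 0;

    for (size_t i = 0; i < MAX_WAITERS; i++)
    {
        ToyWaiter  *w = &st->waiters[i];

        if (w->in_use && !w->woken && w->target_lsn <= current_lsn)
        {
            w->woken = true;    /* a real system would set a latch here */
            woken++;
        }
    }
    toy_recompute_min(st);
    return woken;
}
```

With waiters registered at LSNs 200 and 300, toy_notify(&st, 150) returns without touching the array, while toy_notify(&st, 250) marks only the first waiter as woken and raises the minimum to 300.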

v6-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 94d36b07298fa2a46d26623c08a269cc6db6461a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v6 3/4] Add tab completion for WAIT FOR LSN MODE parameter

Update psql tab completion to support the optional MODE parameter in
the WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
 src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..8f269b5cb13 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5327,10 +5327,11 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
 
 /*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
  * where option can be:
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * MODE defaults to REPLAY if not specified.
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5339,25 +5340,41 @@ match_previous_words(int pattern_id,
 	else if (Matches("WAIT", "FOR", "LSN"))
 		/* No completion for LSN value - user must provide manually */
 		;
+
+	/*
+	 * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+	 * REPLAY
+	 */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+		COMPLETE_WITH("MODE", "WITH");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+		COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
 		COMPLETE_WITH("WITH");
+	/* WITH directly after LSN (using default REPLAY mode) */
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list (both with and without explicit MODE).
+	 * This fires when we're in an unfinished parenthesized option list.
+	 * get_previous_words treats a completed parenthesized option list as one
+	 * word, so the tests below are correct. timeout takes a string value,
+	 * no_throw takes no value. We don't offer completions for these values.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+	}
+	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+	{
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("timeout", "no_throw");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v6-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 7265330a02c5d966ef42cce3f9c15f4acae37ff4 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v6 4/4] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with the WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication, as the WAIT FOR command uses the
latch-based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..eec8233b515 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,35 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		# Map mode names to WAIT FOR LSN MODE values (uppercase)
+		my $wait_mode = uc($mode);
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
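
The mode selection described in the v6-0004 commit message can be illustrated with a small standalone C function. This is only a sketch of the decision (use WAIT FOR LSN for replay, write, or flush while in recovery, otherwise fall back to polling). The actual change is the Perl code in Cluster.pm above; build_wait_for_query() and its parameters are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Decide whether WAIT FOR LSN is applicable and, if so, write the
 * statement into buf.
 *
 * Returns true when buf holds a complete WAIT FOR statement.  Returns
 * false when the caller should fall back to polling (mode is not one of
 * replay/write/flush, or the node is not in recovery) or when buf is too
 * small to hold the statement.
 */
bool
build_wait_for_query(const char *mode, bool in_recovery,
                     const char *target_lsn, int timeout_secs,
                     char *buf, size_t buflen)
{
    char        upper[16];
    size_t      len;
    int         n;

    assert(mode != NULL && target_lsn != NULL && buf != NULL);

    /* WAIT FOR LSN is only available during recovery. */
    if (!in_recovery)
        return false;

    /* 'sent' (and anything unknown) has no WAIT FOR LSN equivalent. */
    if (strcmp(mode, "replay") != 0 &&
        strcmp(mode, "write") != 0 &&
        strcmp(mode, "flush") != 0)
        return false;

    /* Map the lowercase mode name to the uppercase MODE keyword. */
    len = strlen(mode);
    if (len >= sizeof(upper))
        return false;
    for (size_t i = 0; i <= len; i++)   /* also copies the trailing NUL */
        upper[i] = (char) toupper((unsigned char) mode[i]);

    n = snprintf(buf, buflen,
                 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
                 target_lsn, upper, timeout_secs);

    /* Treat truncation as failure rather than returning a partial query. */
    return n >= 0 && (size_t) n < buflen;
}
```

A caller would try this function first and run the pg_stat_replication polling query only when it returns false, which mirrors the early return in the patched wait_for_catchup().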

#60Chao Li
li.evan.chao@gmail.com
In reply to: Xuneng Zhou (#26)
Re: Implement waiting for wal lsn replay: reloaded

On Oct 4, 2025, at 09:35, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

Shall make LSN an unreserved keyword as well.

Here's the updated v11. Many thanks Jian for off-list discussions and review.

v12 removed unused
+WaitStmt
+WaitStmtParam in pgindent/typedefs.list.

Best,
Xuneng
<v12-0001-Implement-WAIT-FOR-command.patch>

I just tried to review v12 but failed to “git am”. Can you please rebase the change?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#61Xuneng Zhou
xunengzhou@gmail.com
In reply to: Chao Li (#60)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 16, 2025 at 1:49 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Oct 4, 2025, at 09:35, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Álvaro,

Thanks for your review.

On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Sep-15, Alexander Korotkov wrote:

It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

Shall make LSN an unreserved keyword as well.

Here's the updated v11. Many thanks Jian for off-list discussions and review.

v12 removed unused
+WaitStmt
+WaitStmtParam in pgindent/typedefs.list.

Best,
Xuneng
<v12-0001-Implement-WAIT-FOR-command.patch>

I just tried to review v12 but failed to “git am”. Can you please rebase the change?

Thanks for looking into this.

That series of patches implementing the WAIT FOR REPLAY command was
applied last month (8af3ae0d, 447aae13, 3b4e53a0, a1f7f91b) in its
version 20. The current v6 patch set [1][2] primarily extends the
WAIT FOR functionality to support waiting for flush and write LSNs on
a replica by adding a MODE parameter [3]. This made me wonder whether
it would be more appropriate to start a new thread for the extension,
though it is still part of the same WAIT FOR command.

[1]: https://commitfest.postgresql.org/patch/6265/
[2]: /messages/by-id/CABPTF7XKti620ZAOXPGuhSZxCKyaV_9stq7ruhnuxvshUxCeRQ@mail.gmail.com
[3]: /messages/by-id/CAPpHfdt4b0wBC4+Oopp_eFQnNjDvxwQLrQ1r4GMJfCY0XWP0dA@mail.gmail.com

--
Best,
Xuneng

#62Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#59)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

------
Regards,
Alexander Korotkov

#63Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#62)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.

--
Best,
Xuneng
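
A rough sketch of how the proposed WITH (MODE '...') value could be mapped to the WaitLSNMode enum, with replay as the default when the option is absent. It assumes the enum from the v6 parsenodes.h hunk. parse_wait_lsn_mode() is a hypothetical helper, and case-insensitive matching is an assumption, not something the thread specifies. Real backend code would report a bad value with ereport() instead of returning false.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum added to parsenodes.h in the v6 patch. */
typedef enum WaitLSNMode
{
    WAIT_LSN_MODE_REPLAY,       /* wait for LSN replay on standby */
    WAIT_LSN_MODE_WRITE,        /* wait for LSN write on standby */
    WAIT_LSN_MODE_FLUSH         /* wait for LSN flush on standby */
} WaitLSNMode;

/* Portable case-insensitive string equality for ASCII option values. */
static bool
equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == *b;            /* true only if both strings ended */
}

/*
 * Translate the value of a MODE option into a WaitLSNMode.
 *
 * value == NULL means the option was not given, which selects the
 * default (replay).  Returns false and leaves *mode untouched if the
 * value is not recognized.
 */
bool
parse_wait_lsn_mode(const char *value, WaitLSNMode *mode)
{
    assert(mode != NULL);

    if (value == NULL || equals_ignore_case(value, "replay"))
        *mode = WAIT_LSN_MODE_REPLAY;
    else if (equals_ignore_case(value, "write"))
        *mode = WAIT_LSN_MODE_WRITE;
    else if (equals_ignore_case(value, "flush"))
        *mode = WAIT_LSN_MODE_FLUSH;
    else
        return false;

    return true;
}
```

Keeping the mode as an ordinary option value means the grammar needs no new keywords, which is the point raised in the review: only the option-processing loop has to know the three accepted strings.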

#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#63)
Re: Implement waiting for wal lsn replay: reloaded

On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.

This is exactly what I meant. Please, go ahead.

------
Regards,
Alexander Korotkov
Supabase

#65Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#64)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

Here is the updated patch set (v7). Please check.

--
Best,
Xuneng

Attachments:

v7-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From bbf69248589db7056b05ab996ec1831aa7fbb2b5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v7 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 81 ++++++++++++++-----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 +++-
 6 files changed, 78 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..d54b2fd7ae4 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,45 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -341,13 +386,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +419,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
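
The per-type fast path described in the xlogwait.c header comment above
can be sketched in isolation. The following self-contained C sketch is an
illustration only: all names are hypothetical, a plain array of cached
minimums stands in for the shared-memory pairing heaps and atomics, and
using the maximum 64-bit value to mean "no waiters" is an assumption of
this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLSN;

/* Hypothetical mirror of the four wait types introduced by the patch. */
typedef enum SketchWaitType
{
	SKETCH_STANDBY_REPLAY,
	SKETCH_STANDBY_WRITE,
	SKETCH_STANDBY_FLUSH,
	SKETCH_PRIMARY_FLUSH
} SketchWaitType;

#define SKETCH_TYPE_COUNT	(SKETCH_PRIMARY_FLUSH + 1)
#define SKETCH_NO_WAITERS	UINT64_MAX	/* assumption: sentinel for "nobody waits" */

/* One wait-event name per type, indexed by the enum value. */
static const char *const sketch_wait_event_names[] = {
	[SKETCH_STANDBY_REPLAY] = "WAIT_FOR_WAL_REPLAY",
	[SKETCH_STANDBY_WRITE] = "WAIT_FOR_WAL_WRITE",
	[SKETCH_STANDBY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
	[SKETCH_PRIMARY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
};

/* Fail the build if a type is added without extending the table. */
_Static_assert(sizeof(sketch_wait_event_names) /
			   sizeof(sketch_wait_event_names[0]) == SKETCH_TYPE_COUNT,
			   "name table must match SketchWaitType");

/* Cached least-awaited LSN, one slot per wait type. */
static SketchLSN sketch_min_waited[SKETCH_TYPE_COUNT] = {
	SKETCH_NO_WAITERS, SKETCH_NO_WAITERS, SKETCH_NO_WAITERS, SKETCH_NO_WAITERS
};

const char *
sketch_wait_event_name(SketchWaitType type)
{
	return sketch_wait_event_names[type];
}

/* A waiter publishes its target; only the minimum for its type is cached. */
void
sketch_add_waiter(SketchWaitType type, SketchLSN target)
{
	if (target < sketch_min_waited[type])
		sketch_min_waited[type] = target;
}

/*
 * Fast-path check done by the waking process: the slower walk over the
 * waiters is needed only once the current LSN has reached the cached minimum.
 */
bool
sketch_should_wake(SketchWaitType type, SketchLSN current)
{
	return current >= sketch_min_waited[type];
}
```

Because each type has its own slot, a replay waiter in this sketch never
triggers a wakeup pass on the write or flush path, which appears to be the
point of keeping one heap and one cached minimum per WaitLSNType.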

v7-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 9dde4e330844d827f783ab2caca505036ac884b0 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v7 4/4] Use WAIT FOR LSN in 
 PostgreSQL::Test::Cluster::wait_for_catchup()

Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.

The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 33 +++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..276350c5f13 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
 	$mode = defined($mode) ? $mode : 'replay';
 	my %valid_modes =
 	  ('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
 		if ($isrecovery eq 't')
 		{
 			$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,33 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+	# This is more efficient than polling pg_stat_replication
+	if (($mode ne 'sent') && ($isrecovery eq 't'))
+	{
+		my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+		my $query =
+		  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${mode}', timeout '${timeout}s', no_throw);];
+		my $output = $self->safe_psql('postgres', $query);
+		chomp($output);
+
+		if ($output ne 'success')
+		{
+			# Fetch additional detail for debugging purposes
+			$query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+			my $details = $self->safe_psql('postgres', $query);
+			diag qq(WAIT FOR LSN failed with status:
+${output});
+			diag qq(Last pg_stat_replication contents:
+${details});
+			croak "failed waiting for catchup";
+		}
+		print "done\n";
+		return;
+	}
+
+	# Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v7-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From 62db341638bd9515584f9c24b0adfeec61ada252 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v7 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the MODE option in WAIT FOR LSN
command's WITH clause. After typing 'mode' inside the parenthesized
option list, completion offers the valid mode values: 'replay', 'write',
and 'flush'.
---
 src/bin/psql/tab-complete.in.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..5cb8de14e8e 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5329,8 +5329,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: replay | write | flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5343,21 +5345,21 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.  mode takes a string value ('replay', 'write', 'flush'),
+	 * timeout takes a string value, no_throw takes no value.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'replay'", "'write'", "'flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v7-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 5136083fff62902515c82118250034e0ab75cf2f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v7 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'replay' (default): Wait for WAL to be replayed to the specified LSN
- 'write': Wait for WAL to be written (received) to the specified LSN
- 'flush': Wait for WAL to be flushed to disk at the specified LSN

The default mode is 'replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.

The 'write' and 'flush' modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.

Also includes:
- Documentation updates for the new syntax and small refactoring for the existing ones
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for write/flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 182 ++++++++++----
 src/backend/access/transam/xlog.c       |   6 +-
 src/backend/commands/wait.c             |  74 +++++-
 src/backend/replication/walreceiver.c   |  19 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 305 ++++++++++++++++++++++--
 5 files changed, 508 insertions(+), 78 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..122012f5613 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    replay | write | flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+   On timeout, or if the server is promoted before
+   <parameter>lsn</parameter> is reached, an error is emitted,
+   unless <literal>NO_THROW</literal> is specified in the WITH clause.
+   If <parameter>NO_THROW</parameter> is specified, then the command
+   doesn't throw errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +80,52 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>replay</literal>: Wait for the LSN to be replayed
+            (applied to the database). After successful completion,
+            <function>pg_last_wal_replay_lsn()</function> will return a
+            value greater than or equal to the target LSN.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>flush</literal>: Wait for the WAL containing the LSN
+            to be received from the primary and flushed to disk. This
+            provides a durability guarantee without waiting for the WAL
+            to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>write</literal>: Wait for the WAL containing the LSN
+            to be received from the primary and written to disk, but not
+            yet flushed. This is faster than <literal>flush</literal> but
+            provides weaker durability guarantees since the data may still
+            be in operating system buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN.
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+   for the LSN to be replayed (applied to the database), which is useful
+   to achieve read-your-writes consistency while using an async replica
+   for reads and the primary for writes. The <literal>FLUSH</literal> mode
+   waits for the WAL to be flushed to durable storage on the replica,
+   providing a durability guarantee without waiting for replay. The
+   <literal>WRITE</literal> mode waits for the WAL to be written to the
+   operating system, which is faster than flush but provides weaker
+   durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+   last modification should be stored on the client application side or
+   the connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   <command>WAIT FOR</command> should be called on a standby.
+   If a user runs <command>WAIT FOR</command> on the primary, it
+   will error out unless <parameter>NO_THROW</parameter> is specified
+   in the WITH clause. However, if <command>WAIT FOR</command> is
+   called on a primary promoted from standby and <literal>lsn</literal>
+   was already reached, then the <command>WAIT FOR</command> command
+   just exits immediately. If the replica is promoted while waiting,
+   the command will return <literal>not in recovery</literal> (or throw
+   an error if <literal>NO_THROW</literal> is not specified). Promotion
+   creates a new timeline, and the LSN being waited for may refer to
+   WAL from the old timeline.
   </para>
 
 </refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +265,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +276,31 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +310,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
 	LWLockRelease(ControlFileLock);
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..b3f1f7b8a69 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,18 +29,35 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* "replay", "flush", "write" */
+	const char *verb;			/* "replayed", "flushed", "written" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +65,30 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+								"MODE", mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +148,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +181,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +195,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +214,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for then walk over
+		 * the shared memory array and set latches to notify the waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..b589cecc028 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
 # the WAIT FOR command.
 use strict;
 use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql('postgres',
+		"SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'write'");
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'flush'");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
 # within another function, or inside a transaction with an isolation level
 # higher than READ COMMITTED.
 
@@ -125,7 +184,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,7 +267,23 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "MODE": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'replay', MODE 'write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple REPLAY waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -226,7 +301,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +320,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +329,200 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'write', timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'flush', timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently.  We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN.  We stop the walreceiver and pause replay to
+# ensure all waiters block.  Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('replay', 'write', 'flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes.  Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote.  Check the log for the relevant error messages.  Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('replay', 'write', 'flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +532,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +557,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

#66Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#65)
Re: Implement waiting for wal lsn replay: reloaded

On Fri, Dec 19, 2025 at 4:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.

This is exactly what I meant. Please, go ahead.

Here is the updated patch set (v7). Please check.

I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting the
mode parameter. Should we allow this?

If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?
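
A minimal sketch of the unified 'flush' encoding described above. The
enum is a local stand-in for the patch's WaitLSNType, and mode_equals()
and resolve_wait_mode() are hypothetical helpers written only for this
illustration; nothing here is taken from the patch itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the patch's WaitLSNType; values are illustrative. */
typedef enum
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH
} WaitLSNType;

/* Case-insensitive ASCII string equality (hypothetical helper). */
static bool
mode_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Resolve a MODE string to a wait type.  A single 'flush' value maps to
 * the standby or primary flush type depending on whether recovery is in
 * progress.  Returns false if the mode is unknown or is not meaningful
 * in the current server state.
 */
static bool
resolve_wait_mode(const char *mode, bool in_recovery, WaitLSNType *result)
{
	assert(mode != NULL && result != NULL);

	if (mode_equals(mode, "flush"))
	{
		*result = in_recovery ? WAIT_LSN_TYPE_STANDBY_FLUSH
			: WAIT_LSN_TYPE_PRIMARY_FLUSH;
		return true;
	}

	/* In this sketch, replay and write positions exist only on a standby. */
	if (!in_recovery)
		return false;

	if (mode_equals(mode, "replay"))
	{
		*result = WAIT_LSN_TYPE_STANDBY_REPLAY;
		return true;
	}
	if (mode_equals(mode, "write"))
	{
		*result = WAIT_LSN_TYPE_STANDBY_WRITE;
		return true;
	}
	return false;
}
```

Note that the sketch resolves the type once, at the start of the wait;
it does not answer the promotion question, since a session that picked
the standby flush type would still need a rule for what happens when
recovery ends.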

------
Regards,
Alexander Korotkov
Supabase

#67Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#66)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

Thanks for your feedback!

I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting the
mode parameter. Should we allow this?

I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.

If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?

Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs because either the requested LSN type does not exist
on the primary, or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command and the wake-up calls for flush on the primary are
introduced, the question becomes whether we should still abort the
wait on promotion, or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we had better stop the waiting and report the error. In this case,
I am inclined to split the unified flush mode into something like
primary_flush/standby_flush modes, mapped to
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
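
A rough sketch of what the split-mode variant could look like. The enum
is a local stand-in for WaitLSNType, and the table, ModeLookupResult,
and lookup_wait_mode() are hypothetical names invented for this
illustration. The mode strings 'primary_flush' and 'standby_flush' are
the ones floated above, not an agreed syntax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for the patch's WaitLSNType; values are illustrative. */
typedef enum
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH
} WaitLSNType;

/* Outcome of a mode lookup (hypothetical). */
typedef enum
{
	MODE_OK,
	MODE_UNKNOWN,
	MODE_WRONG_SERVER_STATE
} ModeLookupResult;

typedef struct
{
	const char *name;			/* user-visible mode string */
	WaitLSNType type;			/* wait type it selects */
	bool		needs_recovery; /* true: standby only; false: primary only */
} WaitModeEntry;

static const WaitModeEntry wait_modes[] = {
	{"replay", WAIT_LSN_TYPE_STANDBY_REPLAY, true},
	{"write", WAIT_LSN_TYPE_STANDBY_WRITE, true},
	{"standby_flush", WAIT_LSN_TYPE_STANDBY_FLUSH, true},
	{"primary_flush", WAIT_LSN_TYPE_PRIMARY_FLUSH, false},
};

/*
 * Look up a mode by exact (case-sensitive) name.  Each mode is tied to
 * one server role, so a mismatch is reported explicitly instead of being
 * silently remapped to the other flush type.
 */
static ModeLookupResult
lookup_wait_mode(const char *name, bool in_recovery, WaitLSNType *type)
{
	size_t		nmodes = sizeof(wait_modes) / sizeof(wait_modes[0]);

	assert(name != NULL && type != NULL);

	for (size_t i = 0; i < nmodes; i++)
	{
		if (strcmp(name, wait_modes[i].name) != 0)
			continue;
		if (wait_modes[i].needs_recovery != in_recovery)
			return MODE_WRONG_SERVER_STATE;
		*type = wait_modes[i].type;
		return MODE_OK;
	}
	return MODE_UNKNOWN;
}
```

With explicit names, the user states which role they expect the server
to have, so a promotion during the wait can be reported as a state
mismatch rather than reinterpreted.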

--
Best,
Xuneng

#68Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#67)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Alexander,

Thanks for your feedback!

I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting the
mode parameter. Should we allow this?

I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.

Flush mode on the primary seems useful when synchronous_commit is set
to off [1]. In that mode, a transaction on the primary may return success
before its WAL is durably flushed to disk, trading durability for
lower latency. A “wait for primary flush” operation provides an
explicit durability barrier for cases where applications or tools
occasionally need stronger guarantees.

[1]: https://postgresqlco.nf/doc/en/param/synchronous_commit/

If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?

Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs because either the requested LSN type does not exist
on the primary, or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command and the wake-up calls for flush on the primary are
introduced, the question becomes whether we should still abort the
wait on promotion, or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we had better stop the waiting and report the error. In this case,
I am inclined to split the unified flush mode into something like
primary_flush/standby_flush modes, mapped to
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.

After further consideration, it also seems reasonable to use a single,
unified flush mode that works on both primary and standby servers,
provided its semantics are clearly documented to avoid the potential
confusion on failure. I don’t have a strong preference between these
two and would be interested in your thoughts.

If a standby is promoted while a session is waiting, the command
should abort and return an error (or report “not in recovery” when
using NO_THROW). At that point, the meaning of the LSN being waited
for may have changed due to the timeline switch and the transition
from standby to primary. An LSN such as 0/5000000 on TLI 2 can
represent entirely different WAL content from 0/5000000 on TLI 1.
Allowing the wait to silently continue across promotion risks giving
users a false sense of safety—for example, interpreting “wait
completed” as “the original data is now durable,” which would no
longer be true.
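
To make the timeline argument concrete, here is a small simulation
sketch. XLogRecPtr and TimeLineID are local typedef stand-ins, and
ServerState, WaitResult, and simulate_wait() are hypothetical names
used only for illustration; the real wait is latch-based, not a walk
over an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the server's LSN type */
typedef uint32_t TimeLineID;	/* stand-in for the timeline ID type */

/* One observed snapshot of the server while a session is waiting. */
typedef struct
{
	TimeLineID	tli;			/* current timeline */
	XLogRecPtr	flushed;		/* flush position on that timeline */
	bool		in_recovery;	/* false once the standby is promoted */
} ServerState;

typedef enum
{
	WAIT_RESULT_SUCCESS,
	WAIT_RESULT_TIMEOUT,
	WAIT_RESULT_NOT_IN_RECOVERY
} WaitResult;

/*
 * Walk a recorded sequence of server states as if the waiter were woken
 * at each one.  The wait aborts as soon as recovery ends or the timeline
 * changes, even if the numeric LSN has been reached, because the same
 * LSN on a new timeline may describe different WAL content.
 */
static WaitResult
simulate_wait(const ServerState *states, size_t nstates, XLogRecPtr target)
{
	TimeLineID	start_tli;

	if (nstates == 0)
		return WAIT_RESULT_TIMEOUT;
	assert(states != NULL);

	start_tli = states[0].tli;
	for (size_t i = 0; i < nstates; i++)
	{
		/* Check for promotion first, before comparing LSNs. */
		if (!states[i].in_recovery || states[i].tli != start_tli)
			return WAIT_RESULT_NOT_IN_RECOVERY;
		if (states[i].flushed >= target)
			return WAIT_RESULT_SUCCESS;
	}
	return WAIT_RESULT_TIMEOUT;	/* ran out of observations */
}
```

The ordering of the two checks is the design point: testing the LSN
first would report success for 0/5000000 on TLI 2 even though the
session asked about 0/5000000 on TLI 1.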

--
Best,
Xuneng

Attachments:

synchronous_commit.png (image/png)
�q��#|���g��6[n���{G����&���*�=�5���7��L��T����4k�zM��z�_Y���f�7k(�QT�:������|"S����e9d��~���E�K�}�m�>w��[m��z���j����<������S�]V�-9�|A�����o�es�:�G������|��Z������k����j�5�?�L���M�4P{i>l���]S-^�T���o5/�����e��Xvt
-#!#n�AS��Xu�U4��'�:���/��1��T����y+��8���������a�C�:�-7�`9����=w��J�y���\�h�
�1?o���F�O\C��<��j��	������
�-~rH���}������~����/����}?���}�8H���M�c|O3��������k������N�z��z�E��0�}��s�)+���<8|��*+�l����;�>3�@[�#���Rw=�u����5a2�-��i���9���j�i����.��1�}����o�m9�.�����#�Qoi��'����.U�TQ��q��]�\|�U�{��0n��ff>��
��`�O��/����������~����b~�"���~�*�����4�������S�}�4���u��~�5!�����o����#_���� $�b�
@����SS;j�
&:�0�bhP�^p�Q�TF ��-^��w���_w�����w������=�X#�����?��u���C%�r�!{������^g�#��|�S%{��D���s�u�FQ�����xY����;���{Y���G������Z�y��K.�|�T3"w<0B+z������C�4BM�^7��8�������\�;���n<�2��(oO<��Q���J�'t_q
J�6ZG���ls��[��erc����6
k��.3�m#��(�B+=�%�}���F��*����1N�� ����bD���A>�l���W�l�(cF��c��t�z���g��A�����^����u���n���|���I����'#�s�
_1����"��g�r�j0)���g_�Y �T�%��tn�k���Jb�����s~�}#��)�	��w�6h�����|��3^
������������Oxo���7���]���1���p�����;�u�9�P4b|{��Czu9K5��>�(���r�+8%Q��b��
O<j?3wb��#����1�Gp���9��+������|�^���Wp�];��s�sg���s��Ds-��?{�����9���~��T+~]���d)������!h�=Z��S� w�vTC)E8����X�������G�_��d�U����sh�p��>%������1g_�����*��}a��������]�2����y��?�&��4|@gQ�a�:^X��O�r������[�y�=���4�`
��o�!�����}fi��P��0�#�Y�a��C������{�ch�`pA)D�����~�{����3J]M��*��`�AV��
���\�:�R�����8���z���x�C/�q!
�B��/�t�_�`W�<D�	X�_x���K�q//�Q(�P�������@D�;�r.��Aez��5�{<�w���8z���w���vSu������Z~�6^?�!c��7�Q�N4F|~{	�������v����i�!As��L���^;���t���
-{�;���g��L�8��� ���uU8r��g_|���0A�\��0��z0�L�����}��>;^�}��}�v�A��l~������;���9���n����;oc)q�"����|*�������a�K�G�p�_UK�i���'@�.f�������:�>�{y������K�+9����w8xQ�r]�G�j�X�����]��So�-M����V����]����gl���n�o2��1�.d��p}P���K�SJ!?N�E3���!e���F)J,3� �����7����MU�O2�#��B$F�����D"l�~s�������V��������q`#E���(Q�1i��������x�c�
����z�8 R�������\n<�;\>4��]���w����4c������$�m��V��9�g�o���-����<�y��)�z�4�W���=��DvA_�����|��CT�!��4^���.���'"�Z�eF�-u�Q9xt���q�n��{>�;�'}o?oz�3u{T��F��"��~'k4��b
��?�����;�;W��������O7����65�1��m���U'��#-��3��v�]+O��i���Dz��5N\a��9���$���ME�m���%�%�n�ko4D����V���J��;���C�#���M2��D��|�:M�LP�s�����X�&���q���z��A��gG�/�g������-��83�����h��1R�cM�a��vL��� r����>�t��
!	|�h���g�BH$��!*�0��cD������;��t����wxOs��G^T����s����/m�QjQ���dc�}@�cfk]W�B�]���������n<�,����� ������5a�Q@3V0Ge$�+��vU'�	�G�
�2CG��D[I��&�O���v�OQA����c�����F���X�l����;NGl6���S�t�i�4�q���;X���0������]�3��5���������4�N�x��'���������|���a�_<?��i_����/0���(��mB�s�?L�j��_�9���]�L'I�����:��������-�0�sDR��OT2u�K�s�a��{�����c�u���z|���������NA�8�Y��i����'����/{���yb���|�';�[r�Qe\�������zN��D���^�s������7M�k�[�,�"m^����}�����}�
���v��Hgv��SbZk0��sQf7k��R!���.A�V��������{g��j�:"�e���!���(�-���G!�R���N.�������D��@;�b��7{����O:���������X� ���������r.>�xsb���8����������nd�����k�8!� Z�:d�x�Y���-k?���������\������C�<����:2�H�|�9�K-{h>;Yvjh0����t��R?r���=����:��v���)�	���>��j�$�c��r����dpc]p�+DV�����|y8���8��>�x}��W��d}��
�tt@V�������O�Ev�~�-Z7TU;v�c�K�r8!��������/|a=-�Cp�z�m���O�r�w��>����������g���/�v�M���#�w�=�{�t�Wu?��#���>�Z��?�LLd�j��c�t��ab>���Z�2���F�W�>���Y��8��l`���z�=�5�T�l����1���w�.7��d���gm/���W�w�EFB��=!d	�EtG��D�o�G�I�1`<�9�F��e%����<������9��@�.�����n2m��>h�`��Po�)�%!28.�}�%2F��Z
*������u��i���bi�o�����ni�v.�_JC�J�;��DG�\
���cp�y����y1�Z�.���p=������EE��O���B����F��aZP����v��6�����jk-"�`R�E�^��|���������E��.�*Z�u|��e$1�x�Ea\�K��w��e����B�KD�;��`x�u�k��z�2�y��1��A����D���j�!)�`D�kc�7u(J[&a���k�g�\���w;Y�1�\k=���Q��E.���6�����t���1��A���S�@`|�z��]-�]�K.:�(�0�\3���s��-��2x��R�0n\&D�b����M3u(Z�(�qb�1.�>���|
��SK��)/�[�yF�������(w0�bp%�ST4�1����9l/S$F*�~)t1�v�Q�8@�"l]0�!�� x�������v�o�6�A�������?3f��>��.^b� e��-�4�^{�#=o����������t���)z�c8,���`�m��@���A����;�(1���F� c���0idP>��[��x�����~��K�#F������EiJ�3��NZ��5V���������~��X�|����!�0���!�Q{-��� $��J#7��G�������"���i��(;��jc�����So�.Z�u�q�A/|?���vC��������gB3��v��%���zMg�`1R+�1`p�#�zv>�(��x^.��1������e���2����I5�3U��ay�>�>� ��r\�w;��V���u�����G�q
v���6}�oM��g���uI�-����2����^v��S���l�R�g\���I#'��^~W=��P�{	���:*g���W�l����my����_Y~�0���\�s��1F1�z��;���Z�D^�������e��c����"�A��v�������G��:�k@��.08������>������yb�w�2�P��,y��T���1��Vk�yD�)���G��f�EW����!��y�;���)��"��Ga[����^����j�}W;^��@�5pp���Ftn��'-�a�2[JX��O��K�D�
�T������`���7��w����, ������c��yh�����xl�i�1�!�^;�>3�JN:_�L��n���.���0=�]����+�p	��v~�|R���1r����'_.������)����P����:�P����e�t���~\�
e�(k�����8j��Fq��4�:�I��,dU��+�I�������wK���sa�V��*���i	W�$�zn1L�%o�&���|y��"Sv�;cD�y�e8����z��$���z���^KrI'�A+��\t�����!�������iC��� D(�N��[K��hCj�^x��F=(���P?��u��}Io�g�#}(��DQbh�"�����
�u�]G��N���dT1�<F�~�� r�
���6�R�P(����`�p�&�������8���@P�p��5��hk������{8�K���DP�{�O��^���~��J{�>R��!�y?���.a����r��E��*r0������S����:
��&�e��v,�O�;��}��2�b�"Z"m5�}��G������s���U:�"5=��q��D����7���q����zo�%��I�"r�t�sA�)��B��]�6s	��b�#��������������n���u��r�$����/��/�0���t�D�3��)��~��xT��4���z������o��Vf<G	��o�`a�8����s1)�I��5m�\���z[m8y���V(��=�;i+?:�;�N%I�(�]����}�k8���J`�jM��%M~(^?+n�k0�f�?e~y��%�*����p�"�g�����(���:d�J�#"�����d� [�|?���q�Y�r���y�O���'�<�����'l2@M�e����f��W��]��6���#k�V��I3���w�v^��9�H�qy�� �X����B�����C�[��9t��:�#��:u�g^r�� .���hG(N�e�#�
���;�T��|�}l����pY���/��Wd������ �!�R�.��[��ee)�����	�C)O����,FY�����,��z&��qn���2�t}n9�<���
�/���kaz0�NSn�z���r6���������Dv���?m��f�J��s-�������f&���~��Q
�i�u�M��Y�������1vg��]��:���>������������
B���p��`p����a��~���^{�J����3��aj�s08�������R�.�@�~B�-�O"KH��{�/P�L�K�F���6��a�&�/�\����q�v��c�c\y�-+����!_P(�w��uI�r�&���3����8(*���\m ��hK��^���Fx�W��{0���v�p}��8�}\��a��������:U��
^g^��Wy�.�l,?@+D��2�������c����E�ySh����[�H�E�Q�A}��6��E#��b�4��A�\V\��f�
g_�7����o�ww.�3�r�e��E�0Gu��~����py����~�������]H*h����+�u�R#�����7��_]��z��%��5�f5��`��?���pX��Zq����zO=Q[��/��&�b�����Gu&��^�������v�2��t�i���u�K=:�a��������)</�t��'*�Q��V|�����6����U?*�5�}����s�
�'��]{�\�����e��ge�Ms
��������|!����K=s!R�������e��u]����7u��������'<R-{��!����2�~�2|2�G�TdSa�� y#�%�G ��o�������6B=tv&R~3��eP6���x����s�����K.Y(�4u��m�����{\`!d8 ���y/��;���f�"j��C[E�]�����-���<\V�0�{��5����YG=~c����f��x?m��qF���;������t�[J��B�Id����?�n���M�g�|��^�q�p��}-Lf�i�KS�b��a&{T�Lx&����+��p��D�C~\���M��#9-d�����{��](�����*��}���q��q����:[G�O�����t�t���uI���N��������_i�v.�4u��'\P*������gh�bd������Z����nRL� ��e���@d/4@�����0��������p��m�5�\v�8
c�����#�y�#{�z "$�����T�w;Y�1�\l����u3��8�%������k:/B�0�a��d�5�>��������n����>��m"�IYK5/�9N(�
M������Oe<�!�~Qi�op
��_��paDT�M/1
��TEA��z����O�j"���u	K	��G���_j�1��x�}{ILe!�����(��\r�I'�>�U������MG�ep��������q�9]�Wu'�w
�m��)+��*~��m�+��7_w��{����������6�{�A�p�{�-��������8�.u�l
�����_���S��(T�Lk�����t�d����Xl�������
�-���y��R�>Q��
�6AkeV��X�`������Xg���uI���������A������{��|����[.��LBD����}H62@���m������DF� ������=����C��GD��:B;�>������g��^�z���nc��Y��9��&ub��|��~��{'j�A��.Y(���S@���6���C����?w��3��Y��:.Q2�\syk�+����~�\.�qh�����8y�h�i(����L�P���g���N����~'�C�9�^�Z���������z��u��3�1w-J������8�t�\����2m_���u:M=z��z�b���V�{����\
�8���O=�F�Z�'z����ky�4�:���oIy��CqdU�Ib����<Ay���90M����`�������
B���p���{R �����iU����BDS\B�}�
���W��P���?q�e^}�#u���{X���5;W�!d?�0���!�����.A�x�A�%����Cr��r[u���2���<�7�����W
�j���	���~�m�k��)I]��K�{���%���q^r�(I�����E�#����
�^Js�P�,H���qr�S����k��
�;_5�`#����b���u���t�������0����������K�������w�n�my������n}�+���O�'s����C�A;B������}k\����:����{�.e<v������(	�@��q���h!�~��XY��h`�;<��(�����J�HS?�_�c��r�Q���Vs'{#�F��N���1����v&��B�p�%�/�x`��vy���o<������������g�����	���oa�����n���:����Y��bY������`��0�rMs��&?����7�5�����~��2�m�B^���-;������Gu�����/T*"?��+.9�D��R�G�[J]���q��Rp�1���X����+K>�h�x�nf��n��Pn�����*��V+�U<3���t
����A�.Y(���O��4�1�����C~��;��p�>t/�7<�������kI�{�����"
G��$<\��0���9w��/���F��'s���G��U������w����v������$r������Uo���at�j~�Udu���#��q��S4s�m�.5
�=������l_��Y!M=K�m6��'�j#/#7�%W����=�/�|�ydR ��o����R��S������\G�%Y?��;�xY���p���.OP^����;�uZ�V������[��i�A:����K#���^�����H�Q����n���0����������.��q#o�}������QD�����b.R��@��u������Q���[�}'`��KE;��R��q��,�;�N�a�+d�x�-i��b����]�����~
���\��Oq���w���#Z�v
��Z���B(j�E[���]�M��8��V����115iT����Z�Hu���d���\��T����K�i��D:�)?�X�V[��(����]����:Z���4�q	G
���+��2������J�]�3�c��W������/������;p�g�]�#���AP���) ����>I~]���+�j������e�������u{���}�!��e1Vx&��Po���w?N�}���t��8�`s*n�c��%��3�;�Ge�X������v�����E��~�$D�ku��Pj���jz<�1��8��\��V��k�R�����IsQ��?���&�������y>0j=��Na�[���yU@��'��
����Y�������rMs������(~(+^?+.���LY��Xe>��T�k0uL��k
*8T����H�H���	��9h�U/Y2.?��?��9B$-rg\"����q���zv>��J�k��
e/�[��I��f�X�vT�|��H���Lv�=��*��!�d.����� ������:wq�(L�1]r�=?��*d���PWA�0�%K��N�|O��e9�����?��wA�o
�m��:�t��/hfL���:�P��^;�\����C�������t6,�]�]1F���C6B>m~8�������r�ywM$��l?M�f��uq>m_���u:M���{g�>]��V5)����}�������������������YG�2���O�K����v��M�o���q�q�W��GE4��)���I�����UV��$��������>+����i�Zm�EB>1�:�p�
�Ez���N%_��#'id������Kt?����+p��2�3I���P�<���5���8�Bg'�0��
��x����/DZ�y�!�8m������'����4���'k�Dh����^b�[��%��H]"v�(�6��(�e����Y�"L�u.�Q���\�
��I�'��V��<I�b�f�u��%�fn�{y��_��,
�f���}v.��������0�g�~�q�B�m��Ea����V���%�.-����v���7�H�����g�UW,��g��i��#u��:�
r�	������7��G����Y��AxF� pa��J&s�CyV�kWT������t�uU�m73)�qv���70�Q����I�w�	�t��~�B����O����7Z���N�EV)��Ai�5�g����ai���Gi����K���'}�,���u��$�������������l������M��$��b
f-����O��:������?D!���1�R>c���8�BdEa�[�Q'!"�X;�Pa����6�����<�[n��tR':w&z�(j�������8��]!���7	��8��I��k����{L��C�0���Cna��$��X]B\L�������~��M����{����&'K8R|5a��Y0{�����8}-����VD�AI�B��A�}��|���p�������������Q
�`��\�d�t�-mY5�^�>?�������i����;����{��a�V{M���V>���r�(���x����S\�eY��O(.����;Y��x��8�����[)R0b���\"�"Q�o����Q�^�������q�^m�SQi�ri#�����o��t?�,(?n��uYA��/�0n�w?]����Q�~�b���u�e�1�b%���*�_9Z=��+���3K�}PC����-����J���9��&�3:"�r?J2�K����p�5o�������E������{*t[0J
����jYb���(:���@��`������������� em�=���)�!"���K[n����aY��|�Y�=�%�������c�(��&�fs�*���{.�r�C��i�]����
�o���u��j�� ��d\.^�T+���	���a	
R��2�D���2s
�:]�����|����)��������/_C�K��[�{)�M��|��"�~��j��������,I�?�5-i],6Y�Ca�~�\�n��������l�3�3m.����d��:����y��r`�����}�i�>��;5g^���-g�Q���ED����]V���UsM����+qd�=W�Tz��c�"�K�����8��tzN��)�;7�����a���g��/
�-������e}�Jqo��~�:c�^'?N�q�-#��$<\��0N]�&�x������F���$���:<N66�A��?L�q(l�<���=��������m�n�s���c�}����S�{��=N_��N��F�x���:��e�:k�,-��%�� �>j~v��,���+	-�2�H��=2w.U?�2�X?*�����\�d����BVMb���G������
�Y�q�B����0]��F�	�,�������]b�vct*�!1S�:�����x���A^�B��l�`n�}�����W��P�BX&�d�����	~�r�f������v��(�Qv�I���=!�"�������B��Q���x��FT4La��4�	�e!XPn���������_d����]���R���u��	c*\�/
�-���eDy�u# ���b�%��n������R�-3�;��q|�Q�����j�r�w?�B��N&��xvm'EZ,��c�P�Q�!��{x�V~�F��5>�}��Lzb�~d�Q.k���Z�4Xn?���������Vs�u�/R���2H���/��e�+7����L������W���	�� ~�C���{�SAJ��ptZ�&jK���u�gB(�I59q�4�[�1���%��vd�p��5��4������2�~T,kp���rMKZ�9����x�\
�vo�������Q���<�`��A�ipm��4�A���]��j��l�~�K����r��1����d�����h�&�t��������u?�0�������?
����I��������A�m�����������E|B7���;�P������[����SS���"�q��{y~��']p��0.I�kV�����j{�q&5:�����q(+~8���&l�>��:xK+��l�
mh����}e�8}-�:m������9����%/\��8,�[����KA�N���r��T�s�5E��}�a���gT���������M��U7�}�g����F~k��������\[���G�K�~���d�$Xe��s�?mnV|G����'�3L�j�s��Q���G>�������r��v�1_������J�����=�����K5\(C!,�w�N�x�6P��0<�-62wF�?u+�^���?�0���{�]O��$�6��(�e����e!XPn�����y�}&�*Q�T'������eY��V�����C�-�3���F���|����*}��u��r;�>RP��LW�Kj�u+�����r�
�d�s�\9�T$n���$�j�0�r	�Q���Q��`��;x���w���w���W_l<���U�a�%,k������=��������C�c�wI:�g5w�WX�m7U�E�����0G^��Y�� ��U��;J��1��_s����-�?��CAJ����8n�#��w��h��z"�H��g���5{�B���9�~T,kp���rMKZ�7����x}R����T�RK��5('5��e�]��'��S�k��>A��6�Z�,�;����P�l�q�R.h��\yH�s�~�>�R�k�����'_��$�c�k��dbo������W�gnwy��|YUp�(&nR��U��t��Zg�p)j�;x�]���hC�u�}V�
���pY����$�Yp�JY_g���UGo���O8��ke�G=��Ya��q���#s����[��M�������~�L�g�u3���eg�C�K�y�4=�@A���O��s���������Q�,)�~~u�sS�:��'����US�3S�>�x����:M���"I��$�����-�&1�f���x/���2��]���������3��7L�j��3���)d��p��0�sa��O�d���g�m��uaJ>������_,�
�P�����u(]I�Z������7W��ls���*����n�\���
�5=[�������i+$�6��(�e�}h�@IDAT����e!XP�m�����"�.�3����Q4.Y��]uT*" i]�s�������wx�c��K{����������z�Tj�"������sv�~s������J(��������Si��D~���D����znc�>`���'�#��Pz�k����$�\������"9U�\�"�����O�x�DEF�������������nv������;O>fu����N��~{�S������I���r��k�7������]j���P��'��;�95�1'p=�����$��2Qk���l�FY��bY�sy����\�B�F�C�3-�������a��y�o���w8�r}Ws*�-;j
��$sOY����>�8��L
+�s�������c�e8��E�{����?o������������K�S��\G���e������i&s({���-N]/>�d)�8����{��U|s-�Q25i������b2�����n?AT2r%�.E�{�m���r�)����[Ny��?��8����o�����,���]��m` x��x��,���?�>#�{V�=/���cu�d�|��$���k�]	:����S�w�!Rn���W�\� �>J���*��x��:Zr,k������:���c4��q���P]x�u~�-u��7������1/������z�{��>,���'�{����q�������{VY�o�q��:���Y�������_E�Z�������zGg��1�:����G(�r����W�� ���M�ix�uF*
�
.�H�7�{=��`
a���N���p������oc�/Kr=����B�������iP�w��c��a(9�
�?������7G),����8�yY��op��^:J��F��3��K��1RI�}(���A�JZ7��^��.��T��*a��;}���K�F����#h�6��K����@m��<o�������g�~�ip)���'-��\�e.<��#�^�no�y/��F]�J&Ro���}C=���M�HIG�(��fn�_Yp��M5�Bi��������vF���p�?��G�I�����:�������z�yEZ�4(�94�s��]a�����2�6V1A�I���:��R��|g����Dn]U;�'��:���:����XG/�@�(�~T,kp.�����K]h�$�P�>�����c
���^~O=��H������f�1
.[�9�t�)�5��EQ�\�D����$���~ur��a�=�h;/��Cz���m�"F�}��0������5/����/!����+��c"s-^������=4��s�j��V*nzj��'�-r�!������$���A������vM_������F�{��C��B�e/cR"%����q�I:^�0���B�H�S�~�f�O�}�����qX�����������=�6�����F�l�u`�"�����4�2�r��~��Co}�l��0rS/s]�g~����,m4$SW�Ea���\��y��b�@���:K;7Ea`�#���M�7��k�u�K���5��a�)7�O��t_��8r]����d%�&�*����ip��;���~�(J�j��Ug=d�NS_h��OA 	b�u��Re%5����~�Z
O&�.�t��bX�z��T���3�h����h����>��;�O�S=��P�BX$�d�|�MU�6��Q�S/�A�i������b
���p��
j���{��'���t_��=_z}�"�����a=�ri#��RX�};�q���,��
�`i���W(��[�}F��5��
��9�p]�������K]�/��`?��W��F�*�Y��SEwh}�����^S/��A�[��`?[<e���PF��|e�s��\d�E��$E�P���1����\�B��W�5Q�nzX���a��l��o�����|w�o�o���ny6��c�a�u�q�2�&�=�5�f5��_��?�������?��������GQ�}���t�3�'�C�\��k�H/�8�0�.4�7�s[?�d�{�)yH���;��1�O;s��;�T��|z�5��0�;B��aP������g���R:G~(M^�g��r�j���&�
)�����=�#� n�k0�I:���L��(m.�cO@����n�NusC��:���`��C�������_~2��$:���Fd ��y�v0<B;B����z��7K�#Xo�v���Q�DOx�F�q������Y���fAC_����K}��|�������[�u22?�?q��c�K�7c��X�I:����~&����q��t�fa�-9���oH�	M����j���}����}��`V�x���u����s��
��<Z���6j����Q�{M���t��q���~��G3��!�>�~�����%�q�'Sz�:���#��f��N�3�'� ����~%/��������f���0O��
��|��%U�����6P�c,��Y������Y���{R����q���.Y�7�\�����3��e�G�Z�u���Tb��OA�,�e7�a��Ko� ��[I�r�qz����%�,R���2�����o�X!_z���j�0������P���t4���\�r��nS���H��%��+o~��t���C3��Btg'���"�G�i��	I��E.mD���\���|�IV�E����XB�vy����Zi���pyEZ� "Z�w��`^�(��PN���Xz�����W���t�Qe}��ko3�A����c/#�{1��B��QGD���������zR���a��~��sS\�|�����]�e.�~�����N2\�W���������wCRK!�"��������gI���jypi3��Y����� :����	G��
�L���k��j�Q����s'��\f���5�����{Bq��j��%�/y�M:�f�vy�g��c������.<Q5lP��	��S��y���1�+�������{-Y�	����P`��s�u�l��CD�S�x`|�7?����e?*�58���bM��.I���x}�����
�?nR(��K�3�����h����7�5�z$�{�'�5��p����B��g�� ���o��k9<���y���5��SG�����\yH����Y>�sd���/�9����=`������'?7*��K��y� "}����4����g(�x�N9�id���}��8�A_>��6��&��[s�p�b�����i3���w�m>P������%2X�����g��i�����������Q�~��}�^�gd-ek ��6Y�D^3a�����pY�����t�fa�-O9�=6�8��x�����|��,�a�3�~g�����c�9���d���_x[��_��[{k.��o�?V�y�Kq�Z.�t.u�v����8������R��n*]2����6�h��9�������o���2>�Y�<�����^���b�{�������Ot�x.�~n�VI����#dk�i���	
9�������:J~���k�n.���K!Ww?Om�^����{��������5)�.������b�+�K�G����~�����g.���~���r=�8@�����~�W�T#$�b��A���%���o����?�:�0'�S�b�0=�7�B%�@�>^����x�=��B��{&f�F
���:��#�S�k%���������Ba
eH2Y��
�k%8�P��+�k���@��0���=Dj��_yO}�����AD�'8�-��x��:J��(����D���[�!�=�!���~��F�Ga�ui����$}��������$�����O>�`R��5
#H��V{�`8��S����.��.q�
��e:���E����6H���6��~���k���Bg�M�uu
��zJ�� "�2�H�����_��5M�n���I���:�F�o��3����S���Q�	�7
�9s��\z{k�L�b��;��R��\�B��q���l�D�L�^�U��B��F���b_�q��
���6�y���HN�B���<��w�&
TO��`���gR����/%�~�K��(�NmN6��k���,������d��z��{�^�Q�0�<��[�����V'/�6��D����������J:�f�v���������g�����u�|�S���5����z����:��~�z���)ypH���t3�W����I����a����V,`��<�R��T�x���n���g<�;�3�~T�kp������k]b�C�{�C�>����k0�*(�y�]m��f��"�|�
��Qx����9�cW��i���!���=���n�������0��l�p���0D�����
�C;6�W������Yw�����2���CRvb-���T��V��q4}RG�N�<��3f9�f��uTVE�Z��w4�{�hv���%H�l���^��|�61��\j}�Q:m[s�H���|M�������k�ca�	_\L\����<��>g�t2u`�A��c��LD@`8�=��@'���k\�OVDT#���~��W������q�Z�i#u�~;y���v�I�O�`@"���Cq������p��8��b����5.�)O9d��~�����3��o�d�+i����ao�A���&�Y�q"F���0j�t�^G;�1������q�=fq�b�Yx/��k���q
���`���=n}�B�p:S�:h��n����>���=�1?��qe|t5��r}}5a��\�\my�zujgc���a��^�n�������{{\��������{IM�i��'�������w��Y��=�A���Bip�� ��q���)�:��5)+�.�����j�^�'	V��v�������������;(/�\�u^��k�����:�9������g�oA 
���"������� ����k	f�m-d���K��
�/e&��,��EI�K�I������C�>��a@Q�u]���@F����Xvn��V�g�~�~��I&kd<���,���}2^=�������M$���Vb������~S��USk���D_!=)�~q�(7�A������*#�6���
K�Mc�d%X$���K���0�����p����sc�xe��o�Z��L��t�gk���@���j���A��16P���9�J�����(s5=��a%��4�dA\��(�^-Vx��@���?TM-�Q`	�mhF���\�B[��m�aS�3�T(DD�9�X��a�su�^�����@>���x}A��e5V��H[���S����htk��7�IA]E+�zw;�C������Q����?i�=��$�Y���Q>��#�rc���z2�,�c�V���np�(�����a��*J9��/�^�����_����=��P��/��FA�op����z����J�#��0&����gnq	������v�q~�h0�����n�#�eN'q�,�Q!����?�����Z�$�P����_;1����)k1�X�_�~���o	2�rm�k0��2�p_.k�'lxN��p�����r..��%~�h���r<�"1�Y��!�S��<Dc�MI�����q>�U��sG��!��%d0�\BiE�_�����o�f5����C��)?���}���^}kEY��a�>J;�Ybmd���B���7��;]�K#�D�R.<dR|.�t]ml��/���2m��NG"�9&���E�I�M\���i����W�}���[om2[��.���.V���}�^��{����8��F}&�9pu�~:�Y����D�)��L���)fk2��n:j���&?�������{���k����-1��;Z}��J��C��Z���$~����N'1�R/���N�d
���������������	�7��O2?s-�l�,�����-�������\)I��3�X?(���;����c����6��c���,����X��"
.e�9}%u�hst8����7lF�8kRr]��#mY����O\��=i��8�������8�{[vZ|����s����V��1�m���*��K���s�L���� �b�
A������x,1����������&����#��m���]��5�x��g�t����z_���������xG{�{��)���{�Q\b��"����/6�l���Y��%�e��HQ�wm��Y�haCG-��#���E �`��S���b��"B��������m����`�l���l���\��F������*������S?Ckos�dq)�6�l����7>4�h����8�J���D\�6i��\��'�~�&����+��E��9@���q�����xU������Q���/��c>����	��=�Q�&n��fj�m���2���w��Hi��:�p-Q��i�}�2�~ex�1w�FD��
c����c��,�\�x��V�����a��[�h;W1��1U���������oo��oR2S?��K�e��~����:�$�S����{Ij����z��&�X����� ������ksN�>�B�2���+����{A3���2R�=j��nm�-��};�6��g�����i�=�3���e5w2��<xO��&��pW `���-����g�|�Q\�>�y;.e�v����	�6:;���re�7�L6�����Njo�~z�t�S�7���l�U3�c���8R����������J��������}��Iv�{���Q���D�X>��K��]�����VVm�&�o��?��@��*�8����^�m�{�~���/����(�5��=~��gs<�5�:�`���cz��4���:`��#%}m����j��$�0<�f�,,��'_~�����h�/*�.�g
K��(d_�i�G�LB��Z����������I��������~�m���#�i���F�x@�c�C��G'��0�O��>g���*y�O�Zb�c�����m�������v>�#�����#�������.!k!����!�^�si�)/��j��������9�r��'���	}6=���y�a��s�k J�l���t�'�u�N42��C���lO�}���a��C��M�cJ�b���u&�Q��^�}��y&���w��:��Y�Bv�/���^7x�U�~l�6J3W�z�n������������8l�e����:DkL�l"�����7F�A�:�+�����v[o���Vf�v��1����F�B�/��$�~��1��Hn9|O�vp������Q8nt�lh���-�q�\��\�d�H[V#Kq�����4�w.�P�{�N|��w$]wm��Z������ |��g�D8jL��7���B@�1�db!�A����_�m�I�E��IY�k�P������,4����2X�1<&�������HC�}.B8��0�1�{�i���5lOa�Z����)�� �0�*���t�=��{�a�x���}�;b� u[�\��a�jw��R�BE��KQf�v��|�I.�CH@��wj��7�2,����y�/����}�L�O��f
��oq���#H���'�\�,� �R�vM����Ys���
���27a�A�"R\��o���!�)<}����k�A�#=sY�	[���"m���x��/�+~0���e�e�w8�E����I��[F��4�^P�a�Y���~V��w�l�(�^�N�����Uu$?)���b���s �F{�D+�~K�G
e�%�}���~�������{L�k�5a��������Y�+����v[�0�������u[�����A�F9
O���/���*K��q���S�V��'�5���v�)������21�l��3��vC����wh�9z�N*����&y�k�����]X����P�&Y��:��_e���
D`
_��}%����gM[��3�Z<�����_���g��{{����DV��s/F|"���)�<����M��7<�����E�����=(�q�W����2��I��\��T���|���+�7{,~����3
l�>����u��0k7�0N>q)�������u�{~���"z<(#X�2>�������U����~���C��~��������A���"��1����#~���������"�z��\��\g1K�~�i����GV��g>�;Y(��A{E����d��w$]wy&�s.�V�E?A?b~I���~!A ���jr� P����c>D���,$�!�5�O;�n�p������A@�C����hG*2
�_��  �������'wW,�����dQ�^g��G�U��b}g��  �C���p��!��v}Ker�W�\%��  T>��[��\��H ]^�O2^s�'e\!A@��Xc�5��v���R��.w�@EA��C��h�g�u�P!A@�Y���O��XTD�M���8l�#$��E�����m������4�"���)��  ��79+��'J~��F�S���L��C,TH�2T��B����{V���g���=A e�z��y�s�a�����8)N���\��X^0!��;�O{�����o.�>3!<r�  (���X�;�x5��o��:S[�D{������a�=�>�t|�-r^A@�# \�
�@"�g"�Pw-O�yF�R�$G��:���L:��
��M��|R[A��`��a�vR�� "�HH�C@������+��K��E�����X�ZR�H�\l�ASue�e�*�==�+�����S��}�]4����[��wkn�;�3I]3�^��.��  �7 9-�Z��v�n3�����a�G���@�B����k�5T��k�G���{��
�~�2�� ����:����,R�zVK���^�R� P	�5�6��r$����r�  Tv�aKu���*�.��m��>c����T����~��������|J��5��OA@A@���pc�$�e���+����j���x��������<A��!P���&�n���T�>���W�w��t8K+���k�����^�N�R� P��5�7��z k7n��_�F-��/���u�"�o���  ��WS;7�B��;��m�������
5��1���
�L���  �@b�
F��Q��u���9����lA�B!��l����'W�jUyA }V^��j���l_�>�Rb%F@��J�����4�h�rx��7��	A@(j�XC��SS�]�C�������n[�(uA@(T��[�-#�A@A@A@A@A@A@�J��p+]����  ��  ��  ��  ��  ��  �@�" �Bm��  ��  ��  ��  ��  ��  ��  T:��[��\^XA@A@A@A@A@A@
1�j�H�A@A@A@A@A@A@A��! �J������  ��  ��  ��  ��  ��  �� P���P[F�%��  ��  ��  ��  ��  ��  �1�V�&�A@A@A@A@A@A@�BE@���2R/A@A@A@A@A@A@A@�t���5���  ��  ��  ��  ��  ��  ��  *b�-���z	��  ��  ��  ��  ��  ��  �@�C@������A@A@A@A@A@A@A�Pn����KA@A@A@A@A@A@*b��tM./,��  ��  ��  ��  ��  ��  ���p�e�^��  ��  ��  ��  �� P��R���R��Zi����S��W^�8�s������R�
��p+h��k	��  ��  ��  ��  �@�"���j��b�-�&��5n�n~yyA@(��[� UA@A@A@A@���@�jUU��B�@�" �m��  T��[i�Z^TA@A@A@A@�1��w���  �8(�5��  d��p��VJA@A@A@A@A��6�Z����E(T��[�-#�����p+KK�{
��  ��  ��  ��  �@�"P}����m���<<.b����\'�@6�7\�TA@A�h�Z���z���������M�����@5�o}g��;s��^�*���*�Fi��:j���j���������V]Em�l=5��I�B�j�k��j�,R�g�)�*%�����-^�&O�>�<�

�Q
��Q_M�Q������A��# ����+�����5���  b�-�&�
"�l��j��j��_��/������B���IPM�WG��Z�t�zF������"�B�V{���<������U�H�������'���}�j�;������%��k7n��_�F-��/���uj�����_l��}�aj�=Z��o�Qw?�B�U�����Ym�
���+��5��S���X���4m�@������?U�K�y�(��{����Q/����~��b����+�����'U�����!�i#nq�����E���z~�	m��_��?��(�!�JW�������V��Z�J����r�Tn�pJa��  $F@��!��������h����O�}��E�K(�_p��q��Mm^9Z=��+R3�Fy"p���C�o����M��>�<�"�.2��D7�`�Q�Wl�u3�������=�����Y��wmw��f�����P�=P.����R���������_g���f��{I����r�W�-�>v�>;�3N<�T�'�0zY�[s}�r�o�UVVw�n�q��#�[��-�:I*8�\���y����G]|��E7/��j���s�*�{$}r��J�i# ����A@H��p}��p�&j����7�����T�>U�CI�^�C��W���2��z����+<F����@���Q���������U�=������pp��j�����O��^#+b`�8mY�o"}'q�(#�uj�P���B}���j�O�xO��]���R*��m��m�m��gmP�Z�q~���:K�4c�-ghr~|�>���������n���ST�����]��q�[���j��_��H�Y�����Fk�U���XiQ�D�_��z�p��b��|(��I����]��R];�	���p������� PQ�O����l�>Z#^~O=��H�+�PEG iX��j�]�3�G�;�3����w�"~?���t��%:���w��?.�}��6�P]��t�lm�y@��
����10em�/X�NtF�y�8��}������q��=�N��R���J�XA���}���H�j���_���z�B��q�XV������\�[[�|������{WY�����n����#�����a$�0t�\Y"��v���~��N�>)��Iu�����] ����n�l�
�Rb���M+/&E��p}j��m�>�oh������BUt�T����B`��T�{���t�F�Xq�Sq�_y�^�N4��Q4F�+�:�P���;#U�+n��.�;�q����@yp�l��:�G0�Y����+��c���Us
5|@����|����\�a���BD@t��*�]'1�w�U���������  b��ia�}@�d��T���-�@eT@��S%j��_U�N4��Q4F�+���E$���4��z�����vQ��O�q1�z���p��B�ULD7P1��<�*_���6Qo�������S�������P���~�q��������b]}}�����:���t6�e���YC5�{H�B.�SM���\n���}��3f�9��F�v[�5n�H��^S�t��U�k������L�����-\v{�s��U3�5]om�p�7g��������	j�����;��aU{��m��>��gUs��1��H�wA@�1��`.�(�����J�����n)H*�10U�������D�+Ec��B�^D��.On�OR-��T}<�k5����U���np��79S1�@�h�Bz�\
��j�T�����Vj%�/�������Q���_J]���{������=;\e��\�M����g.��v����3f�!}�3���u�}�W���~���[�W��w��j�������Cw>�~�n��y��u7XG�t�q�V�Z�S%����G5��7���SJ��}i{���:�KN�������F����>b�
BU��� P6��ga�}@�d��T���-��pKAR����B7o�/'}'^�(#�b��"��wyp����Z�iCU���npn06r�b  ��������p�����y�Q���_1$b�]}���q�f�+-;��`���������"�T&��:����O2���������^�����������^PcF}Zr���n��T�JU�Z��4����@��]cC�<��~4��:������wo��a��7
�m�c~?�vOj���rL��C@���
���6YO�":���U��k�o������ss��Y�h����o�R���=�t=1�
5i�������Zl��&j�-6V����z�����(9�~���zj�M7P
t:���Qk����9k��U���N������t������U���=�p�W�#O�^r�z�4R�G�v��n��j�������/�f��?k~�\�����v�~s��������_j�N�1}�\5��/�����(e�%�'��&���k��_�t���l���������9�c�g/���wk���w�_�����v���]�Ou�M�n��-���F��z���������.[?>Qn�����N_S�^-���f����h��z����;`7���M����P�/1���A�|�
�Q�t���;�/����>[}��Wj�/3����}��V1���MM�����j�~/����c�Rk�5����i��t����MW����f��e�1c��I?�)F�������=��r�
T}=V���_�O�|�����=.������[m�v���
�]C�?}���5f��������b���#�T'O3�7�i�����Z��f|.��oxG�i��	��Ze��j��V�l������TQ?������L���k�y����P�i���_)y��U���������
�k�����9��S�)�>���8UM������{;l�������_������n����f����Z��������;��r�����k�R���]��~�s�;������Q����\�j��#O����`�`,�Xs
�����P~Y��|����	,��b#=G/��5j�a�%��I�g�J�v���)��f���,E�$���**-���-K��)$�B�J�"�,�JE��,Y�B�=[���{������<3�<����s|�3���;�~���{���s9��g/�r�{*��m�Q��J���>|S��;�q����_U^|���V�;�>���z�Q�?��J���I��t���{��y�i�.��|E(��[y�����A�~M��&� 0�Z��W���R��^�����]��|��i���<c�k�a���Z��v�~�r�-�'��<F��53?qJ��06��i�������e���2��T�� ����1��Ys��c��K?�����o�'r
c�*w��|�c�h�����j��P��9�?��c�C
�-<��3����-Q�U(w�*CV��>c��q<^�
$7�zE;��uO�}FY�����pN���<v���C�����9�W
��������������o��U���y���Z���%��	���6s�;s]�o�x�y��e(�70����7��]��b<�9\�R�H5����W��j��0��e�c.�wV������\�>���y	�[i���� ��<��<b��]�X�o����J��|�nKh�q�-�
�������8��=��6oW��6nR��MT��&�[��b�U���KQ}\:~�0/����o �"���F�����������'\s^T���i���z�To�����o�������~f���4�MUjTRp�����������1v���Uh����Q:������	}h�$���Z�0��+T�PY�=g6}*������S����+}��E]�Nt��;C1����4����OP��7�C��3�����*��Y2��*���1z�
cl���k�[�j�SM���k7��o�������T�R������b%�����Y�_���w@W���:�����C���^����9�D������s��  �!p��l)���W�>�sZ�|M(y��OSV�-\���z>)���(/`L���
��PBa��$	�p�q��s�"�)��^������P��O5�rL�8	{(��L��I�\�����Q
�pi7l�A�^V��j�u;+I�+%�S~ n@���;E�9�������
'��k�R�I{�m�Q����������o��m�q��-�5$(>�n��o�3/D�����k���t5�xS����������\}�}�/\`K�_������x�:5*�r$�c� '��^�b��QM��V(������]����@��k�����:����3���wi��B��[>��tv���y�@��y�B�S��f_����C��Sn7NMu+���}��'UveY����e���� ����f?���h��h�$j�V��ek{�����|�������Y9_�r9�����i�~F
��Sz(�kT�
BN(�5PV�X�A��'��������8z���{�j�`h���p:����h���`��O��+�'�����}�dn�1�]:k�A0��	�%��y�-��m�yH)�Sd� ���l|W��E��'st����%A}��mG���~���������>�s^���<���N�������]T�Io�Q���58��W�tT��E/����2�uE�?[��f�Z�:+FA�_�M
���l���IN1)5���x����&���I_b��U�[���M�0��J�����d�����Q���&A2:}�1������p T�<�V����'���L.�AB�!p���~O�������@���9��0�w|_a�����K�#��Y<�a��@R�;:m�jNc#�Z���`���kz������ws�����79��@F��do8<a1�7���U��X�����Q��3���1�u&N�H��r�=I�s��
k���:u*]�[���c���	{<�y�6���L�5�<\��3��?8���������)0Oh�l#6HL��$�b~4a����<�I�A��p�h��fu��q��}�m��������u���>A��|�w�rl�r&�3����1�3e��Q�c��N����(�U�]����\4D�5/�7�����:����������1��^����'i��S�:�����>�P��:�y���f�%#���o;E�"�2���C;;I����!�.
�i���T�m�����������K�XN�?[jwY�X4���QvA@H��e��aT���4�K}x�������V�g����L<=G���
��[C_�B�a"�b�g�� F���3/*�������9P\����q%���;
qXF���c��V� �p�.��:z������_��<��@$-H��lm�^x��cr�2�2�W&������������9���@p���
XTg�	^^V(aoO��3lM6|����2���}�+���
M�A�AQz��z������	�*w�����1��7��R�k��('[)Cv���	�_��F��r���'��.<f�mV��S9o*�WM�a�G6��A��b���eopx�B���D"����
��Sg)+�J��T�,�'�y�-���`�������m��.����M�&���AP^�;�H���3P���eo��b���@IDAT������U����u� ��n�z�70F���7,����������/xGvz�	�?Z�x�����2x��?�LB<�<:�q"������J\X�;��W�[
6l�<B�P��9�]����@o.��sl��UV�2�:@�{X�Wc�s(� /�����U��?f?���X��!Mz&��T�z>xf�yB�s��iS��Y��~�}<��@����k^�P�{�f���<P��:)����3��b��CSMt���s���������$������^15�����sO5PY}�-�w����k�f�L��~����;�����A��_��oo�?���(�
/X��c#(MO8+k����b�y��@?���J'��2���1L&b,���~�A
�_��z^�wi>��~��t���7c�[F
l�����d��i��{��A+ }�������y�������vI��7N �Y����h$6�_A]!C^���������~��+��Y��������<�}��OG����|��!�1��f'�w���W`�Z�Ri�q�r����0�Q���������m~�^xi"�y������7�!]�4Q�/�M���������
<|a��1�7�
�a���[��E���x8b����b�����FJ���}��H;�;I��f}s�<&h��2(#�Dk��2����zDk�} ���y������\�a	"��]�����}���}N���I��lcI1~2�M��1������wC��+������W��#��6�4�����1�S� �Q2d����~��0'����Ql�"q��`L��Ece��"��c�5���=��y��jU�]�W�c��v �eb9��mK�lA����~�5���S<�����30�G����=7�G6��Z>����v��y��.]��<��<1G�|;��bl���,�Y�	�>	���_��������:���O�89����uO�g���M���>�z�G���C^7���h��p7������*�U����9�[����'v��2��
z�����?������R�
�����Z��-�K��G����y�����.�oW��^/u�9{j���U�pYF����
���A ��k������v��0����k���g+�M��P��w����(��V�A��I��.���vyZM��^�fzS��
v�9z��!��L7@}L����c������;
V� 7�B�a��=��p�4c�yK���8��"�@���]�!����
,��mB
��L69YG�h��y�(�du����U�5O�����Q�z(���� ��6`*gM��1�9��k�&�P��<�#���C���
j*�
�~\F;OS��9�����5�#�b�	������*�Y�H���&>�x��@�������Q�6nWQ�NC�)7Q��l�%(�s�K�C7lg=��m���PX�8A�v���G{��X�$]{`���+U�=QUm^�kB��o}� ^�t\��������W	��v��
U��y���������V/��{_�fZ�����e�d�v
|�
;1�	SiM�=Xp�m����X�������_h�����m��|����]��O)�#B�;�	�_�^y�/Q��U������zw���m���WwP^���f�k�Z/w�
C�=�U��<=^��X��A�{��g�qV���A�gW��=0���.&��hFf'f��,��

�(�<��������_����%VJ����(�n;P,[�x��5���������J`������Pd(HQ���Y���_v[���#0����%1�M4i�	�*�i��(����R��E���`c����+�#`lq�"����w6)����~9�10e��k��)`t�An�	����~@���������6�eP���D�kAp�Im&����x�G�0��X���]ey�.|��Dz�C����
��&p�}�6V�'��t"pa`����x���/I�h;�#�;"b�?���������F��~_��� ��)0�|�#"8E��X��G1�0�M��y>�c9��wK�`#�w>����b�0�j��7���	���7\�s������@�����=z��K�����k�F�1�7��?��&�k����~
�c��"� ���A���3Vc��~f~	\�{������d=x�F*�y����	\��=w��-��;��0���M�S����^�������K9��'CF���j�����~�_G��u^^�&!<��W�g�#���eK��'��L]�$�<w!p�P����  $B����w����
�����(�[��C��d�@o/A�8����J`����@T��84�r�Mb��,��k���;+�J���	2n�G*l�5=������	�U�x�

gMc�F:���u�*�z�gyB��h��h&U�U��i ��)�Hb�:<�����@b"��]hV�m@+g������^2��yi�-6:-���	wS��k����=�h8f��5���T��F��[��^� r ^���2I#	SLe�5���N�C�/+X����v���A�����a�!"W�7��3<��/��!�������s�)�^��	�kB�����I��w{�k�C}�I[��i�~")\�������0Np����_hg'
s������N��Q��(���Q���K��]Cs�{h�������j�w4�w�|GPp���Y9�k��zx���m��h�?0�������=�[���X(������c����{��N����E}� ��0&��9����N��E��q�:+%��7����P�c����d�n��}A8�ba����>#�OP�Os�!\&�����}�}���j	y�<!����M�q�
�����o�����3O���;Q�7�w[��x�<�7,��u��Ez��
�#@�!���Z��`����pn���g
(D�����z�����-�\8C�+��}�aM:�Xk���4��mf)���W��-u:�m4.����'ok6��x�K "(�{5~�,������H��z��kD���]8A���}[)�N���L`7?�s�7"V8����h�O�f�g8��X��������M������v�V� qz���1�L�g���M�}T��qKga�5��h:�Q�"�h��p�K��Y�=?���c���.��=���J�k�n�:nF(}e^��.�Y���Gzo���s�������n5�#Er�G�������X����[��h���
b�#�@�! �
�~�\����'�y-�������/�����V1�xBi�Pd����/wW�Sg�]F�|�<Ar���8,���b��<�B���r�V�vQ�5|i�����r�N��2�h=��*N8o5kyA
���u���J0�6���8|�Z�����7B#�m��Mf~P���kiu�3&�����m���:�y�s�bXt�yf[�a	�
����0�8������i��W��B��$w���m&����+`�
eA8�:����@��m8LL�A`�IA�����9�Q�_���0�aO�4o�����.+�cX�|��N�m8�����`d������)�|G�|��w^��&P��<�X=���f�q}���	C����&��2V��PR�M�m����`�j���W��8FA}�<���|��a�r(�#��A�.����7b{�����Y�������h��_���i�5���<��(x��C����;�T�
�
����.�[4���~��Z��y9�]���)y�H�<�� ~���YB����z������*��?�{�]D���+�����)!1���7��N���'a�Nb�����ZDdA?�&3�]�������Jb�c��EG�*�f��:������_�����	�_�|�����6v���u�Q�"qm��F��v�����$���U�;��[���9�qg,��I����x���=Gy��-��o�~^c�m:�K���h��p�2	���n�2�P���*~��;����p���(i���^*�5�q�G������y7)T� =��)��g/�U��U��R���uW������������kA@�!pm0�?��=��s���]��mn�z��b�8)�Mb&\�e��^�p���P�g�C�Kf����n���Z��l�7_S�(������p���R�m@+g��������|�z���E`����*)�0�"�S.��y
�1������TL���8��_~iV�p_/�xF]o��u��h(L����\��y�Ck��������^-T�F�K&�W��W<��%WRk�6y��*�;:kl�	�Nxe9y���Dr pQ_?@�ddb*�zx=lh^7��A��l,EY8y�Qm�{wE��k�q	�X��A�{�����Mib��j�^��!\�u�J�0��u-���_����/J������8}��	�z���l�*���,�	��1}��;��p���f����pD,��k#�*,�����#|��������" ��h�p.��K+�:�+yD5@t�O����|�2��M�q�.L4n�c`���g9�7����x��@��1��Nn,��w��N:
�g D50y����\���	\�'���m�Q��L�k��h� �����H�����m��J���&�<�6"A�#/����/uP^�N��� �#��^����X���������������N�h,��kf������u��Y�hq6�J��X����&�3���A����*�=�1<p�YH�������{�|4D^(���N��Y�"��@�������K��&FC���#�\��@�g4�L�fT����	/sh����On����^�X5����<&4��}���T���T/c-^^�N�y�B��!*�A@H<��������0��"<Q,��������J{cb
�����{��t�[��7����n!������V�N��zMx��p	B����uR`�{������^at��G��9
��S���d6���	�r���l`fz��u��Z�5���}�z&���ktujx_Uu�]������Kk=��[s��p>E��K�*���x�.#X�[��
w����0HS��n��S
�j���c�'��k'����k�W�]��n��39��v���>5����kc���%�5���B�jO&'�ix9!
d�E���O����8�g�
c�>�)�Nnz��x�s��>���zG�|�PV�i��e�p��N�v����:_N����/�5�����r����_���'J������8}�P���	o�k���H�=v�
ie�~#�yxMC�����u A��W��'0���#`��"5�_;oV]F�~�~�9N�[�UX�p�[����n}�n�W�N1�L1���W��DC�9FXm�����9^E`������zY�t���/Ao>C'7���;��Wi�'
�9�SD�+���)�M���|EH����O4n����USk�����ED������}A��"%�k���nz�,�9�U����{0.5���Ox�k�m�1�>0O��a���9�G�u��$pc���9�n������'�3����#��bY��� 
���d�W���o��y^��ML"k�_8�|���W����	\�� q!s?���]��
J�1�3e��Q���H:u����1��������']�rJ��L���V��t��4�����V�Z��5���a^��k'�s_8o	-_��.��1!pma����  $B��@�W����iU��d�z��kxm�w�����c@�g����bO�U�c3 ��������ZbRh'U�(E-�5T��i��5j����~����4�SS�T\w�bB]�C�V�T��<r�:pxB���P�9�g!'����o�m@+g���TC�i���#X�/K��f=O�'���J:?�ZY�r<�vp�!g�� ���?>,?�h���9��^�����=�j���rB?�S�FuK��w�kx����{�d�y������<��>�Y�6
�h�}��~�W�x�-K��=�:���+B�y1���
��>�����O\j.�M��>����}��>���;��o729;�7����� �X����w&��u^���T!Q�B2kC(��H��*�zG�|�Pf��`��<�#��Zy���n����}r�/cU^`%�
���������=�D��p�*�wPg�������q��U	�?�X��Y/W`F.H�65��U�9�Ug�k����
��Q�Q������#xx�;�C��
����
j\�����
r�#@���o^%+�����?�iXDFE��������<X��k�9�P�e�������}J���"��S�&��K`�����	.��	\`��K�@��9�F���J�$7�~FGmB]���o!��=N����K�y���;�������`���&��/��Nt��]{~�~�&�K���������u��$pc���9�^�$�A�;c16N�g)����+wNz����7�x������j���r����yI!����������%���}��'��0����D?��������������I����8�Pj�?��:f��U}��s~�k\t��l���(^��z����W�v��CS�N�w^�0����
�� �<��9�U<x��FJ����)?�+U��� <��2Y++'�B��#f��z!pq�]w�QV���j����v���?������*�t+W�(�,^�	��������8�o�{A���p�A��6���N
l�v�W�h��s}�{�a�R�0|���.|����R�vk�8�*q��<w��
����Qq+{V��R�=��DH3&��[n�G]h�1	��L�d��M�(���&���n��v/%7_���+v�
���z��|u�p.�5u�W����`88�n����y�����D7�����=���(\��X��A�{���w�@��^����_����7J*�	������k��*��Ey2b)�������3���.�_���X����I��u���F?4���*��? ������m�
}��������@����N����/�q
�y4�F�z}gQ����>^��:-�^�m�5�w�1��:��a��|����3��i��D?��\zy'�X��m��hn1���!x��seWs�N=������x,���{T�t��pG���c�M<��coTF�:_����
��A�|x�G#���1��3r��x�����%��I$�"�������Q�c~]� 	�X����m�Z/�g�;�qg,��I���!p��,\�U�y'�X�]�����N�5��=k'&��������.�:v��4�`��y�Dr'p=�������GoB�W���:���r��3�3d<:��
q���W��������!�nbh�_�^
��v\���c�����&H��_{�.[V5�y�������.n��  �4B��<��/`��P�����R����N��x��I:s����c��5-���8
�R^�
�;���^�,���(����7�>��	��y�VD�>n�B��^���j���Q�x�����A�c=R��bG��D�1��R�z
�i-��o�m �r6�2j����h��C��f�k�c}�4#y����YB�-:K@��{>ve����nQ�m�����S�X���RK9m*�@�������:��z�/p�u���@9�����do����QiVk}� of��<
�I�6M�"�)�5�(R��@�����X�A�{(��wD�5��4~�����X���BI�|��M��v���M��1
�jX�����k�z]�m�NM9nA�[�\�	����Cqo
��`D�����~���c�Q���}x��#N�#����/|/���l�����n@�k�w}�f3!��)~�Y}�ID�C�w)�x�AB@����{��_�q1��|���������-}�8�n	
��i����(�Kw������14<��>ZW�1P�oG� ��c>�1A�L�R�q��	�kO��6�>kc<�L��9���/���;����
�����~�����C��0�?�����x�$d��b���������8B������^��7���RNE�0#�zl�������k���U�=���z�>A�����3��qR=� \�k��Q�:��t�[�!:����C_����f'�����d�4�) \(b����t�k��sp�m(�=g|���r]�t�Y�0��J��-)w���4�H���t���B��!3�}H���hMBB�&�D�� �l��Qy��K���5��}@f��k����m(0\���3��U�'%��������D�B�t�:i����w��ZcM9MFC����U�c�~�L�����������v.�M��`���������
�)g_�U����l��n_�e���4��w�%IjEk�������x���C{g9�A��V�B����/[���X��;Q��{�=l�����ekh%{���PY�Z�~k/7/�y=�$�G���$!PO{�3V^:+?�nHR���{���/\ ��gT��� �=���;��G0!M$�On�e���"QR�p�0����W������h"�.��Q����Z'^k�K��u���#o��uI!�a����'�u��7���3���z�^���`d�)��C��#���N�B3/�������r$pQ��L�_�d�����V������-�ny����������~g�x�y{Iz����:����=���]��������J$�O�i�����WC�X�E�n�8�@��t�/R�n)V0�D�������8{_)P\�\*�&���.���tb�cP�g;UF�^{?�w��{i���Q&pL�1����C�E{<�����c96��0����u�����e�j�9�}����#R����w��N�}�c��66N�g$���-T� =�\����f}2�S}:����eb���	�
��S���S5[���.�ub�	�Q�
g���������B��n��ndi��q�i��=j���3P������/��9zC�P(?��q^�X��w�$�U4�]��M�A Q�n?�\���S
�J�R�H�3d"�����\<�pX�B�u^s��������k���7� gs���N�������A�� g��M����e��a�4�����1�(Q�e���6����e���L�uH�{�goJ�����Or��:5�,�"]�M�����rB+�zVq{>��B�����Z�4a�0��#����2�e��9*:�`p���V^>�����
����k{��k��w�x��W�{�zG��%�w��}G�3���5p��	:s����X����RG����V"�%p�g�}/�t���v`������y��j@����:6]�a����](=�on��X�Lq���a�������[���Z���C��R���;�������o�?������Ff>n���W{����E�h����a�����U���H�
�����1p����������Q����V#���o��
����W���4�9|��.��e����u��3z
���m���\G���5��
��������\n���������E���}�]�[�5n�/�n�;���uT?����%����j����������&��l�Sm%0��z�H�����}�4z�y[��W�s�x7q��g���$�������{�w����.}���S7�'�3��,��(�H\��9s�u�H*�$"/.�����-V�c�����m��^Z��K����d��	��
���W*�~�b2u���H�;��C���N�^��q�wt���<w!p�)A@HT����[hW~��&��m�"�!��m�o^�'5[���u��Ox�tz�Tz^�.��:u*RJ���n�q�?�UP�r��Q�V����/O��x
&7�$��������*��?�u�g��hE?S�m�M9����Dj�QvJb�2���������k�%KjE������Pe0�M�No���aza���7�&������|��7�z��x}c3t�/?������-��.�O�:�7+)*W�Q�w��v��N��&w@�@a� ����b������~�]E7�)��'��2V���������`n��+�y$�X3��}���o��FW�����!��''
��J�����a��4kr/�]�<�a/�>�y����}�5���b�[��;����2���������{������G/"���������=[��8�F&��H�����OP�a������/��;ncE�6y��W���;O ���nv,T�'4��%U0vD�/b��N����W�F�PU�5w}��r��i/*$t"p�9S����2��{����5�b���H�M�z��~;5}��:�'����M�1�Vpd�pb��j���������6N�n5
W� �%�<��[� \�uhg��g��s^C���Hj6���JO��m��A,�F^��4^����nz���n ���������q��[>8�g/� ��qg,��I��bE��^��������7Z�S�q-�y:�mr'p�����v����y��8����c�n��nEd����
���}'�S��J���]��ko��������T���~.Qi�2�|�IeS�y�B��H��  ����6�wd��6&2�3�$�0��!��������R&p����@�&�j2��@!�[��L�}��;���;�f8��?��N'Z������_��={�X�,����^��zm>xBIl
{+�]Y���M�Z���w�2�.}���O��r���=d c'}��8�y�������V����@&����O���pa=.���ME��v��Z����?�kS�$���L�wA.�*Oax�Q���w�|_�}�Pw������`�������Uy�qk-���^#dd�.3��N��A�(C�o:��m;X����������5����xw��~�@��ip���e=NPkm����7����-�7�hE���L:\y�b��r{����{F4�C������������@�����O��<���a�f]�����.y��[s���;�T����1�1�?��g8N��eU���X�Kc�B(#45���3^p��b�g���y�Y'�l���c�V^��G�?&f���A��qB�b�;��/���5�e%����8�w"��?8�K�3e�"��[?����/u �K��������#������x�|��1(������Q�<�P��C������Jp��c����V�Gs�&Hcw@G���#�����tf�������}�9��u����)O=�2N1�ib7��o=������~���t��q�qg,��I��bI�6o��
��;�������DC���X��;��"E
��bG��)���^C����c���}�n���J;o�|�f�w!H���Fz��y��������=�C������u��i�����m�]W��RT���*��9�h����%����J�!�@�# �
���L=��W\��#k�	���,�����b�0���Zm�IP��W�d�=w.l���j�U��V���������������4���N�O������AT����:�Vp��K�*����������f&����������%x�B	�a��F����R�X��7<p�
�c������v:/(�������W���-��|���x�����s����X/�TXN������kg����\:��dG��h�&�8)��4�&o�\�Q�����6:9��O�4���k�vCa�&�|G���2D���}Gt���H�����X�7�_��\U�F|G���z���k���t��T'7�o:n��� ��h�������"���L�l��s�*Fu�����-\E�y���
j�vV�tQ�S`H�kF1��o�Wr�/F��n���@Ta�`x{�4iR���u��;�{�+^������(ung�����c��p�10����Sts���������}CJVz���5�`�:��(�+x�����W6^r��A���r"pq.V���|�o,� "l@���zc�'j���>i�\�%X�FX^�H�z� ��-��.m�������{�}��������^%?e}��f�K�����Qf^��.fz����y�^5z�CDC�F{R�������No��,V�}�+F�b7���"n���z���UXj�*Z�����>������<�W�o���0������X���>#�1�����W7����,"��V� qF{	J����bl�|������>�3e�����=�����G����S�:�I�.�U�Ve�s���+E_|j�$o��x��R�O�s�M��m��vn��fLx�u��:���0��}�csub���4���	G�������3�F:[�xP�A@!pm ��:��B2�
�mnA
�VQ�C�{������}v����rf��tH��.tE'B����:A3��3{���P&.�z=!,�)����U*�)T�]�=��KM���\�oZ���}������	�������F�F��d�m��r�,#pD�r"��E��Ec�$�����X*@��8�����mB�� ��k�����7P�~c�)�����o�N��I�#�Y!���M�SkI#����xy>�uB���1�W*xCn:��L��W�P&N����cq���;i��wl��T��}��{���eU��gbG�����5m�i��t)������2��O��	AaN@�`�����
�Z�vg�{�5�],�Q�}A��������L��?����(/0�q
�FA, �>�~�	�
�V����K}S�T�#p����,���N!��#�Q��^>������X��8d��u���2r���iLe�|1v��9���� ����FC���&�`�<��/��c%�/Z��f~����	�!O��F���WPFt�p�S"}g�+�}�������z/�^,	�<�gW�D0>i2q�l�9�T��������2+�����SZ��f����B�q3��Z>���0�F�G���^^	\�]�}
�&H)x�I��y��B�a���s�[N]�i'C -<7�6,�����}b1~r{7�u�q���|��g��Y�KZ�7���D#X+^�x��9t5��v��8]�����m8�����nK����}�d>�8~��S�7�e����x���t�	��<Gr�����d�a��D�1`$s}�l��UK Y�>�k���}�!�z�����a���(c~]7��n �+��T(/}�b-��sX���p��X�;�k�b��t�v[�.�O%��B[6ne����T���r;���^�����4����P���K��@�c��1��S��-?���4l,b5W������1SF&y�+�	w��C3���N��������<������F���&���zg6��.��)���oD/6�K����y�X~#��.�z�G�$C@\��wF
�1	E���X��?�vz�p`���w[���)o�j`���s��m(	K/L�7�M����
&r��l4�<A��{��S�I�/�`���k���5X�B�MY�|I����� ]��/\�@���Q� ����ZH���!L�����{!�5�����c�\�����'���~I;�H������XO!��r����T�������Q����*g�r76q�&�Y�H��+;�1=������)�����K�-(A��g�JK�����6n��� ��l��k�������C9��Jb����F^P�#�zXyg�.�"5��:�?�U�����[N��������CV����@b`��}��]�r���U����_J�����������}	h���?��'o����xV����$�����_�B}v�l����U�E(�������c[ed��$����y���n}�i�w4�w��;���F0EZ��������1����{��-���e�����7�Q�CxO����RD���#�oz$m��#�}��q�Z�|�^��)qM�����(�H�r<������N�l����#���~!
��]�<N�J�����f������q+�I�����N��r�]��
����;��X�=��#
k��L�B��a��������W���� c��x��|��2&s���Lz`��������+�1'���~���u����q|���]{S���\ (�q��/���Z��|����/��C�b\��cS�`��	�}`��"�!#�K�{�(��;-Ta�Q�p.��q!�����`m��3�=O��?7Uen-��J��@�<�� pc1~r{7o.Z�zu|R����0�Y�~��^��?0-��y �0���u�V`8��A0���0f�<�.r�[]�������XA��|��7���b����m���7�z�ujV����9�y�/R��<�@��������_�H����.�O�`	x"�M�����{W��mt��Iz�b�y����A�R�p����Q
���Y����rl�1�Yv�?��d�
���B`X���i4��A����zl�����<��~	��7�g�7��'N��
[��q�0a�\j&�d�L�{K���;��~2�S���u	�`y�v�Ib9�|�;G|�r�����K�7�z�iP+������;����
�P��
��y�W���]�O��G��S�������������q�����Z6X����[�� ���I3����T�-�$.�^��W�F��y�I�N��|�n�k3h����K�y�B��`�A@�!p`�2��sS���Bw"9NsXig��
7v�^�����CJ7���������tBY�)�%4����:U ��+�&�M��(��j�R����Y��N)��qX�bR4�&�i��!5��u ����b�L�*"�@p�z�]�Z���X�5�M�x�<���N�������o��
����;�����(g��������	3HZ���8����h��u��X+@�0��b7����pZ:��i���5}|��=��<�����>o�n�����{���V�B��}'��|�GD.{C����h'~�
(��Axh'�^#Xa�~�)t��~!F1��������J�q�?�D����>��bT�d���xE�:&c��;�����]x�M�1[�m����u�@�8{�<+OQF&�R�q��ye�R��c��h����"�'�tb��A�{~����i�u���������:�0l��o����@_�e/v����R���u"p���G�v@6�o�@i2��X�^�����tx7��^�_��@�o���H1r{�������Z�}��9�s����Np�
����'\�2�w�+���{���Zw ���s�G��n��8����6}���:������ -h'O���Jz�6�q|��v�L6R��������7���a\n�A������	�[�6�1��PQ�t���!��z��[�����>�q�-<` #���:��.!����8w��<'��Q��|@�I�P����P�3��
�����?yy7a��G����|���{����X'��7?�d�f����q+��1�s`��5������zP]��.v�����=����m
����������w�2�H"Z��<��IJ�%1&8��8L�?��o�����d��s��!ry���F�Z���s����i�m�c����f��\?�]����
�nV�:_��O�s���;H��w�NA��M��zff�v�~	\3�]~�c nA���I����;�����
�S�c��*���]����}_u���E]�]6>Bo���+�zk�����z�K��G���n����lJ�:�>�z��$���
�5����n��T�F%���E+i��%��D����
�(;��  $	B���
X��`O�l��'.Vb�Y�{����0A�M?��W��w*%,���y�����W�z��>,�������P<�O4�;F�'��PLFeG���k�������f3�&�:�a&��8Z��{�h������^����Ca&-��]���s��k�@��uV2l"����"��I�R���t��b�H!x�S��k��=�Z��'z�.4@IDAT�n+Z���+��G,�:(V���Eh7��#�a��������-=�d��=�U��w�!+At!�,h#��Y���Sn�;~�g����%?���ya�7S��X�z�����y�<e�{��I����������P��s����'*Op���ne���F�P�0��s�
.
o������B����?��q�M�S�JeT�r�V����'�����(��=�y#{���e�����)���yk���#� ���U�Ne���\����[x 4+�lL�M��^��x��5��p�Y��C�{�U������Z���a�l��������:vb������I(O}��"�5��~�����~%���A�Wc��B]��C��Y2�#���������d�.K�s�����m4z�{:��������f����
�
�z����%1����5���>�?G
	��Z�	iZ�Y���2�L��1VD;��V�y���<N���qp��B����p�C>�=a���h|'���X���}`����l�e�IT��r�=a`d�6��l����5+��iP	2e+{X���(�TK�eZ��Y�V��(���A^���
C�"����-��B;��#s��!^�M>zPE����[��c��
�a���#��>�s�w��)��s^�~a���Lx��?�b�D�����O^����.QD�7���mA_���a
��A�����;�#����nb�3��z�r��AYL#A��,�
���O	R����!�.��xf��_�-m������keG��;�@��U:����Z����2���a��(�1�_=�W��Y� �W��}�>�<��w_���f�H�k����j�������;�g��r��%p��p-U�R^yjf�q�S������������
��xc>q�����h�U�fh=a}������4�y����R�G�c������k��9CY�������o��-E�i����m[v0Q�O�����s����4a�d}��y�B��`�A@�!p=���c �qa�aBr0I�!t�q(D�k*r�����wx;	&��\y	j�����TPVtY�dP�_x�!4�{ck��x=�^���U�@?�I;�Kf�p2��G(AX�G*nm��B�*(c��([�c-����GP.(4�^��R8%��rX��LxgR��BeG����b�-
������g����2x�-���a��������	��:��p������r�#��'�	x��(O3��0����d�/B�e��lx������>�0��K�.��\F�l�W�ep����e6��v]&��y����O�w4�w��;���C{��������M������ �w���E����K�s���k�� �!�	^��h���|��������J���Y� ��&��WC7���YC�����c�~9�����f���1���:]��}�S9�����@��6�$���'��0BD�#iSf~^������f�1���g77����H4����p��pe��<?! ,�!6P��ipcW?X�LY3gd��,j\�1�u����1Y�,�7��S$�S��1��s��~�k_�9y9�wmsZ�8��v��f���"}8���6`w� �E;��#pu�0&i���G��3��:o���������XZ���Q�c�H�8h���C���m4���!�!�3�$���q������l�'�q���c166��33�1�������^#O��t
����$�Q�	���uoE�E �����G�;�m��t��I�[�8�������Qn6<O�.������THl�[{�������r	������W���z
��  D�<�
�����
�"A�2G
�/�U�[�����  ��  $D ��0��#�\9I�X#�7wv�>�tq|�o'���h�M�  nH��  �����
��LA �!��-U��%�^����%��Ky�K�r���u}�A��o��#"��  	7!&W��~B�/k3f���%W".����RoM��Z)����W�c�J
��%��������	��  $-%����T�V��{����"�� Xk�#~��Fzc���2��A@��!p/����je���k�m�1�>�x�$��H�K�\��tE�������)IA�rF@����J�A@�B�G�'����i�zV�����  D��M��R����Lz
�@�~=]�r�  �� p# �e�p}T
��b���Z��G�$i\�k(�H�2�+�A �7�1�;
�� `" ������  �!�?oN���������s�~�F�D�c�G������-?���oG��\.��  \��{y?_������Q�r�}r������IyA�rC@����J}A@A�B�[�]�F�n�6=��@��L�+���P��2�p&o����J�C�/��  �E��j��Y�{i;3��6����Z4�
�n������>]��rI/Q  ^�Q�'�&*B�&*�r3A@  nH��  �� @t�UD���=��&���A!�)���*e
x,�%A@A��E �Z-Y�0���w:v�d��&}Z*zc~���n:w�|�yKf�� ��@
_�"�{BI!$!B�&!�rkA@!p���  ��  ��  ��  �� �����`��"B@��`��A@!p�R2A@A@A@A@A�*�2<qJD���{ RA@����{�RaA@A@A@A@A@�T�{��W3��D�����c���!��A@Hb��M� �A@A@A@A@A@A@A@# �FB���  ��  ��  ��  ��  ��  ��  $1B�&�����  ��  ��  ��  ��  ��  �� �W#![A@A@A@A@A@A@A@�!p����A@A@A@A@A@A@A@������  ��  ��  ��  ��  ��  ��  I���I�����  ��  ��  ��  ��  ��  ��  h���H�VA@A@A@A@A@A@�$F@�$~r{A@A@A@A@A@A@A@4B�j$d+��  ��  ��  ��  ��  ��  �@# n?��  ��  ��  ��  ��  ��  ��  !p�TiRk<d+��  \���������������+��  ��  ��  ��  ��  �B��C�h�RA@�K&r���!r/�'"%A@A@A@A@�+!p�a�{�x��  ��g�e���xN/	A@A@A@A@A@�G@\�P���� �� py" $���\�V��  ��  ��  ��  �� p�" .?!p/�*%A@Hz��pA�)'�c���  ��  ��  ��  ����������BZ�TSA 2xM����Gv�\%��  ��  ��  ��  �� �!p.!p}�I,�� p" ^�����i����i�O������ Y�
f@��  ��  ��  ��  �� `A@\D\K�����  ������_�p���h��Y�F*Y�0}�f}�zS`o��1U(w3��|}�������+���
�=wN���������
�����_��@��i��}U)�uY��E�h���~�����  ��  ��  �� p�" .?X!p/��-�A@�dF9��4�o+��?��K�z�L�O��A�����������&L�$�<�I��S�2�U~�|Ko�E`y'uF��,��%��"P������������/M����A@A@A@��!p��
�{��n��  �@��?{.��+�\9�������H�9n��e:��Y��*��J�(B_����;i��<+2�u����%��>���>���y�Kj�B�[(�u�h��o���v�Xa�v�X�Oq��T���t��_�z��X�N������\��}�~U�?��#��k%EA@A@A@��@@\F]\oM/�
y���e)��9h����t�roJ*A@��T�R����#�V.b�@r%p��m���������l��O�n��H�|�������K���"#�d�@w�U��2i�d��t��}�Kj�&M*��rJ�2
�6m���k�b����c��D�B����t��?���:�L
b�e���6�*UJ�~gY��-3-Y������G*'��  ��  ��  ����[���zE���:*��/W����](�A��f��|��T�0����>��)=r�W�X���������������7���z��JF�V���2^C�GtU��O��kov��r����n���6R����nW�$A�B@�@�z\RXA@A@A@A��F@\~<B�zk�B�z�IR	�"P�V�}�P6�}E�.�N.;9�[?S������t��C���d��|7����P�y��f�}{���=�����l����h��'nJMz�����h���	��O���B�`��{���	kjD���yw�+uO���������  ��  ��  �@����
���a	��
'I%D���O7��en	e���3&���u�`�j8�yv:��}~�������#���;o�(��_Os������Y3S����P�	N�����M|���u��t���(O��O*ok��;y�
�=2A���I��q@\��v��\�*���H\�[iW������  ��  ��  A# .#*��f%�7�$� -�K�'Z<����g���B����#�u��Tr�b�:n��KI4����m�BLk9v��������}�N?I�Ly����:�����T>��7!:B�
��[��w�+w+m��}�RsA@A@A@A h��eD������������E�����"7�D�K��[v�O�������;��&p�f�J2^$[u��^��W*�~�Y������Th���#���R���;S�����M����-������;'���V������iD��t���PA�Y�x���<:��}�h]~�;r:�|�}��~�|yr��U�O|E{~��.;�c����j��v��O�[f���`�(]�44��	��)Y�jV��>���~�p�Z��y����p����"�a��_���~����o�d_��6:g�+;7M�TtG�T��Mt]�L�����I?���~��s�|�~�L��Z<Y���<M�|��]��#+=�Pm������g�>�zsa����3{������1��a~��o7�?�'f�)@�
��<TK�/�z=�2����|�~����?��L�����OKU+����R��2Q�4i�W������
l��K'�h�"��T�T1����~��
��z��*��/"�Sv�������g�����'���8�$�Oa���?i�[sT�x}���� 
s��N��e���e�����nkhox�xn�{�]�<}����g��<�vl�m��;�P��9�mf�����C���]{S��y!����n.ZP�9�u`����t����k}��G[��2Y�S\}5�V�8�*q#���2g�@������=���!Z��:}�L��h�2�Oe*�WS��K'O���i'5���$�=��e�xmz:�m���Gi�O��������$U��R,9-��  ��  ��  �! .!���A\o8I*A�RA �	\'�5~�]B��tJ:^�Jy���>�{��{i�ko��5%u���L��(o�<���?�oW�6��dyL�u�+S����"*gP���������F�r�T��g�g"E_�r�2�v{$K/D=;<�����P:s�\�����4�1?-�jMy{n���>��:��3}���x��P*+Y�t1z�iy�$ ��z>;~�)	�R� ���T�o�s4���J������U��k�&��Q��S��${�����O}��f^ �&�����y4�~��3q�G(Q�0�7	��a
�}����I���JS�-x!������u�V����=v��iz����9���2�N11��c�Q��%)m���t���|��:{��AJ������<�� |�����b�waI=�~,\�Z�M������G��u���U��g�R�������0�7���U�e	�~Dy����<���v�U�D����Y��n?s�-Y�!�����L�G�R���:�7]�mv�����7�-�=5�P�1��l���Z����y%E��@rRA@A@A@�! .��F���%{}�ce����?R�JE��_T�E[���Hg��e�������"E
��Z?e��n��
�H�Q����S�
������N�/���������m48�L�R�7k���|������+G$�.m�c���-{��������5���^hz�Q��d��W�����z��<A�A�?.�%����FO������Z������!�j��54���]�C����J���f���\c%
ux����F�|d�;J�5LR�������Y�b
e�<-f���3O��Nm��Gy��g�j|A6���(��/�:���C')�Y�en�1��Mu�K��t�����jd���**o�qS>��LR�M�Z��[������A'����E��.]�&uH�����_���	nbs��������P�l��Sy�����<�7��Xa�[�pl��1������n����y?q���lT ��T���
+<������.����S�B���3�'+d)
��l�y�^�%������x�����ly����<����[y�n�vRN�}�+���
����<F�7m�CQ�������\�*w���\_���/�Vg�`����+��n��m���*�h�1n��nj��������|x���|ys�P�����>f�	��%sz��S!�^�7� %�����Z�V���W�w�K�������F��1�}U�)�:_����GD<[M��{��7�j����
�}�2����6�����(}K�J��i�W?p��#�:�h����H�.��W>��DE@������p*�A@A@A@����xX	\(���(JwT������2r�4X�C6������/��������[�g���>ZO����������P�<�%N�����a���2g���]�']�7e��Q�X������u��A������d���*U�H�8�h6V�f�|��P0�d�&Bs����V-���3	\����3���<q�6��{������0����� (��2������e�k��BY����sb��i�������tc��r�������h?+!�[���8x����I,p��<�U,K7�ZL���i����'u���Q�n)U\��[~�����=�{�bi;4�)� ��T(ME��HY�ea"5�:y��������t�����W�n*^X��z��1~��B�_S��M�nlfo��}������;�����jPn�/ i���������N����I���Z��%���E9�t�V����� �6!l1��nM9s�Pi��<��n�6���A�'(�5�|������W��a�)��\�j=���4\�����>Y<�vE�/������^y�Gv��l�1q�lZ�M�����BX�	�>��&��x���d���;�����",s>������g~Fk7&lw�3���zu�T�L��e��vb�{n.<x�TE�����
�o���5Vc��6�$��d���Ac���#��x��Y��[r�-IR�m�F��z����&�������x����\��mm�Nx�JG�T$@��w�����P!���38y�v�u�E���A�c����"��[������2���\�/�Y�6�>;�~5A��������(��%�Ox�vR�`���E?4����\�o(�	�g<��N�@���|.^�rhx�����K6����o�J�>X�.z�^5u�
LZ�[����]wd�����!n'��M����\rLA@A@A@��q0	\9��6�|����������I�����e�5� oM|W�N�������[��:n���;��$LCmv���r�*Z8���i6�(-�9w��B�U`�VJ�$�c{8D��1�B����<�V\0gQ���i�^�y]��!�����K�VyZ��7�����{����y(q?�5_�v����+�����u�2�����M@�:���6U�QI����/����m�Y��6�v��
����I ������*����Z��'
�O�F�y����-H���69�?��)�X4���1���oL�9�r4v?��f��_,Wk��+]��L���/,W�A�	�<A�A�'?.��������5����O�.������������{�~���v�C�N�i���	�X�(��c�����.,�G�o�<�~��0u0^���5�=7W_8n�Gj�O��n�,���uHa����h�|���.���h\�$f���'6��Z��A��Q�)���Yj-aK_?��w�3��@H��<L0U�����YnkX��M2g��F����uyA���~?p.�g�H��*t2ZwIN@@�b��E�W'�!TV����0���S~O>|a]l�a��p�:�C�=����m<����F�)\��M`�7G�1�o�B��g2vP��8"���/?�#���r�+!��q����8������.)�}o�
��  ��  ��  �@B��eLL��f����%R}/�?8���6ExTx��`��������w��:
jQ��q�5����Ck���gX�mz��Yz���x�ZMb�o��G�1�4@(�k9�[���(o��3�}���������9/��Zx�aO^�������2��]j!=������;��'X����zs��>^y�c��M�6����^��P;�m���'9,�_*_x����X��0��?YH��'T���W�I���w�e�e�>�.?�C�8�l�
Z4�K���^��Dy�$p�h�����G�E�g�x���'���/T�HA*���Z����73+wK�+
�<v���w&pWLo�H�TZ�%DD�"{����$��}_�}O�e�N��})T��,Y����J��?��:o��;s�����w�=}z����9�=3g���y�3Q��[o6�*+��c�s_���z�&#������9`���w:l�v2���~�����>��*-�\4����s�����|������E���60?��������v�0�Q�'*�5�t�1HV�Eh��*��&���� �p3�+&���=����gT��aB���c�`�����s�z��u�e�t
��������27/���.
���Lp[�{b���\q/��O@x&2�:�toj6C�ixq�b�
�Q2������{�s-%:�z*0^�s������*t�b��wnc~�v���A���o_����v�az���Xe�tow�T:�(A�h����_����6J~������e&l5�If�ml�g�^a�f��1�u�=����;����;���,z�������z|w�=Ux���$@$@$@$@$@$@$@�P�UV�5��T���N1U6n�:��Ap���y��Z��0#������>��
r���+���Gb�s���_{G^T��5�3W����KX�r���1"����c���^{W��.�c�a.R���9q]s������������
��O9������2q�}n��]�������SM�27�i��P����_�q��]���Zr�%!��<f�
=���$��8�Q�'*7��9V�?�����
�c5��:�ve=
57�
�{�z:_��0��9�tL��?Gj���m����n2ye\��eaS��oB������A(��71�W~���?��&)��:?Q	�Q�N�.����� �i"!����R�����od��s�vn}���h8U?��cN���Z���:MV~��|���U�\�+����{��z�q��y}��v*V��a�7���{a���f	��Md�s���f�)3��[�~�h���Sp�d�������u�4��]�?��{2�n���������i����wn#��S���`�u�-W�9a�a:p�L�8���Q�	�����-����p�s���(��c�3���o�qC:��E
��.�'=�r~��	��=ax'A�x���g�Z��Z��|�_}�F���e7W��S�G�Lq	�	�	�	�	�	�	�@&
���
��`����2s�C�����A�(�_��*�����
us�'��sW0��������}��~��g��0l������t0�j�&�X1K��=�d-L~
��������
�,?��1��\7��[o���T<1c���:_��O���8XP������g�����%'�c�0����{]�B9��R����(8	������{�����%K����KJ�����U:� �?n��w���w 3h�������YB=9��jm�����1�%J�:?����8��d?h}���G�N���<;�
o��?�J����yk�C@H3��/VDl�d��=z`[�����d��tl"V����Z�*]��:_{�>��l^���;`�L���^s��0�]�G��<ut7m7�kC��X*n����yI1�*�Y���f��M�6�����K�	��G�H�.�_���25��y���
�e����f\�]��e��8�n��aO>��<��[�<��)�������iR�����]��^���{�9�1Ww���5�H���2��a��va�(�����y�C����
�{������I�H�H�H�H�H�H�v���,��[�`A�>������+��B���W��sk�c6�g(��z�e�fr�a�������M7�lF0��n�6�w������W����_�`��}����a��O����r�J&��3�O|B�*�BPE�i<k��#[6o1����C���S���:$�s:�y�����OTn����iv~���{F�9Hx=Z�(���Vsz����wt`��c����kX�tnkXT���<(���}`')|@a�`P���-����>�5$������4pWb|�:?���({~���}�J������c��v��$hN����~W/��^�� ����W�����%��cT��\�Ai���@�.�������!%�~a
�d���^"w�����o����(^�}q�<���y�w��"7Jf��t�^CC��;������LUDMV�s�8U���/<�j��}
�-����}{a�j�w���5r�h�i���?�,���]���eJ��!�5��]��W����d*�����
��Z@Ey��T�B�M����:g�kg�q�4ot�Y4���>��d��Dn��'���f,Cj��kv�����|�v]��V����GL���H�H�H�H�H�H�H �\�a\�AHc�6�A�����,���n2���q�(�y�*\�����N��30��XXa��s��9/���.��f��w���gG�'kV��Y�G��\�s��o`�f�������lw�I���W��13l���w�i$��i�M5M��vWX����9��D%��d���}�&�0��0*��^����W����Rp���CIw�N��I1����k��o�5f������}�w��~G�Wx�f���{`�0'������>�t|�oav\�
B���I��'f9D�����S+�K�/�Y�������A=��6�;���i��W�e�������mn6sC��:��i����a�t	���u�"Q�/�9>�ZNp�d�YE��N8&,�����$��f�������U��+���y�<��,��e��|�8���R��kF�,�\����I>���5l�:�N<���k/9��#44��o��z���>t���[����~��f��*l�W�;U�Z����s��k.0�B�qS�Ay��#�#,=���f�=Y���w       �Xp��+�X�@A�V�i
����j�|�x�,�`�����vU�O7��(�w���k���������������oWe~���w�������t����r�Q�>���^������;
Jz�Y�����x��+d��Wl���
���y�\{�U&��4���6�]s]�r�If�'�#��Q��a�+�(������	,_����L7��.�]X�x/#���}U�|����c��{XPXv7X����w��|�:?���`�L��J'�1�+�w���c#�t��Sd�w�B�X���;��FI�Kz�,�U�q
-�x�A�q�������W��~!#5m<����N^p���U*W����@��d��u��7��_!�x~�'+�����
|��}Y���������4B�V���@D,?��2�9����kY��r�z�V9��+z�]l>7k���{|�����?b��?�.WFZ4�k�����l����6o�u��7���<�T����{������5������X�>��y��(8n�y<�����K�`������p�&��	�	�	�	�	�	���%@W��.����\�}y-9�����P0���m�&���X��^B!�
��B���
���g�b~N?K��"vT<V�O�|^]�
��W>DC4��P�0x��:�l��O���HU�����R���N� ���nY �5�Z[�����O�.�1������gT�aT�)�f���'Y��v9Q��fLx@��<k�Y7wGSN����,Z��syhj|a��7������U:�dW�E����4���������������_lD]��^�����:�+>a;NCc�0��pw���n��u��U��������V�:B�CP�g�p��tE�0������+���o���v.�=-�Zp�-�%O��]+V(''�N_��Y
!w����[g����t��&�8�E�l����������Qr0(d��^fw?��y��m
.3����$����*���^������y����/r���y�3x�|�*6�x���H�H�H�H�H�H��<
�Z^�����S��,�V�"e�<�.6�?��Q���8�*6�H��gK��y�f
�9�x�"�[�������;dP���!���7
a��"H�M���sA2y0��E~�!Sp���^��ex/y�y��gm����!t��=���������8�(�s��I���2�
A���v���B���g�^���N�.��0*�p�Xz�'+�^���9����zJ���,n!�Bd��^�=�E���������*�����P��e���F���R�n��c����<o< �������t�������bD��[~��Z��������kI��I�3���2\���F��v��}�!��:��s�t	�1G�oe���,^.�?�����r��{V���v���=X�P���A�1��:5����0��AC���		nC�����������$�r�n��0?�4
����
f}�A f���	�;���<�?[>Z�e��Zm�\,����]�����w      �y(�j�	�nq������y�T;�4��6m�,#��1s�����o����pr%���������2�*wu��,�'>F!,� �u���B�x���6�����*?8n��1���F-n6��`�\a����yE��sk�c<�q>^�m��p�}�{=�i$�=��0*�A.��q����V,���N���7wmT��c���|�_}������:�w��	3Wt�D#�(Yu��u/1G���O��7�4�RN��6�o^��;�Xu~��)��PV�L;��y���0[�y~)��&��61���~����n��3/Vqs�b.\X�����G���^~Gf?���~�������w������I�~��G)F�@a�3s��
PSf���������0��T>V:������6�6���=~���e����e��g�=��}����w����s�;6����D�uX�pm�/�UMn�!cp^�~d������������/V��Q3�n���s#��������h,����d�����Y��i�k���>0W�X�k�${��q�y|��������s�"V��Ws�.#���	���_~'      �%@Wy�p-���KJ�NM��z�����)�|��]�����;�a��b�����\X�
�[�,��Y�������V9������<:O/|�|O�OT��qSp�/*���3�����2��(����|Oi�}��Nw��B��7���f��Eq^Q�s�b�^���?�'�Rp��'{=�����u� �5�y%#�&����%K����Y���d��)~��]V��b��o[���?���'��~w�LV�-}hi��ja��m�o2���,�6�����F��V�2I~Z��]�g���/.�wh�@N9��,�p��������'��K>����,�#;X�jf�w�u�����a����x���]s���>������g�'�}#��$86]7Ka<]!(H��)���]�p��B�`�`VT����F�
�'����E������a�������|�2�&`=��<L���s'���@�p�X�x���f<����Vl{u���r~���e��G�Y\���u���'�by�N������p��{��?S��%{
$p�fo�c���_6J�>�F�>�4J�������p���Rx���$@$@$@$@$@$@$@(�*�d\`�����2:��~�y���2h��\����_�7f�x��i)������ ���7`�h��E��_ZS.��0�l�w�$����&kQ����
�C�Z����2Iz�"��������~���	1���>^y��k8��u�����6s
��8=����s�����B�Ez�^�n���u�� ���z�bo��U�u������{��~�}��64n��d�����
�8�m�o�c+ci����I�a����_���3t��tX���On����v����;u����.3���������C��w�C�:������>������5�w��PEV4�|���q�V
;:J�������&���=��#?D�{�S7�	�T}���Y�:�I]
���j6�?���o�a������7��I�O������p]O��f��7�����i�O����{b\0�}�6;q?z�=��6��>)���I��ww�C�.W��G���$fu��C.!�{5��[7H��y�)���+���<���{iaL:A?
����>����n���ZF}������S�MD��I�H�H�H�H�H�H`�������O�������b>o]uCJ���Qj�;���rK���'��U�H���y���/�k�BX������wu�SJ�!v����������5J�'���0�
�HbD�����>����W8ZnlRO;f�3��	?Q�WT��a��m�h��&�/�{M�x�-����H�M����!��Q]�Qp�
.���M�����{�7u������z�u��;I�
�����,�`����&�,;�� ;�>&��*���5&��9X�h�8�����C
?��O~p���m����V��~=��#e�������R��=M�x����c��-^��L{�Yy}�������@IDATJ2�hX�D
�{WF(�0Y�t�L�5������e�t���?���=�,�����Cz���R����,��:a�����k.?���n���n��[��&��4��d�~��`����#�BhFH]x<�^�1`�l����5�9U��t�l��w�'��pk��o��	����o���f���Vt�O�����.�!�_u�Y�����������E�H��7��:-�$�jU"=;4���9�x��z�y#N�����+���g�*���Z�h�I����5F�u�����x��x��_����~�@iP�B�{���3�Y�\���*+������f��s	�	�	�	�	�	�	�@|p��p����M[���>xkb��-��l����U*	B�� ��:9axNt4���\�0����0�����,*a�vxi��I���X�y5�_�F6h'�o�a�!�^��/QL�����2-��D!�"c��0^Ar�T?\�����{^s��zF��s�!�����(�+�� kn�����_���u�F���J�j'I�#�jy�������y��u=�f�O�!�*��"�^q�erf�j��v��~� E,"+So�����[_Y[0�2�����w�Z,?|���9�Wf:�>ax�!��8����H���W�F�������/
( ��;Z�%_D�F���^+G�gc6L�OT��O.p���aM/:;�����!O?�f���/��/����&$(��S����<<��d2���_�����/.0"��5�OdN&]{nyM�-r@!�(��x�BL��������so�Y���X'��o�@�4Ly���
����Ou���F���wd�C����c�*k�����R��T��J��s���e������R�%;v��%���H���
���?��U07�R�����������5�������������J�%���l����,�|��	ynEg����g��>����QP�_x����X��)m����
��qb������4���2i�����5&� +!4"L1���I�����d}��>L�V=�8�����uC��x�O��d��0.��������#h(����l���6^��u��I�}�v��6�����t	���u��qK�W��_����p4       �$	P�U`V����#�����i�R�d����N$,�x:c|���\/��}���	O����TJ�<�]�����C��Z�2?Q	��������*9���6����.]!O=<7�|��Q�W�A�0`��-�J��2:�l^�>��t^�A��<���H�snpK��<��F���!,{����>�t���]�	B2Y�t������
����~�����������7���'��?���J'�������C5��������c��D����m��9gf�Z��R����#�/�dDC�5m���2���L
a`7��E��'��5�tmbyM��ya�`��EhYk�f�Y��xh����t1���*YL���*�t\9��|�{q�H� �-�t�������&�Ox�"�7�I�s��^��'���'��,����}��E�k�P��7n�&���y~?�'��q��gUx��+���,]e���7��D)�������7[��/����'��)��<�[5�^N��R��\������O�{��[�9��If��9�����sP�U��/��84f-��5m�+!������';fc�G�k ����K �:�8��[��b�k�>Z��z�>(�b�t	���p�\T+c������Y�w       ��(�*(+������N��+�������/?�J��D���K����{���t��:3,�c�����/�+��������!���&E**��^�W��o����#[6o�\e~�pm�������C�rm������My����Q��M7���4�Y���rv�3�����~���k�[����������3��s����#���=\v9��
�����R�w	IX��h��d�=�q�A�eA��ko�Z�Q/U���������xZ�kh��a�p�7/��<�U���������~(/�}��vy:?��O~pQ������o,������ |�;�o�4��v��\&��;]x�E����������W�WY=���`�FB�NWA��+z�����q?�����V����������� (���Z��u���H�2��w�vj%iu����������eY��]�������S3�A�%�<?+�?�x,Z��;�i�����vZ����yxg����kBg��������aRB�[0���0`�?�8L�����4t�1G��k
!�_���Y����BD������2Q�R�,�e���R�����T%����@�]������;��Nd{��?���	��]w���/V���LzH�o����u�ASd���Mz�/��z�\T����rs
y�N=�_|�=Y���Z����[�k)�����z�OV}�c�DQw�yze�G�	���?�F��E�u�R�	�����U@`����
����9��s/����\B$@$@$@$@$@$@$��\Ed\K�iG�E�4�:�����
���+r`n�rGa��{�	aj��'>��Y�t	)Y�����Um�����9r�87Za
!w��XE�B�f���:���SH;�����������u[���=�t]�6�T?��������s��:8��!�L�s�������
�ex��~�=�����YC{��\�H�5�������_~Mx?Dql�4rZ~�����"�V�y�#�Ad�!$k�����t,�S@ Y���W���7'�.b�����d���o�W�����ML�Wx�!oa
�����	
q6�x��|!\BL�;N���	,!���C�c�Fv�X����^�yC)�)�0�E]x�o����;U,v
�)��x�����c�1���������u�W��������|���\C��k����%��W��
�=�{�{�x��4�Hv�e�	/yP1�c!S�a��dC�#lt:���9*����I�H�H�H�H�H�H������nbl�oq���{c�zf��yI�����S�$@$@{�@���<�H�H�H�H�H�H�H�H�H�r2
�Z:�p1�����Ko��f��������h$@$@$�k���_b�F$@$@$@$@$@$@$@$@$@$�^p�o�������k��[o�{b�I�������2c������7��@N&p�9�Ki�70[��/�h��T���$@9�������rHa0$@$@$@$@$@$@$@$@$@y�\-�(�s/<G.��v���O�Zz������WdY�$�	����T�tlJY�r�J�9������$@9��N�����y5sF��         �[(�jyF)��yn5��[�xQ���)�(�V�����%[�l�[W�&O@��c�?:�s���od���RJ�;�	���
�Y         �/(�jIG)��g�������i$@$@$��	��67��N$@$@$@$@$@$@$@$@$��P��K����.��H�H����s����4          ��C��r���{.6�H�H w�x�����%        �(�j9R��3��H�H "��z������d2$@$@$@$@$@$@$@$@$@$@a	P�URp�^.��H�H ��o������m�-d�	�	�	�	�	�	�	�	�	�@n @77��H$@$@$@$@$@$@$@$@$@$@$@$@$�/P�����$            �
(���RbI�H�H�H�H�H�H�H�H�H�H�H�H��
����y�$@$@$@$@$@$@$@$@$@$@$@$@$@����PJ�#	�	�	�	�	�	�	�	�	�	�	�	�	�@� @7_3O�H�H�H�H�H�H�H�H�H�H�H�H�H 7���J�y$            �(���b�I�	�	�	�	�	�	�	�	�	�	�	�	�	�psC)1�$@$@$@$@$@$@$@$@$@$@$@$@$@���|Q�<I             ��@�nn(%��H�H�H�H�H�H�H�H�H�H�H�H�H _���/��'I$@$@$@$@$@$@$@$@$@$@$@$@$�P��
��<�	�	�	�	�	�	�	�	�	�	�	�	�	�p�E1�$I�H�H�H�H�H�H�H�H�H�H�H�H�r
�����G             �|A�n�(f�$	�	�	�	�	�	�	�	�	�	�	�	�	�@n @77��H$@$@$@$@$@$@$@$@$@$@$@$@$�/P�����$            �
(���RbI�H�H�H�H�H�H�H�H�H�H�H�H��
����y�$@$@$@$@$@$@$@$@$@$@$@$@$@����PJ�#	�	�	�	�	�	�	�	�	�	�	�	�	�@� @7_3O�H�H�H�H�H�H�H�H�H�H�H�H�H 7���J�y$            �(���b�I�	�	�	�	�	�	�	�	�	�	�	�	�	�psC)1�$@$@$@$@$@$@$@$@$@$@$@$@$@���|Q�<I             ��@�nn(%��H�H�H�H�H�H�H�H�H�H�H�H�H _���/��'I$@$@$@$@$@$@$@$@$@$@$@$@$�P��
��<�	�	�	�	�	�	�	�	�	�	�	�	�	�p�E1�$I�H�H�H�H�H�H�H�H�H�H�H�H�r
�����G             �|A�n�(f�$	�	�	�	�	�	�	�	�	�	�	�	�	�@n @77��H$@$@$@$@$@$@$@$@$@$@$@$@$�/P�����$            �
(���RbI�H�H�H�H�H�H�H�H�H�H�H�H��
����y�$@$@$@$@$@$@$@$@$@$@$@$@$@����PJ�#	�	�	�	�	�	�	�	�	�	�	�	�	�@� @7_3O�H�H�H�H�H�H�H�H�H�H�H�H�H 7���J�y$            �(���b�I�	�	�	�	�	�	�	�	�	�	�	�	�	�psC)1�$@$@$@$@$@$@$@$@$@$@$@$@$@���|Q�<I             ��@�nn(�\��
��;�������r���	�	�	�	�${����X��|����y����5�%8�Tq)U��|��7����?c�s�����n
�v?��q����C)!
��~�`�8}G�H�u^�	���������G��m���1����H '(\h?9�����*���������O��IYd^H�H _�����;�'{���dH���_K���d����<	�@�%p��g���w�,7/��D��,����\}��R�Dqy��w��o�d��H 7(sHI�������3��%7m����k�\��������_}'w��k����>����d���R��d��Y*�~���<���)%�{����I�N�}��Uo�+��������rq�3�����1G���
�k�������:�u^.�hvc�,RXF��F�-b����������1<	�@^'��kS9A��6�����7���}8��c��������9�����@.xH ��M���&���w����]
�Yv�5V~�ec?c�	�@v	q��2�W3�����H��#d����M���0��&��p���j}��1`r�!�C�#���zr�)��/���<����%��B����������F����/����fNz�s���E ��,��O���t��\�P�bE�e�����}����������-�����:/mh�D�u������8�\:�/_s�d� ��
�/��������������,���GQA����5��`��/<	��#@w�������7��gE$�,�CJ$���u�1^+-������������H�������M�B�Rw����*Mn���j���K�a|�K���i&��q]9���&�s_x[��Z�m�����?��|��K�v�	3H1#!�|�A:�0��0�?a�
t�]Z���[�Q�"a��M7aB�@�~�F�:����i���,���������X��b�G�]��R���c65���2� ���A$�'	�������~�O�,Z&#'�v�|/�o�ZI�g�;���?��Y�~�Q�)Y�����{�=c�y/B����H������< �|�M~@�s$�|O�n����#�#25 ��L��*����>�TV|�*�Tzwj,�_{����g^
��+��<�Gj�}��IXL^U��G#��@/��k�����o�/�m��Nk����W�/W]ZCV~�F�
����3"���S�
�,�����\�h���f�������P�p�n��M�����	�_��4nXG��p�����u�&�4�6��$���)������� �vz�����>�t1�G����?�fh����v�.wO�����g��,X��L���C�>(�W|���}/*w��r�YU���o�����o��p# �\N�n./����09-��	�@�'0��]�y�^S!p�C�r~�sx�3��Gy�@�/�t�&���g\�3���Q�� �n�Q���y�	&�/���$�:���f����s��7�����&QM%�6�������(���b�I�@�#Ph��rZ��f
�O��V��`�������������"�<�wQ��^���2~hGs���=)�.�$��&�^*QnD$��P��sE�gO(��={&<:	�@N"�l�WN�{N�y��Ra�H H���bF��.R9K
����� ��~���]
>p��jCz���v�����;M0a&��	��E<�0�	�	�	�	$?
����Q��nT��3	�	�	�	$��eK���5�d~q>Q	XT�*��v� t��V��m|�k�7����ue�X��?�<tW[9�ry��o�S��0�tE�3��
qT-Y���5�_x��U4V��N������e�� n�����%���Y��`�J���ZK�R�B��c���f��O��+,c�:1!!h������H������
��p��"}/SVR��)��Ol�����7�(7,���8�t��
���,Pk]�H��"w����I�h^L]��B���������-��H���am�~B�	�:��"�����3t�m��=v��gK�:'H��naB���#DE��,��,S6���������z�����i������'}����vD~�C\i���O>AO�������j��PB�9��;��Z���
�b�~0�[���qP[�h��po=T��p�=��;��n�\p?�P�EX�cr�7U������H<��g�������N��B4E�i�)�c���^�yN�N��W������LY�u
,��`c9�Jy�=~���1T��q����s3�p^����W��#F~*~��3snL���+�m+L�y^Em�����^���	BpO����AB.�o��������0Qd�Z8����&��@���2�2�(�[�W��_�6�GCn�}m�f���|���7��k���x��&���2	�	�	�	�@r�����RR��J��H�H�H���Q�-�����;�w��p�-0,�x��k�B0������;I=����eO�Aoi|T���&��0,2��5����X��W�1�2�)]Bc�nrY�ft}�6���C�.P�1O��d��g5AX��PBc!"��6n7ilD�|����(���D9�XX��x���vvD���C�b�N����j�����B]<��9�B����3�5�w��cP�H�+�9>U�>�����C��r�9�R�����^��{�p!������Y-�9��x��i?�{+���8��M�2�D�W�h�}�:���}�Qc�ft�D�lC���D�n��W��<L�8Z'�D�LD[��p�R%��[�#d��W�>���a����cq.���F�&Kx���cy/��?��G���Z�W�>��	����������������Q�{I�l�~�������5�j��S�5��n��[������	�	�	�	$
���q/
��#�H�H�H�H e�=+,����Zx}��;��$@�����'7���x
e�����H      �D%@7Q[&N���'�<-	�	�	�	�X�=���U�����p�b2�����@�	��n�5�l��T-p�������yyB      ��$@71�%n���7�<1	�	�	�	���.j.���U��5����S]y\!       �-
�����g����M��	�	�	�@�@l�g<$��
�e������0��+|���\        �
�����Q�M��c�I�H�H�H �*�_�����R!���'M��>�����H�H�H�H�H�H�H�H N(��	l���n���E$@$@$@�E�;�:��S����O���{�&V!Y       HApS�Q�U�n8:�F$@$@$@$`(wL)9�d1���2�C��F�e       �
�qC��'������R�	�	�	�	�	�	�	�	�	�	�	�	�P�M���n�58�K$@$@$@$@$@$@$@$@$@$@$@$�T(�&Use��p���g             �x��/�	z^
�	�0,	�	�	�	�	�	�	�	�	�	�	�	�	(
�ivP�M�guI�H�H�H�H�H�H�H�H�H�H�H�����j���n��$@$@$@$@$@$@$@$@$@$@$@$@$/p�E6A�K7A��"            %@7�n
�i���.	�	�	�	�	�	�	�	�	�	�	�	�@R���T����R��:C��H�H�H�H�H�H�H�H�H�H�H�H��E�n��&�y)�&h��X$@$@$@$@$@$@$@$@$@$@$@$@$�(���m@7���%           H*p����^X
�Yg�3�	�	�	�	�	�	�	�	�	�	�	�	�@�P���=/�m�H�H�H�H�H�H�H�H�H�H�H�H���4�
(��Y���$@$@$@$@$@$@$@$@$@$@$@$@IE�nR5W�K7�y            �
��"�������
�b�	�	�	�	�	�	�	�	�	�	�	�	�����f��4kpV�H�H�H�H�H�H�H�H�H�H�H�H �P�M���za)�f�!�@$@$@$@$@$@$@$@$@$@$@$@$@�"@7^d��p�aX,            Pp��6���f
���	�	�	�	�	�	�	�	�	�	�	�	$
�I�\Y/,��3�H�H�H�H�H�H�H�H�H�H�H�H�H ^(���l���n�6�E$@$@$@$@$@$@$@$@$@$@$@$@J�n��p���Y]            ��"@7��+�����u�<	�	�	�	�	�	�	�	�	�	�	�	�	���x�M��R�M��a�H�H�H�H�H�H�H�H�H�H�H�H�H@	P�M���n�58�K$@$@$@$@$@$@$@$@$@$@$@$�T(�&Use��p���g             �x��/�	z^
�	�0,	�	�	�	�	�	�	�	�	�	�	�	�	(
�iv����4�1�K$@$@$@$@$@$@$@$@$@$@$@$@�G`��w���Q�����������8���4-��Scw
�����	�	�	�	�	�	�	�	�	�	�	�	�@j�������p���8��  H�x�SM����Y�T#��7�Z��! �����f�-$@$�����b��N$@� `�G
����������"�	�	��������<)	��������I�H ���O��`aH�H ����1/@$��L�H7I0�bS����' H���n��(��N���Sk��$@�N�}~���?	�@�`��n-���	DJ���p#%���Q�M�d�I�H�2M�|�P��4BH9F��o����I�H �	���v�� 	�	�(��9��'H`��������Q��%M��H�H ���
���j,+	"���w	�	�������)	�	��}�$@$�O���p���\.��kRV�H�H B���n���	$>�	�,
	�	����8��I�H ���O�aqH����)�&L��� p���g' H\���n��KF���
E��$@$�z���^��F$@$���pt��H �	���n��p���YM  � ���nf�@�����M��	�@����J��H������h&�H ���n���KR��	��&	�	�@"0=p�5X�����xqo Hf�����Xv ������� �� `�G
����B7M��$ "`>z(��a	$<>�	�D, 	�	��������H�H )��O�fb!I�r���)�����$����k�	�	$��C7Z�e �������& �d&�>?�[�e' ��	����� H�����M�4igV�H�H ��������9$����&z�|$@$;��c��g" �d �~?Z�e$�	�����s�������$	�	�@B0=p�9X�����p���E
�*������d��=!Y���GN�v�,]�Z6m�r�t�p�G�	U+���{.�T���Z�\i)R����{�<x0����^�xQ�X���[�T������v��Tk���'�~+��q�j�3G���_	����F���H�^��l��MV��.1!ds���g3p^�H i���n�4Y�
J7k�x4	�	�@�0=p��
Y��%��7}�>���<�i{iK9��#d�������/����f�*r���I�<G��Od��?��	?��Nyr%}��)%Kq��d���������)�4O�_#'�p�Ca��]���#d��]1���UsiuN�|����!/�'������D����������[�~����Y�b����_)���o��}��!�s��>w�<���)�e��<%	�@J0�#��h��+A7cF��H�H 5	��
�����Uj�������
��N����-]�Z�~#�n�����^����[:�zN��?�K��f�j�-��rU���?������K�����K�opU��O��/��������*�~�^)��@>�3c�|y��O2{��8�}~J4c�+i���8���"}g���	{�x#\��T��uW�}�e�w�W^����O�g}I�"%`�G
��K��(�&y��$@$@�&`>z(�f!$�#����<yrK�Z��t��*���5�6�X�$��������]�:p����a�++���'E�t�w���l�����N+������'��_�hV�����u����w�}���e��	�����9����~���5k7I�/���m�}��/_�h����j�:���E�]`���i�i���8�!�wf2�/+e��7��7^,���pq���2��?]y���~?�Z��%����)�FJ,������
���	�	d�������i�<�r���C�{<|�T:�l��)�,[XO��h��=8������u�J��-�P��o������QU����9.��#F��}!�~+��q2�C4��d�_V��o
����~��sI�H���p��^���&
�j�	�	0=p��0���_�
j����m���^-G���u�$\�f0:���I�O"B��nY)s��
����>��v���uN@BH������-�y��i���8� �wf2�/+e��7\��a�����$@$`�G
�ir/P�M��f5I�H����
�Ah�A	O������Fk���n4����=\��Y�TI�����~(�����C\�^��o���d�����,7�~+��q2b�������J����@��u���sa.	�	���n��p���YM  � ���nf�@���K7��4���x�fT�D����e�g��-�P��o=����P���?Ln��V��cS�d�����L��EY���@������sa.	�	���n��p���YM  � ���nf�@�H����#����f)^�Gy�<]Y�(�d�������FG38�z�����B�'R!$���Q�B�=�q�?�9,��\?\�sr[���`������Q����rhi�~ES�X����i���86��o(v��3�����^f��x#�J�
�������I�~?��z$@�I���pS�}�jE7	3H�H���������&
�j��d~��\A�U� ���c�_�������e����n�&���<�m��A���-+��W�B�K�E�n�j�v��q����0��q�V�<k�m���7����D��R�xQ)����}��]�I���$��Y ��s����
��$���������~�o��l��en���_��+ZX���'��l�U�7��i���V��h���ri�kJ�Sj9���|{���M��>e�lY���78[\������?�)�����#[�nw����\�i�Z�����qB%9�EC)[��L�������w?�L�m/m)5�W�?�-�/&M�m�.YM��)%-�8�i�>�V���������wD�H������K��'������w�?�r���7�'��=����L���C�V*/�<U��-�e�K>��N08��[F���N?E�W��V�^/_|���Q^��{=#��^�������Y-R�����_��-���U2i�/�b�Zo�\�Yp���^=�B)��r����U�3[������\�K��d���/uj�����v��<���
�j�Y�Q��O�������O�*�U<F��>�Q�z_p5�[��p���3�p�r���%���_B�Y���rr�����D��R�hGL^����+�]'_}7]���-RH.<���\��,�>��/��y�.T@N;�����z�.U\
������j��h�J����}������~+^�1��V�4�S�V������}������]�;������7S~q���D�1�L�Sh���������/�����p�D�"r���}RY��r���O�
�>���.��t�!��w�_��;�������O���J�:�l��%�����s�X����F0e������Y��RO�O�M��+�s���t�j����3�J��d����5�J$�\L�H7��-�����it<�H�H �	��
�I��,~ZH����W]�R������.!�"A���������G���=��]��6���z*W6�n�u��h�������z�^���I=�x�����3������_X�b ���S���C�\�N�e�P���u��R�H�P��N����#��4��k?����\yX	581��=�	&�	 r�x�SG0����@��^�A��m��/0�a[��T��:#��w>�*���Q�r�
m�w��p�N�y�#:\z��a�;�����
�v�����������k���C�y��=k!��
�8����x>��fG�2�=��G���*�^w�y��d��Bt��^d��^'�\s�9>���x����f\�w�x��S��A���S�M	���}>�b�M����`Z$,g�����/��V6n���{�m�M�	{��6�S1��SM*U���R���
���5�!��~�����N<	��*�`r�NP1i��������1�KH���&��J����[�9"S�}��~+^����^�r���bh�b\�nwJ��z��]�� ��l�w�j9^���o��XMa0���w>7����]q�3��l���E����j������:>��kz����\/x�x�s/�LVC~<��x|#�u�V��%��g7���m*�u"H��p�
y��1:���$�p�&��d���%�C$�ZL�H7��5dm(��D�
$@$@)N�|�P�M��f�R�@2>���[��V*U�@p�x��Y}���s�:��������u����d��1��}���g���,����k[��a)	��M���C�������h����u@�H;�wy��=��0{�����K����ZK�B��S�.Cx�(��Z��T����p��f��������J�<�e�
��MZ�f��x|�Y��Va�y�T�Z!�����+��ns,_�����y�k){���|#��/T��z����6����|4�;W�H�j����4c�<e{�K6;�r��~/9�������)�~���7�{}�Z�>��h��k�)v�7Z��Z����m�D9s�	�����f=~���W��W_vvD��iX������>�����g�G��!��sv�=����s���$�s`��?�Z�qg��>�g;��71A`�3��g3�~s������p���N�9��o��9���
RU�h�d&��y~���~�E�]�^��A��2��������Ovr��h�{�	[�!z/���xo��u��A�>�����;u�=Ln����8���X~E'���\����'�f�!Px]�F�����-��H���|����|����~����47^�H �	���n���SC
�i���&	�	�@��C7
3H �	$���������+����JuY	��h������-Y%�?3�q��U�a�qRk��/����u���5���$O�`�
�b����;v:,�xh�W�f��o0\�(��CY�`�.}^pY
�caQ�X���}qn��������
(�
�N��{��wq�(+*?.���z��\C#A|xR-p���2O��p�����������������L�Z>��@�o��M���Q?~~n4���u�i�P��iiV�_�P~o�$W������e��L/W�@����*;����]���^	����������gg9�~u^�v�N�8$J��5m�N����~.��6`���U�v\�}��d_/'����?��c����aC?�D]&�|���`�MO��?w�b�{����7��a���x�u�����O4��9���C�Js��}�#�_�u�$j��O�d������~	��z�/����{g��9�Z��^������W�@,��PA�X�B��={�I���k{�we8�Z��;3��V����r��v_�����~b&��#��_��y��R�����^��x��h�����y���N�O{7g��0	�N��/\��?GX�J�yXA��	&�}|�.'c���,Y. ��"`�G
����!kC7$n  Hq���n�74�������|l9��������.�M�@l��@X�!F�����}#5��I�sVP�G����b����a7��t`��F4��P���4�",�f�6�$D\B���������l�%�7��`�9,G��d�,^��3�����K��������!��k~������_�~_k\Q���hV�xi���1��=7��F�5����g��6���]�;P��vB<[X��J�t�K�F����.j����!�b*v��� ��:r�����jX�T�X�7\u~�I� ��U������KV8�D���L��{����4�
����+�i���������"pq�`B���q�?�b��?p�Y���;n���Wi��u�N����y�-���������ixBdA_�J�&U�RQ�;����U��s�}Y'm��H��Gm~�u��}��q���X��C�j�zi�$���kT�������f��}=�!��������[I�U�;�����
l	p�;
}1b�"�Z�����������������������������>H'�����<4����\eW�����Q�N�L��;3'\���2��:_�}��k�nt\��o����Q��|#�zE*���g��,�w0�N�xslP�����]����Q����-�MY�7��D����	�.�?R�M�6v���WH�H���������F�����d{~�7�����e��g�V����m�X�|+R��|;�W'��wG��<��:"�8�s@n���Z����������u� 7�O
{�����j0���G�[o�����\b����J����w�Y�������Q��KB�A�
�(�����.�n�4hS�A�9�w��`���`��B���W����5�A���m\���<�%��6f�r�
���5]{��{�so��W�Z�x�5N�N�A��H.b�����!c����y��X�N=������s���YjU6���]�#pk�PIn���Ys�0��� �
�q���}���4��I��d����+p�B�g�T�\i�����W�7��x���g���
��	�/���33f�sMr�k��=����Cs�s�����s��HM~+�/�q�o�������b������B��b�_�n�/Tw�v�9{����;������ 	�[�n�E��wfN	���:b��/�Rv��Y�F0��T���Ln0	5�?��,]��d�~���,9W'��d�	��D�M�~?���|$@�A���pS�=3��q  �%`>z(��h�Z)M ��_X�z�eC
rG�p�p#=?���F�?A&O��:��`8{zUV�g�:@W��?��C���}1�q;a����7���e�@'v�F��$^-���M���M�Q>�br ����R�����0u�y������Ui[U�����$p���Z�D�g�=d
�������F����]$���A�jg*����������|$��X��~�t��; ���+��K�&
k���CcT�V��P	��v�����s�P�'R~���`7t��.����[1�����9���}'��w���w� �;dtO����1��������Y�~,����
8"�������z���������+,f��X����?3)���p������^b��r�����D���������~F�&�`���2zO����r2��9���'H�����-p���YM  � ���nf�@�H�������}�������.���X��v�Y�@#���9U��c�mg�nw���E�G�}���_��z�5��~�J�cJI����YW���s����!�5;���8���vV��)u���7�-d�p_S��GZ�%��k���vZ�n�:*���'m�nPs���(�Ecz����jac'?�d{{�e�������]��[�]y~+}����ZK�6%�����7d����e�����]���aqzw�M*�;le������������
��j<�����uZ������:Mc+_qN`�
1�"R���`���k]n����������5���K~hF��������.q,v�T<k�����������w��������������.&/;�-�wV,�����-g7�0V�+�w>���9���Z��8�R b/����
.�����o�-���a�"�-+�z��P�H\�O���4J� |��,;+h�����lW��)�)��d��2�	$?�?R�M����p#���H�H�R����������*�<�d{~��W�[}�`�
o����[@����
���p�'p����Z�����;wn)T(�4kT�U��}�]����`�/����o|�:�^��������4&�.y��PWD^��v�?d���.���
�������S4nF1q^?�e���N�]q��yZ};K�OR!|�w�<���MB�?X��q���H~��<���C���5����� �B�n���������~�u�g������m�A�����"p��������Q���F8������n����=o��	��l}>����E���%��-;��k<�
�������#����J����7�y���<yr;�>�_k�T5�]�Y'����Q0?��������w>�J��af����o���b�/VX�%����"�1�E���+PW������V��_��+o}�'��x�3sB��D�z��85Z�{=����W]�R�m����Y���k���-g��y3������~"rd�H�R���)��^�����/f�	�	���C7
�UL9���b���Z�������.\�NQ7��� ����O�hRW�?�w��<��H���rd����;�	J^7��������;�C�����+�=��+?���j��t�GHC��S��N%�u\y����;wK'E������X���
WH3f����-�v�����������q����:X|�g�8\� �K�HSv����;�s�
kvX�����g-l���w��{�v�!9�o2����;	���5��s���C���5�W����:���b!����������A���D"�fG��'���9F�ol{��g�����
]'��T����������	7�6����c����d-lW8����|#���9�k���9G�W&H����&-N7M��$ "`>z(��a	$<�d}~!&]z�r��u�2�R�B��n_5V �xSf�jU*:����73)�\��X�V��D�K�����@�F����g"k�K��Ovt������[��~B�{c&���q��6{�uY�r�������\���{��/j!�}�+;�5k7I�n7��BZ+�u�=y���q�kT�(������[6ve<�{�b��@^$���w��.�P���_��'FzD����}>���]M ���5��}�n��xuq���83�Q�r��W�'�5>9������nv�[~n,�c��������z�cRP�>���bE9�r!��4k�B��Gf5S��~g���;b�X���g�<��^��7B$����xv@��'�[�]���%1W���OL�,	�@�0�#�Ti��A7@�L$@$���G��mbV,�	$��[ >'�\U����J���P)��nf\X����+����_�>�OY�117o��V��d���R�X�``��p�u�3��g$1P!�����)��'_S�t�}x���%$��B���cJ���Z�����6�|�d�{���Q��b��a��V;v�4��7\,���
���6��K�s�d�����g7��.jng9B6m;E"f�����y��zF��`�����w���?��5�:C���{��hbD�Uw�M�*����s_�\uO\Ej��*��������O=��(Y��_;�Y���{�����K�r�k���2u�������~
 *�?p@���� W��)�fG��'���96�o�����.l���{�e�O����u�LK�$�X�3��;33.�����L��Y���i��{��Q<X��Jv����F�D���������Y�y��q2_CoD��n���������%{���Y ��%`�G
�)����Q�u��	�	�@�0=p���Y��!�j�/\���^����|\YWC<�?���A�<;eF�m������Xv��p��� ��7U�RA���+;����\+���*b�"�[������K4��W?�Ys�r�]�k��KW	b����n�T��������:1���:��������V���]9�jE��n���j�����v��r�o�������Y��:��������r�|2~��X{��k��l��vV�����!�b0?��'�=���{��H\��O-�0)��}���=�9��M�#�M�>���WN��>�qgoO�Xe�:������D?�X���{�o���z����I'����nv�[~n,�c����r���p}�����A:i��R!�����x�3!���Tg�-�E���0(d�S�x��}P�b-�f���o�H�F�k8��
���loO��T����=X~ ��!`�G
���&q-	�����I�H�������������@����������tp�N��-dK��:0*]��l�{J;�sM�'�e���gu2w�?������Z��|�F����_2�����kg�s�eR���,���t��7�<�R�|i���V���.[�F-����a�%�)o��N�a�Ev�u��%+�,����0v�,>���r���]�O�>G^�sW�Y��h��w��Gl�+�XZ�f�������[l;m�Q��rp?>���.W��
���v���V�7�(��1	WR��������R�@�@�,\�B>�V`�,<p�U���Yu~��$�9�${�_,��k7J�#��\�~"Xv
���o�	��|�m�~�	xX���S��L��j��w��=�V��\E����a�W�Y�2���K7���X~#D"��}?�_�L���k���r��2����	��?R����Q
����W$ H���nb�KA�H���R�2�,��%Xe=����]��E;���x��r�hVA�0���W_(������������s�v�uM���I�2%]��[��P����-z�@7 r�%��1�z�]�b�h\����@`�K�U<F}�f� ��<3R/Y�w��w�W��5����
O�8Z��?l�v�6���a��??�k=��`2.�'�_��t}�m�
�������-��������X��y��.�_�WW�B���D�H��g�"�\v�}:�^�7�
����7L*T0�
�%�l��1��l}>��B�f��y<x�U�ke�#\�v���Y
�^s�9����u,��k]\���=�����8��|��PB[v
�(W��-_W����f��+:�lg-/�w�����E��w�����'Trw��d����<���`��M�p��^��7��;�;����r91��!0��6&d�l�V�U��~��C�o"�&c��H�X ��%`�G
������Q�u��
	�	�@0=p���Y��!�l�o��������={e�o1�[|@IDATd���|��#t���k.�f���e{�u��,$>��A)T���x��?��	Sd��P�{%DG�?4	��^~����c���-����_�������Z���z��4���R�P~;�Y�J��#v��-��d����k����]s�=q]�)CuK���'��,_-��js��uj/��qE�q���H^9�0�Z��x���Z��'�b0{�A����D\ ��������G��o���������z#���7MB,����H�{,����g��gvu~#p19��N @�b'P�|4Q6m�fg;�jU�������	��v� �W����[���{��5���J���.kk��u]��^���c�d�����~X7g��@��s����w����1��N>������u�����C����3��x�[�\@��s���+p�����s/(���;��������A����u���}���[�������K7��u�'V�����#��������-�?�$?����v�!j������-%KuBqt�����tPe$[��@�X �'`�G
�)���zp
	��	�	���C7�Z��M����Y��
���,+W����rp���+��:�i�c?>�}���d���V�6�n��c �Uk���.t�.���%�I��wY��^�Q��a�l��C�qZ���rB��T��c����E��m�,K��v�@�:�h��b!DT��v�F���s==�����FX�V:�)]�x��-�q���������]�x�5��8�m�X��&���������q�!.�V����[����/&c���Q~.6�Y�!��W���:!�B�2A�
��I�p��^�Yab��Q7�e4F-��|*�zS�x��
�8�
��\�?Y��_��s��N�eX�y-���R:�[��\O�>�,e��; �,[%6n����J	W*[��;qO_�������g������<���^&�WL�3oQ`R�-�j�i�� �����18���
k�K���-��������X>�����V����w�#7�V�Z�1�_u�Y&/���~gV�wE/��f��G���E��	$oa�������R��y��^�/1�F�T�E!��M�u���i�����K����^�C^x_�\�N��)������	��?R�M�&
_
���p+	�	�@�0=pS��Y��%�l�/���/�/��k%�"�!���X�d�-��9�������/>�4��HD5o��~�]�P+C;�������|;K��q��=��g�@�x��v����%���3+��;M.� �rC�}����L��=�+��;2��>���A_Q������w�3p�ga��?�G�}���_�x��"����z��������`IZ��<��x�Y�]�<�{�b�4WVB���Q3 pN�{��3�6��y���]yp�)������u�=y�T�	U+��/r% �����m�/hy�\vQ���
������T�t��M	��l}~�:'J�[�d�#\k?���u�I�i�P�^����Z�-xK��]�{�u,�\;������"o|����;��}�_����s�D {?{���"���N�o���r���o���b�UF3`e]�j��M^W�A;d2#��L	������t�P�I$��~��y#e���z,���3bfw������4��]{�K�d��������$[����X ��!`�G
�i��p���YM  � ���nf�@�H���D�"�X������{��Tt���=�|���^W��A��5�q�i��7���!
'N~��t���oe`��.��-D\;����qg�H\��6�K����4�e��������*���}!S��^��~1Gq����E+���-iT���~}��"�^�&|�������v�SF�����~n�]2����;��k
�`=���w�@���p�}���KZ�j���{����-�v�jEh[o
�� w�p�9�g;W�D=|"M����,d���c��p��5�u��E[�z��~_-������:�z�*�;����L������g_�S�?�ub��l}>���
"���}8.�.��_����j?����i*���q���<t��b`S����;�t&2=���t�����M�	9�W��t��y�%�G�8<9�x|D`�o�O��5�U�r�o�x�[�~������I����&XM>��`���w&�
��n��P5�>o��������MW��C�����W��}��ut��`���0��=�n!�z����sn�+�����?��x�*g��*���,)��da�r�	$7�?R�M�v���p#F�4;bfc�Bd��C:x�`���%H|Gy� &�)u�;�g�����U����G����?_^��%���t� d��4�^��l��MV����L��x�Z�-_�6�J$'�����@���'�w�r�%�U�r�����n�8�C�jP��)\�����w*RA���{��>�L��g�^_j�X�q�	�V$|.Z�R��N�5���P!�J��.���W��23 �4U��W�>S
���_��	b�����nR�I�Z:��K�r.U�jt���w_��m�p�
��Bu.UP��<�����\�d��v����S�P��#lA`��0��	��P���n�a%�V�%������A�E<�jU��S`��/���q��=g�u?��p{���Kq�~��/��������U7�V�e��:Yn�w���l�F�YXM�����������|����j��k�j.\"���w���F����b������h�'b``�������91�!��R9��{�qQ.�������Oq���[s�Fy�����og��W�����2Sg���G���eJ�1��1���8?�����6�2�eO��d�����C_�I6�/�M�?/�'��vs<&c����TVy��x2x����&��c����.'�D"�k��jq?��o��.����~;l��S>��3!�\��b�&��.U\�����g�5�}�gc"	D�:V1}����;�|�5����������[�QW���No�7�y&��X/���i����<����V��{b���d�NA��0�U�E,V���D����C�pL��0�~���$M����o�������k��i�e�����QV�k�dK���'�D./&d�]�@'7���(��&2#�-=	���n��?��j����?��s*�?�0��pL$@$~�1�z��Ocq�l?����
�'�`}��.w�n�
�G���������U�nw����q��Eqb��i���5��8q���8s�\9}��*���|[@��e��X���B=���[v���`��'�7��,�b!���� FB�(�����H�d���6q��V���
����%��/��F�M3�e�n�����J����Q
�D�R%�9�"�Np�W�3� ^6oZ�uh(7���b���Br>�O6i<N�,�&�GH�����M�6pkO�����`�����}�A����:���#��X��p�����+��tb
�~��g������_�Zg<�pW�g�7���o�dM����=���zd(���uc�x�[�m�r�`[��E��m��uF)������$��~}'�<<�~+�����sd��}��O�M;5^{4)��V��cS7x�x���]���N��4�TgG��;�.7��c��A��w
�q��~A�����R	��w����g�X����by���!+uF_�����<�S��s��)U��dn��,;&� \�Y�������������w�.�H������E������U�z���7�.����:V�L��	�@&	@�����t��Oe�Gx&/���G�W�}�����IU�����;�)
w���-\5������r� �+�^{�4mT+p4��N��O�IL�����.X�7����kC&Vr���f�<������-�m����]���+�fk�x1HP���aX��������������p�	�	�!��i��V������&-]�Z�~��~���n�B\ �"`�G
�	�(�,
�x���s����o���v���R;��H�bA�����+Vw�F��z��gC�V88���������Y�2y�(7����;��}�Es�=��+e���;o��qajgG���?V���A���N;=1t��o+���[�w����,���n#�D p��g��F��|��w2A]?3�	�`������O�a�4�<,`M��4����
sN��@�H�~��^h�#ZO4���v}�������,'�]�������$�ML�H7����e(��t���pc��g"�'!���.2����������e���u�k�B�>��L��g�������n��!��X������n�x���V�rC=����6�ip-��|�����Qxy��+�����I Q��O��`9�%��UsiuN�a�5��W?���B$�&���~��o�J�re���2R�Y��
!M�(��i���0�#���%��p�����S��ip�H ^_��*il�<������������K�B��A<�F�sr�|�P��\+��x�-1\/�C��n�x���V�rC=����6�i��h$W]z���}9U>?��� 7��n\K���L��@�C��Qj�������Q+����@*���-������zPE�
��[�+$D���p���f��iW
�����		�
�>���:{��wb��2sh�|�P��\��x�-1\/�C��n�x���V�rC=����6�im/m)��h(<Yt����;v��@$L�}~0�$>x-�dV�&����|4�;��_ �R���]��<p������s�1����+p���Y�H	���n���|?
�I��V�)�Z0�H$�(�&D3��9)�y+E�K��z��(������\�G�wF�c��r�����W�[$_�<NQ^�s�:}NN��'��'�>?���A��+��3O��l]��:y��Qt���I�&���>\����)�3a	�L�H������.���M������*�1.,�/�y�p�7q�bF����o�x�Qv�=�OA"?W����0�������
�D*���n��g������3{���������� ���h�h�����2S?�G\�S2���(�qbeY�~��Z�>�����@�`��c�y�(wL)9�d1���2��go��S�@�H�~?3�;�p3��\�wRf����/7��B�/����>�^���?L�H7'[!��Hn���N�����e�b�����`�c����~�f
���|?u�l��-�P�f���S�H.���������i*\�����)Y��������7?���;I�7�'��)?�&�L�8u��U�Y��r|��R�ha��o��^�Ac6��i���V��d�@�|rj��R������x���a�6l���6��V8e�s����[��*��a--k)Z���?p@�����G���s�K�����;MjV�,�^.���,7mu)�l�lV_�b'��9Z�����?��/�����������R�J�c)R�����_6m�&�6l����/�/�?n*[V�;���w����~W^�s���{��E��?���������2��?|�*bT��O���O��jA����;�l����/i{O���,[�&�)�y����s�J��'������g(�e��~G��%��D��!�����������������=�->$��y�r����s�s�n�}0�y��?d��"�0��~
i�g_]_���1�~���R�F'K�SjJ�2%����e��
����;����\��^�8A?U��rN�v��#�����X-�&N��ub"��T�7�����:}��I�\uE�U�V5g�dg�����5�~���\�$�>�u�����4��}�Ot]�C��e����<�
�i�����.�����$��[����[��y��{��=�=�w=V��QG��:5��2���J/,��j��5����|�����X�w����9���?�{i��+���T���e��UN^�n<�^�A�\p��R�������l��h�S>��X����RQ��#����������������/�����z�6����1%n���������v��z��0�����=`���ZR������p�Vk_���������V���f�z~C��|  ��%�>?y��%' ��H�~�������U�[W(#�u�������o@�7a�a2��1�Z:����0fR�v5{��}o�Ys�0�����e��a�%�
��A
g����M�(^T
���n��L���s�l��~�A^v����\�M��K+��GWN6����~�r������wk�����c~�u�|��?�����k6�����6m��&��5�����Op���gl��N�E�4s�3����IB'��mi$eK����y��W?�f�?�o9�yC�|�e���>��g6�q���W����i�u�'"���x|E/��U��C�q���k�j��~�d_*h9+c.���A�0���p#���%����E���8x��b����c�u��/��\8l��}��/������Zg�W
l}��12S
��b}��{/�z>��#^q��r�Y�Mv�/�:A��O�O������!r@=O�q(\����/s�����;��
%���`�\�:���e4vB��^u�Y�A7
m����{c&�q��N�\+��rgv��u���9�PB�����H;�c�n!����Sd���R�������d�[L��t�����w�t������lq�|���^�p{ty��w���|�/�����j?���L��^�~/�&�m����7_jV�������Xhv��se���X��v�N�ZQn��")U���co�.oT��
uY��~ �J������~�!��{��>�{���/�<������V���~f}���������!&�$ZY�q��O}+ �b�����p���z�>5�]��pi���bG���U����9��0�{;\���A��G����;.V�J)�����&:i������R�<L|��(L�	�]"����B!V���=�]��1b���(&WE*�����0~�W��|�Zy����w>�;�������);����*��|�Jx���d�������q~��M������M�x�!���=������$@$@�O�}~��!k@$@�H�~<�_y^���;w�qD?�`�J�m��hG�H�`�rF��R���&+�/b�b���C��W���]:F��(o�	c�_xOV�Z+/
�lor����0����P�f��]�:���c�e.8��1*V�
L^��Vj������0`x�����Y����:�l�����DO
Wv�(�x�v����t�Pv:R����}��M�����r��w��V�@��~0x	dZ�DS��c$�����Z����T�}���B�(�s��T�����Q����n���u�Dp[������Vh���Un���8��mZ5�Vj�h'#.�y~�U*��G����;��}��/����r��b���}�X5oZ���<����L�P;@$�(�h:���_~~.bja��z#Ii���t���=�����z�Fy��w��7��q!���_�x;?6< �cV�]7]�z�9�f��5�����3v�j������f�%���������!��*A������EU�}��w���Z�P�\���8���}�z������e��E���~q������r���\�0����M�0����ZD����h�����SM����������Tj7����������#*AH�(��w��a��M	U�h����\E�z�L�����������d�WFz�h�3=Yp1{�>fF���\h�J�zV0;����
��R!����k�����=�K���J����U�����Y��l�A�#�:J'�=a����E��G��:�&�����(foE'�`RVv���Nt,o#��"����b������������iB.`�c��O�����B=����N$@$����'�$@$
�D��1.x�����M�DS�8R1>w���
8�9�=a������g�=�Er.�3u0"���>&;�Ua�r�
G4������P^FE*�f��]�_?�����������]y~+y��� E�
���k6J��G�����p�@+������v�/�����+^1��� ���:F���y���m�3#���p�W������!/��x�����1�$R�z����)�FK.I��i�����6�:�%�2#|pO�Alo2�����
��p�.����4�s�k�!/����?N��(�Ap�j'X������]����1e����L[�����!2�v�oo�=�U��K$���
�+b�+�-k�e���e�����Nv�N�Y�:�:b&����
�2�7��������w����=R@��Y=����>��,\|����d�+YXl������{��j���`4f���<�no��Z�>���.r�����^�(�x����r9��<���Nuiw�a�d�������p�i[����2J�u5�M~���<X�����_���� �B�m�k���
>4�)��w2#-q��[/�����\���:,�1!�s�3����
��8~��]^����,�WC	�p��o����|V.P���/jn-�;��o��ys�+�bA�LFy��K��
��
��IG=��qI���;�p1���[��N~n<�^?p)�~	�z�W����6��w���>����������]L
�&�:�9V�D|L@�D$oZ�|�,�x��/[��
j���5�� �*��_���N$@$����'�$@$
�D��/>�4�?o��/��1k��xg'����u���7�1�K���/���4X-*16���vp�gN��pC�pv���7���0H��x���<�u�-��)��;�������Z�F#�f��]X�>��������s\�����"�������������2����+�����<~���/��(~�K~B��~�������1{?,c����7��@'�E,���B�������	��o�!�1����$J�z���u�?R���$<&���}_m�Oj^np�8c�<�kIX�!�i]���s�r���0'V��|�uwv�H]��QQq@��]"�
~#p?��l���?�����u\���#\ �}�).�������m�+���g����iN��Kc�~4�;�m�����3�0��XyA�z��g������pY�R�yA�&��;e�D�����+�
�Ha����ejM�Yc���k�U��
�������?f��M�TW�~��~�*�>4L� �~����o�c��)��K�KZ
��>w��������f0 N�E�x��P�G\B�����"4f?A���m�x��z����j9��<\r��A/s|����!����� ������\����.L��`"�l�)���nEW�%]��p����4���3��V�5p�W&�oo��� W�x����{<OvJ��x\���W�����fNB�~��Q�*��=�j`�8,����
�Wi\�s5.�+��>N���M�V�������z�������}�X>+��{��W:����%+�m����Ql�����[iv����P^-�>�C�w8O$�B�/���^�}}?�Z��y��Z��$�s�DpMY��X�1��/\�}�z�S�Q�,�7�?;���	}�h��z'�a�?8�{�l��,��������������?��}3��	A�h���^'���8�&!��)��k��e  �� �>?5���  �H	$J����g<����|/�1�	%e���&�b���yC<��:�$x���
��QO�v��*\���3�`�z�D�m!m��[-|�;�/���1)�$��$mC#l������x��>t����������mv�U��_su
�qo� ��)����w/��ti}^3s*��l�yi�+�^��>��].1br�>������ZF���z����c������p1��\�Rr��z�5�?a6�����z_�r(�Iz���������'}��(����7�`�g4��mj���_�?R��*�$9>�\�(�Z�A�*�+���~��MB`p���&���K���J����W$%�,\����v�+��\�>��������A����!}�s
fb�?���l��?l�EV�Z'8b�lx���f���]�Z�4��S�_Y���V��T�~A?\0��/��
,��zb��`�T����������K����@so�;�W&��%Xw�����X�����f�Np%��������������	C���v����m��>Wvyc��{"�;����fb��
B{G���u:������7��4m�\y����]��P.M0Q1���m'��z�4{�{�A�E���O�5u�����`V�_|�����+/�V�GOf\�7��>�E�G8��bH=����o��gH��8L4���~}��6���=��&�������2�~�>���?p`����sQ�Xa�?N���������\�{�]� Q\�q7��e*eg����D?���]�����w?����m�!�{���Mnu��1n���Lr&XY�E�*�t���p��N�  ��!�>?e��! ��$J���~��=�}:>t8.Ti�������0����X�CG���&
k���U2W��<x����.o�B�vNum�x;���qU�]9��]����7�/S~�-(���p��z7F+�f��]Ln�V�^/_��FK~	>���N�&����Z��`��{[ �{�`,r�����p���0.����F\v�1�]=��	
��gH�7��}#sI�6EYc�L�H74��9-��YL�k�/}-�h�3��q��o;mo��-��Vo�^_v	s~�	{�+�U�����~�^�p�
�v���!�t��9�@�}L����qw;�s���~e�^Qy� ����Z�v�C�4nP���	��`�>�C%?���_�V�&���5�K�^��%�9���3��������l����3��k!L
����X�\c�^p��#i�����{{�x
��|��2gu9��<>L�1U�#��Xs�\��5�W��W��c�f"�B����Q������X��/���u}P=��<`?�,�^�tp����7�=�H~M�[@�u�_�������b_'�e�w:�r���xF:�z>�n�>��"�V�T^'��`N��B���������N~� �'����}�
y����������&�!��wF�]&?7a��6�Y���=�?��x�����\�Tg��bf��|�%��1�sb�h�N����e  �� �>?5���  �H	$J�_�����7����O��M~�����\����H�~4y�l�i�9���]+�q���(p4n���������f#/�/���F@(3��6n�7x�;����f
�X�D;�1�z��/\y��3����;�=��f�O�[�����������9`����$�1��nS�.YY6�#��PL�csZ��5\�j���!��>�@���Z��1���2+G�Y1��M~�r��5N����kW?�1{
|�����pq`'�}��a���8p��S(��>��3SVX�>��
�WV\�O���E?w�v�{�� W��{��v��#>�3k�{�Pv��H��a/?��ui�����������s�7��mZ��I(�d{'X�=��=v�c
zo�!��x
��|�\���J,�y�����M��fb&���>��,U��&�=O^�f_��}�����e�y�n��B9����,�2�B������,�*�����{��m��S���oh���c�cC,���"\	��~�jz������sV\�ws$�.\T?�����%���m��Bf2#+m��NSLt:��z�W���;=�r��=P�|i���V���&6�v��������
�^r\' ��#�>?���5" �p�����H6����<2P\LB;��p$��G��Bv�~F3�p�v���*b�BH�S$c&���.w�|�}XTn���E�)LD����}q����A�vW�0���"����.���k�5l��u�xC/;e��{�m�K���J=4$B��J~z��:�6r��f�%��4T���7�#�h�%��9-�6kT[nQ��~i��-�@c������a���y�����v)_�T O�8Z��_��o��:�I��#���~�O8�e�|�u��94��o �!*���C7�r
�F�>�o93e���g��h�����	����EN��	�=��k3fB�������������N^�	~�����z�Y���r��wh,�����O���4�Vc[���m��~�L����p�����(�n�u}Zv��������*P�,���G<�{o��U*�������7F0�^c+"��I~�bs����f���W����mw{�����������|��{7I������S����6�����^6�~���-��antmZ�r�<�n�M���b�i�">b���0���G����Gc��}s����Uy	�����7W�~�Yp���G��z=��� �~��7q��e$+�l#���?�3J�>�w���<���{,�{���[�d���t�}����Vt=c3g/pB6��-p
��	�@�`���m���	���D��o��"i��VP1Zm��5����2�����-6�_F���9�	-���3����[
��
��4�g����v���*�� ��)�1?��h,pc������I*��������p���p��7 X�
�:�g�@�|�����@<n����������-�z���1��%L�8����]�c�~�����T��*���?R��"�d9<�\p���}���[�f������"�_j��Mwx,�+w����������D�n���}���pN��q���~����TS��~�YV\���Y��-SR�v��Ut���K,���B�+��}h�Y�I�p[;{����})�u��j!]�uh��D�n���#~x�O���s�����{(��|�f�uV���� >���\y~+~����p�|���> 3����{��O�������Q?�(eI��O��_
.�j_�w���p�H��N�\{���X>+�u�B����YG'@x]����-g��������	L��(y�@��� �F2���["��_h�O���������]����w���3k�-p�J���	�@�`���m��	�����x)j�{��H�]z�"U�*�4i��(6�XQ�*�� EDQ@���4�
H�������wO K2����o���}���d2��d&����7����/[\{�]������	H���j�.	��$�!p�4wN�J�c}�����
���i���=�yuK�:����!p���Nd�����iS���Bw��Zc�$}W��~��5�������Z���2<+�|�~M��Z,�����J\�Cx�,�c�L���a�UC����k.�������|r|��/hEA�p p#�!�6PLr����Z�a�[V����4�
�&h8A���k�b������@H_	\�[��UU>�d�����S8�u�5M���=?s3���d6��i��W�&��������P�&s#����n�G^��/LE�����h���'�BM��:�z�d�c��{����PU���15�q��.�)�.u�fw��O��^���W�f�.��hi��i'���W���C]Y����PX�x�ym��~H�� pM���qSi�z����&���@�����}4���y����	����>��m�^��jZ��e~�1G���~���u��h���X,Q�����km�,���p���V�j�_���:��p��&��������U���k{J�k��f��;�V�F��5�7�����IA����� !&�N�������g�������u=�h�<��+-�s���5Y��*�C����8���;#n�/�2�����So����-��/r�?\\���*��ef��y����}��c@�7�8�&�p��J��O�����7v�Q�k/7��M�e!��`��&#��8�� ��d�MT:=e=|�
e]q�@\�9i�T�
���d�����J������1�|@��
x�`�D�.�@\���"V�i�W=�p��+g����y��|b�eBb1I6mRJ1m���XC�[J �S���hC��%�	\�)��~E��,$�%�o�l����������s����|VPh�G�Q���]�#p��
Z�r�0�t��Y�����t��Mz�im*S���?.\�v����V)%2�OGJ�:��<
\_���|L�e��`��i�l�~����3�����p�\��CK�9��t-��}���-r��r���'?Z,�+����E z�c~���m�E�"`0!��>���y��C���C�ea�\E�Rm��3���������-`hy:��=����[��C���e�z�:����iRQ�7������:�S�R������s[U���
��.����~liQ�������fm�$�Q����[���������s���NI���%������&^$Xn>�j�fmtUL��[�������A;�=^���x\��c������T"_��N2$�G��p�-�0��ub8�N�`�?o�l����0��)����g��<Mh��A�4)]�0�C�;,T���|��m��n!q|%p�thE���.9u�"�5o���k$�uE!pqvMa��*��s��9�����A��'��c�V�'�?,�#~����~�6_KsF�U
�9Un�)������F�'�{��GL�m;��,n��kG����c"p��	��r�H<���i����Q�;x\�U���j �S��� \�������Z�y��{���9���������+�[Se�l���/�K;n����a���u�w���#�F��q���I7�col>&���|�V�#>|�
n��W-���f,2�}�8�[��|
��K�����g�X";�Gn���[,�@��q?y��T(Na��l	v�� s����6o��A��p�R���j�����GN����L2����2*~���5�k8�'�
���4H�!p���+�gr�wO���^U�M��t"��sPK�7r�]w������O����W�z-�q���^�
Z�������w��������s�H����O�W���SX>U�u�:�p��j���=.��?h���>�R�����~C�'�3��"�:U
����U���yPfTL������j�3~[J��,V��p $��n������i��������h�DBYW\����	e�g"���R���$p\�A�W(]�TZ��>�D�w�w����5MvN�>'vXa��IL��m�~&g'i��y�1���n���I�h���Z>�z��v|����,��>W�����y�����f�\��:r������D��/j8b���
#uW��#��,..�H��
�'���`OgpN\���8	 _�;�Nj�X��+�l!XR�&�	�/�.!�I	��b��/rS����5ij�m�[}��������8n"pC9�����q�G�\GH\M��}��	��e��k�����\�F]z}�q����U�F,��E *�c~Tt�m�E�"`��H�aM�i�q��e4m�"���sfu���&m���h������oM��n��U�~�Jj3�
R�����=T����7����&�P"�����������������R����)�h������A��)}:]��D�B�
���@������$-�l��T�B	-���kr{7}�����u/�#��?�R\|��/��%�-��ZQ�'�	\���&����bD/A���9{A,�������q@v���0�
�/\�����!q|%p�e�@}��v�����\���L�29-�����K�@J(��k��}!pM�sN�<+���T�I�V(I����:y�kQ��v���������1}����T`���%;n�lZ�b�VN���vi�X����`&^i��`6�)&7T�������}�?��U*_\��7�����S�2E�s���D_~;SK�y
'�4��b�t���T��D��'7}����i���9�s�X�������z���W���Y1M��?M�����(>������ p��f{�Ue���4s�RK���@IDAT5���-����Ln(��@?&������S���?W!�&0����eN��?���9�����=��/�0]?�j���<��;����.��(O���|�X,��G���������E��e���\�������c�
yG�����k5���0Ck:���zM($@����Kj�+l�������w
�����T�k��>
��2��L�:�y����*��8/X��j^}�=P4���:�&�w���`Z���UE-�����&�|����L�_����6�;�	����1� G{vy�Myf���M`����b��/U|]P����>�V��������y���-Y�u|��@�V������7|�:>��'I��uh��B�����e�M&p�,s���/P�����8���X���yP�O�.hb����)6w��Z�Z9�V���\�|�^����B]�@	\h�����L���Q_Ou��K�����b�
�e;�������Hyj����I����n9��A��{�a'�*����q��(W������E��W3���@3�U��T��T��I����2����`��yrf�w^F��.K���7���C�`S����P�ry�<�'�N�&�<O�D��S]�=�������b���,�h�O���@\����x_D���;�>�f�f-y�]��3
)�cc�)�v��Y��|$�����=���Z�Us	�t����)t����uL�������N������w����=3Nu����c"��u��;���_����:�mB�D����7��������>�V$`���{�q��)�l��YL
;}��D.��IX��}�{��Q����������2��E�"`�D
v�����
�X,>!.���>�����{h�_h������!��?@��|LK�D$~��5���D������X[��Ln���>����ua����I-vE�kI�X\�������}��Ak6��,.��	����#�%p��������5���Q���^�X���b����HF6����N����'h����H��?�k������&�PB�?XU`�����P����m,�1�8����T�j�M������C�����#�+{���5�dI�%�9�@�\�.v;!���S=�M���%p�E.B��'�kz��@�g�!����3�(����������nt��A������= =�W�`u�A<�d���i�Ee=.1A{��1����<�b���+����P�5P�,Z8/���N_��x���"�S����wH������uI�	\�$�v|�w�ew]S�/�,����MZd�����/�~6�'������~A(`3���f�Gy�Lq���:�?�
A�_r���AJ��+y�@�y���S���S�����8M�2���e6����<�iNu��q�u
|C���]�d)�	��g��=$q9�	�������N�Q�������(o0����Y�����
���~���y�����{�r��:>h`����<9IV�����>�W.���{�E��=�9�����G�������G�c�q~����[%�j�E������>R��y)����_�q2�x�`��S�>��q�l2��r����_��~�������7k�r�:�O�xz~�q�k�X,������/mK,��/���o�(v��e��:B'Y�$W6�������������-g�?V�����L����������h���	�`q���/�}�9zJ���w6�>X���f\y�� ps��*/��������Dl����,�F�uc��8�N�!p���+���]:���e\���T&���I�{�O���A�B�;w�,��OV����.��N��.���.e���D�bS6�a:�#bs����[c����=C��I$�>h�m���*�����<'���������q9>Z��� O|����fV ��a_�h����g�������	��5&1���\�����=>5^;K^x���K��YI�	h���������>�^o��t�Zy6�]����9��# J�?@5/*�ffbUb����0�q_V]�����n��7]�T���O{}Y!�I�1�aOb�3Oy9�D�q1S�8	��x����=�k��Cg��,Z8��Uq5t��d2�m������C���x����=��E�T��������N���kwGz]���/4����kmuF<�Lu�f\Nz��?�6�t`-Po-Lu��%h��6x���zV����c+7?(�:�/���'�P�"w\�x��#��*5��$�7?<����x�;�=#�Yoyql����>}j�fU>�d�qg���R�S����8'6x�eM��w��f���1�v�a�|<r�q��O������)���0��V���
X,����9[��6�"`�$�e��E��Ut-B_���gX+�z���U����u*�|�+���*K������M��XW�g=�������^]M�	
����Zu6}��{�yAh	��s
qz�y`�?��5XgA�����olq�q��ZvO.\��%���P �g�=���I��5�m� l��WL��M��5���Sq0����%pM�DaZ|��;=0�;�Fx�z�����8E�������QU���M���a���9U|��������T�-3���i��[��U����?,�B^W�����D;}@z�#�($��@����u�0)�����p]���D����	H���G�II]�W����oU?FJ���
��PM!�c�/�G����fV�	����IL���z�se����S����R�h8���6�L �����KO��~yB����3kry�sh�
���vL���n���F�e�H�v���#&�Y�p����\mS-���_	�t51����$O��MX��
��M����	-C"r��$p������>f<U�k��7Mh�z�#�����:L>{��C&�0�V���G	|�;?F�<T��}����0���4��a��r�	��o��Muj�n����7���e9�{c�G(��IV�W���
�`aa�d�J��_�_`���b�^=H�+�-F-�p�\W�U�������n��==���6l�X,������m+,��������U�Z����F(zk���v
�0���m����%��r�L�nP��p�R�����*���`%M��PX��eK) o���J\��b�����z_�y���K�����BS���z*���*V2���lv8�nv&����$��d?LK����CF~G��'�7�������E&�x~������v����)�h����A�v��7�K��h5M�y�����M��Z�F��t;fJ����W(�I�5�/��y�����Sou���-��ZQ�'�	\@x����~5~	�d�I��v����9�������|E��<�U19�W�;�����M�/��;��'����p�G���	2�b�����a��|���O���������s��N���P�f/0�g���	�Kbpv�OT��)S�0�	����w���^�&���� �;	�	����.��)5�5����7��Z���4m�bar��98r��K��S�;MK�����?��f�?�$|zm����mg����Y�����N/�:I<����'������2]/��`���uA�UfB�?G&��������g�z7�T-G�[7�#�;� c?��{����fP1A�y�q<9��m�<���	w���S�By�&oX�z}5i�[>$Dr]��*^P��yre&���'����\��dz����I�2����<6d��K����� ���*��:}~zkk��L��S�
�I9������{���{�#����I�)3��v�&q�h�$�c_0�0q��b��f
��\�ri7"��������b3��"�+mQA��8���g���,^��*�{��G����cW)>���e�����]��nU*Q���b�n�f����m3��`_>�����}Ns� �qO��7>�U���S==�����X,��A�������%��E��m��Z&�B�0g�~K����,���	w^�6W�RZ�a�O�g�7S�G�[B��*����}����/���0��P�f
�\�v���_����j]��jW/O�4�t��{�!�3�/^���`�B�U0=��_��''����u^�d�ko���Gg��A�C)�[��7�*0��a�.�>{�X��?.���d]������
����������?�0��5()�$%��j�J��CJ��M2��l)����I����,6xZSt�A^��_}j���ir|����Ex�p p��e�����a����+�����. :�@��6���E]#�_`=�M����i��&���D�S1�Bs����6��2t�1�9K�y���B]W�e<~�����V_������!y��y:�&�M&m������}za���!�i�,�ziR��>J)��3�.����8~Y-��0v��_+^d ���*�NA[�_����s�A�J>��p�Q�1����R�]~�����=/���x�A�`�$>N�='�}�Hy���L���/[��k��/G_�_��o��	d9��H���=�'�v�I�8"'=NW�%���-���8���xl�X���J������s �M�86!�����7A;A��+o��z�xWd�e4�Qg�x�����x
��,<��G�����)���9+5��x�`����dl��q�5|a�>�<Y�P������a�@l��U��*"|E����V����f��^?\����i��{�����.Xx��P��$�ys���g�cqL�z;��?��oG�/S��6��;�'�[���o���cv����5�,�RR�������u[�
�?~}{Gr����C�D���]o�^������h1���89�%�Z+:'\�����A���^��Y�����1��X�����M�j���q���S��]h;�������
���>C�A�%&�72�fgy��|�9�������,7.�U1���S��r����x`�(&	����>�+�XRF���S���%k��PF���1]~���9^�g�~�B����X"�����	^p
������uS���c��S�H*��;��Va:i�Yp�}��SLq9>Z7&���x8���Z��:Qekv~��Oj�
[,�l�*���K�_�b7����n����X� `��;X�Ph~����+���-X����DLf�e�h�������i���2i'��,Z��V�Q�r���d��SY��E�c���~�NaM��KOe���y���������Y2�y�+��c0�8�-�D�}k�|�Gm8��|L�\��������M�XG����<�C+I
6NBp���2��2 ��UB�	6&zlT\�V�-[G���-�t�'��Zx��3Y�s�}%	h4bS��J!���9`��x�{y7��0�g;�r(\�u�7���!�B� !�nT1l.)}�L����u"��h&p{�����_�!�i��CQ���Y��!��kU������������dr�c	���iK��%���K���Z>Q�jT-���L���DI�K{�"������j�`i����������};�L��#�N��_��M;|."�����B#���ng���Y
z&K��W���WH������7�d���~�&��6�gK���FW�fu��Ma�5^����M���xB;p?�P��RX��W���O_L�.���snU&���W��=������i���J�wv�)�x��S���q�@����	�r�e����~e���$����7"�O�-���%p��Di<Z	\�`�>t����E�"`�X����X����C�>���g�Zc�6����/�����|A;�����/|�G��k����,����"���"L�A	.dV��*L8z=9�������
����/t�(�V�����6O>�U�����|����g�
,��������;�U��[Q��9�F����
����g�����y��������zN8��S��[mc�����.�p[R���aAc�/�/�����K�G�S��<np��|��9�g��O_�y����/y=��&��gD�&��=��M����a����&���)��G�<]���lgNd��DT�"�O�z#,�GK�:���x���^hJ%���zm����m�>-�F,��E �" '=��M���my�"`�����H�9|ju������.�k=M�M\��S�H>��R3���[v��c~p��=P���������	[��> ����PMi�x�Y����ou�����V�Yh���`3Q"��~�����I����!���C�L�EZ�G�Q���=��S�8��q6��?�������&��m
���P���1?��$j+"��s�]���Y:��y��M�{���?Bu��w�V��V���Y��Cn������`����;,@��=�OG���H�/2����gaeBTT,D��jR��Wq��!/w�	���~;�����~>?Mr��4��h+�x�z5+R�zU�Y��f�O���{���=s�t��&�?��yC��W�)o|��q?>���v"P����s0�&�>g�3k��-��[DR�:[$�GK�:���x��=�<G9�gv��.����'�]q�X,����X����C�>���g�^c���q_&�M{��E�����w�WA�u�=B���zb�������U���?�?U5���m����T���s���j�H���W�S��>����I������f�_&^�|��c\����1�� b�u	k{���W�j������9��)�*��1�v,�L`�1_�:	1�k
��I�$c�5&q=	6K��*�L���Dg�c\�p��m�w^���,�lT��4Z�a�L�~A^��Te��_h��-jRT��
Z�*�d#���5w)��l-A�V�W�e�@O�K��t�n�C������W��\a�h��+n$��}�h��E�����XP�BI�wGW�Ix�O8����w^�Z�<|OK����X�6�����"e����+���d��/#I}��09>Z��L�����-W���t1I�r�
dU����E�"`�X$r�c	\����D�����J�5uj�hW��8����Z���}Y�'����0����8FZ�i���H���F��2
^��U�� bU�y�^]����t���������\k���I�j�'�-M��K���S��E�m���a�5	�b0?��75�]�!|�j�a��L�:Z�������n���F��8iZ�j{|
�H������|�* p���a3�7���nU��h�X�9��Y�66�����
��8�[�L�f����q���d��#����E8����~8�N��[Fv��X��A�~��U^i�����������o����wjnM(�X$�p����T)�S�<�i����9}\��\�H��E�"`����(�
�=hk������+��I����ki���y�WK����
\����U��n���o��j>O~�?�l�f�Z=�[�M���m�H�{;?>��1?>P����H'p�=^�jW��A�8�b�P	\��0��&������[��U�<�a�f��2`�x�V�S`}!o�lZ�gc�u�vhi1EPF���hkJ_eE�p;��{%��%fS�0O|��i��v��6o�}����|��p�b��)��T&�G��+���h&p���l�,��E ��I����%��t�@< m�/�2`�0��Y��y�)�T�[� �����03�:�]��"��nl��pX�Ve*�h�5o����CE���Z�B��.�V�l1j�t}��3�.�i�ii����hi���ac~���9b�|���I ���8���$�����U]P�p pc���/4����k��t������FBE�r���
e���S�v���4��9~]�����Y�$�v��E�i����4D�
x��9�������WT��q��cm�E�"`9>Z7@#�4K�FZ���Z,�@���K�Q[�E ����ZU���\9����T������pI��>r��0��m�>��bb��4'��ME�S�.��S���s�i�nZ�&�N�����jP�2+����?D3~[��a�3{f�V�4e?�i���Os�[�����W�w��2E2�\�z���_L09&w��X��V���fw���M�nF��$����g�b��C%��gVJ�Vw�����������X���3��d���V������'��'h�+������na�W��Qv�����qY�����j�������y����0�r�V��so��I;v~�`76����@��mi�H�����x=~��
\�)����N$|=����E��y�Sn�\��
��H	�4g����~�t��d��)��k7��:u�������L�U�X�
�G[R�O�dt�}������?�� ��M����na6��f�����u@�s�������2��x���ox��_��`������[i��9>����{�W����g�8�o�j����1�]���a�9SzJ�6%��81��g�(?�y�Y�|�T�/���xl-��*��<�q�,��<M��mc��m>i���8Nuy�lQ�U]Jh�C[q�#�`�+��!�������`^��E)�m����������8�����~������IJ:�W�y>�6���M�'p������{��1�?��>V�A90����|����<�?V��.��^�tY=�-�&u
z��-E���}��6g�xg #���������([�����DKWl���p+'��x�w�;�.�1��<n�,u��E���3��n#������I[W���H�#z[}��E ����%p����U=K�I[�E�"`�Dr�c	�H�9[_���/������/<��r0���>g	�����-?*�L, {(������\Avz����Y�n;��z� ��?���3	1��<����IY����,����"
$.��JR�����:i�<Z�d55~���������c�����MS�Z���:Ua�����2g1�����F��.N���m����k9�f2A>���`1�V�rb!��G�Tk�;un��Hy���������h��ZC�>�������\�"n~�J�}����P2����{���M]������W|/&�����E��c20��)�\�q_~�A�b�C���i�s�[&3�[��2�;��h�D���L�B�T��n:�����M�v�O3����$��������Y�>{1oPY�5�sl�q6���<YR��oG�}7���.���D��#��L�"�WlM���6�����5���R3�N7]#�i;v�A�'�_�/=���vf���I?����wr���a��o�n�����ZQ�$��U�����s��	K���t������e%��c������nh����C'fM������0Te���4�����|����H��W[?��E ����%p#�/}j�%p}��f�X,�(D@Nz"�����M���@$>����3-�����}o//��7������f��/��T0H<�U��4����(=�-�����F�@���W���g�mi�����=����]�'s������v���I�������" q���95b���������a!t������F����k���~�J��UJ�0��:�%(��[?.4���vD��C�vi�"�+m	mU��!JP�\l�^l��jz��R`���������Z��J��k�]{���Z�c�e��t;q��z�q_~�A��N���uu�5��D�:a��������'��|h��������K��R_�����*�����
)������^�Z�c&�"4����i�
��T|�=��n�\}�`����/:��)�����I�)�c���;D>�ZW5��k7�������7]���}����^6x�2���`M�e�3Y���h��}��6l-$��9���g��P�~�sJ�NOQ�|9��x����K�4"}�z��Y}S6Z-\�F+�D#��3���f���i�~4��m�E�"�������?A��%p��-�"`�X"9��n�t���E@A ��_����TO��W�\��Y�D��T�(�?����09(��_��0sk\���0�d�����tS���y����z]�~�M(�5q�<*`�d��`�Y�uV+�a{������z(s�t�����$�W�y�Q7�h�.�KMm��`�@��,�U�T���N��>we�S��0���lWF%p�����v�;i�d3_��0i:x��%:���aFN�w�_���&��	b�P)�'1gL������#��m;���#'������7�u�H���l��_��M�m��M��y�����w��K�<.LX���~����
�b����u�
ve��-��{�K��n����h��{���(K�tZ-~�� F�J��(�D��_�fEj\��G�/\�"��e��?�N���:�zu}�M�Pd���8����,�'���G'��;�4���Z����l|��j���L�:\�����?��J����J/����+�8q�� a��_	���{�)<��Mu�z�:��7CR�6�)�LC���p%�?�S&�8��z���4l`^np_b|T�(��/"lHp�s�S�0��1�$�Ysu��)�C^��x�2q�U�r�L��5���Iz�A����������c�el�������������u�H��{=��E �" �GK�&�{��	��m3-��E�
9���46�"�D��l����tiRj��@�2�Z�i�`oX�����\�}��p-����/aVS�����jU{PFo��">�0�sO�Y��P��3�V���������&���[w	�� Z5�E����Y�~��{3��:=~����n�{���
�yjf�a� �9t��0�{�!A������+�S�L����S�0S�����a����Uu���o2oi������ o��������4����3��#'��n�����'�OA�����N���ga�|��]�r�V����'��>_b�=��T���:��h��U��O�bjv&�V��Uo���?�0�
MpG��d'��*;�(��c)XT���Lk���S����_,B��<��/��A�����������8��UL�<2j��#l�t���(kj�=pX�G2�����0�g�j�T��k:7l@��S6���g�������_�k�����r
S�f�� 8<i�;�
�x$���[����nd|9O�gb'���p�2��*��:�_����\0��K�����a�IFa��6A����L����au�f�=c�d%�a�����B��n�%�x��<�*����2R.�;46�[��|�{Y�|_acX���~OHp/w����������	s���.|�����f}�oN��g��x�i��k`��w�?��3�gx�q�;��D��#���������a�
�1����SMk�9�z|��
bs���0|2�n^������=�����1�w_$J$�������E�"��������~���~�e3[,�@! '=����N�MI0D���L�O�\����L����hh�b�����@,�C�&������F��|���L�����F�!�~o�8h:�bZ��e�L��eU@}���AP;��v���.�3���D,�4"��^c�P��ty�i�V�;}?"V��R�T���c*��!��^^T
��>���t�����������a�����k�&�\f0�LW���')^$uf���l��d���T���e��=>��EH��y���u�P�Y\����+��\�c�O��j�@�&�x����[����������$�B#����BsL��gl�N�7��`3�I*�-F�>YW��i��y�:�{���K��1��K��6�`����������-*��o�|0��[�-���}�
�!1bs�IL���}����7Zv����2��W]|%p%�������*0���`5)V�P�����yO)���h����_#X��$ q�xE#MG|�3��t��y��=?�������$<c3�S�)�X�o�z{w�����5���u����Ap9o`S7��c��I�~�ci�g�Dr|�nt����X�#4��E�"`�D9r�c	�(�h���D �����_t����@��aU+�i��iG�����w^{�����~�?��U����E�I�}k�d�o>U@D��o��T��+�0������h��MT��	E�	ed���W������!��6���7�uk<DMT�����l�p��t����oW�th���U3��b����b�$���@�J�`����*��I5C��EA+�@���T1�L�
��m�%F7���%��7��|]c�h�;��t��*s��8������������X�,;��H�a������oM����_�����z���������Zl�8�l�1�m8?\����K8���W�w��+�)��z#6xj[k6e�0����r�V�G��k������g���`��������F��_�@=VE;f�a%�����y sM�<������-�M�l��S��J��q?T���Z,O�������tK�FY���X,����I�%p}��f��
����D��o���Sv����.]��_�L"�6a�{����`{�LQz�u�<�8�B��1?�6�k�9gS����lF��*����*����>�� ���sn~���
}�,c�����U��K6u���.�D��-Nr���6��+�����rL��0K�����"�*�%E��D�z�'������]������6��^s�����b�f��mP�����1�r�iN[9V�H�a���f��Sh�Bc6����4�����z����DG��u�
��V�9��6��q6���.p�g��+w����f��]����P��5]I��U�J��i��������Mc��Q���;��y�1aMC=���ki"�G�I@*�(Z@�G�3�[|�Uk�����6��D���8��Y,���-�}}kl�%p���D��E�"`H�I�%p@g�&F���g����ka&�k������?d<�;;���<9����g�<�=F}>�JK3-�:����S�}�VV��d�Vf���_�Q�����������{��aZ>5b�:�FN�!��I��T�p#p��0c����m�4��oM��M|�=��n��Jh�x'N����%m��"p��266�_���J��D��r��s�X:t��02��i����K]�#�s�'�#����K�����:}�z����A�3u�"�_ZKT"07�������?�'%+������gD���H�M~��|��~Ch/���]��M��aMT����Mc[8���U�H p�9���k2A|��Bk<X��7n��\�X~�$^{�SW��V=`�#���CN���|7���>����4LK�2}�b������,������T�Pn*��i������'>��2��'F���@�D@����M �o	�������E�"�����X�
�`{"��5-�L�
�P6��a&&d���MR�L.|��y�_�L���y!����i~9�2L�/�.!���DH�J��D����>����i����Z�}f��0���u�������SCL��?a�?X�����?�D^����{�d����T�!�
?��,Z���?GM�)l2um	\���9���3�f����l�������BS=&�i{��W��u�h���\I&7���P����6|��G�I��.��1?.\�5�1��p���h�L�/Z�Z!>D�h�d+����b�Sn��1��x�2��+UbK
����.���J��g�-���E����Q�W|�#e���kZ,	9>Z7����M m�i�X,n�I�%p���	��G R��Z��Q�F�5<7o�CCG���q� �y���9�dI���^�.c"NM���MGN�1��,�k�C$m������n~�c8�u8�O�w�:]�
 -�~������1���k�p��x"p;�����8��������v��G������|�X��b���M���1��~��'��}����/����v��r���v�T������!��K�91`�H��r��}�����{���%�{_.]C�N����w����G�a�*���l�g/
����s��R�%TqZ�P�.�.4����+��@����H��{k�=f�XB��-�
t��LK��a��*Y,�@�  '=�����E,AE R�_hG�PR%&�O�����"��K�%���+:x�����d���zw���CGNR��_hY�]��|�H ���}��b3z�h>-'�Q.[���i��g���7n�~�zwkK�S%w���NOS�|�]u�b�{|?^�v���-����h�3O��&-L�����M[w;��5n	\���`��u���%���ioj��)����_�R��~�c�U�R�'��-\��_���7�c~ �+	��%��u�HDlF>v�v�z���p��e�,��[�
�1��X��+���hT�jV{P;�������/��`DL��C�Z uU�����Zg�X,q��-�h��5,��`�`�X,�����X7^����
�Hy~��A�:=���������`MkP��%b2�:��)�~�N/g�(��:��T��e;k���5G�]���+������s��;/i��M
��0/�h�������+���k"�~:�v�����h#p�gN�H����>��"�%p�\h�A������<6Zz�w��]�&dM�����L�r�CN�����7�c~ �+	\��xo�-R`��������L
���yK@�}4���
6�`C�?�
U�kG�U�J���j��3-���?��-�	�2���P�E�\?T���Z,O�������tK�FY���X,����I�%p}��f��
�����+|}�sO"
�@}r��U�=VE+�aa
���H-��ZB-�kF��K��x����Oy�w���Z�#&���4�L���&pM����;L��~cj�H{����dc]Cp��%4��%�����^��HS��7�&��f�6%���<������b	\_�
<_��y��W�k����������<���y�}K��`.-+|����$0��A�W(]�TZ�OpK����#>|��M|�o���o��hw��%�1?�N��W�_���)3�IZ�C�&T�8�cU�����?�()������;g�|�Aj������!_6����������b�b2I��������t��Y-��*?T���IE+x�w��	��/�<����f�v_N��<�2��7���-���-��@nK�&�����X,7����n���@�#I��+mQ���5L��8�f���5�@�$�����$�e�>mj����������\���'�$*�,��S?6��k����o����B�%p5H]��O�d���+�����i�O���@"�-J/<�@;u���4��9Z��<��7U*��Q�l7Y2��<�5��g�xo��pIK�h1����v_���S���l����ND��d���hW%	\,�wc������6U��X��~�����`z��n�=}�Q����r\d��-���m���@��t����>��T|�=��+U���c���t�T�P��iQG;t��9��o��]����
�3���e�05n*�^{�,�c~ n�
R��������b^�%*�FXx��1Z$I�<)k��7Zh�w��T���]w�E������G����Y����i���a���gR�B����������?X�~z�����
[v6��ew	�H��e�U��Z��Lw�;���@���H���3T���@C@����M o	�������E�"�����X�
�`{"��-R07���S��l���4�$k�b�^���;Gq�sQ�����i��������o���������ba������������ki���j	\7EB�*�`�D_0=s�}:�����H��������
����h?qZM�*��W%��!�.����k�8�A�@#��_{��`"�A�����_W92�+{AF��t�����]i�������F
��
��N��v�1�wh/��(;~Z#�31�!}��1-�?���O��H�w��1��cU���d��/_��:���6������I�B���}j���SZ�0���1�C�}��S�6;@�����	ZF���@IDAT���L�Z7xn��S��T�	n
��v&�=wAK�7�1?�����[m������ILln��F��=�v���;xs�\:�~�UA�b���
	�|�{z�oKi���;oQ�
$�Mq�x����
���e7��l�0q����H }��9���mX�
�K�������K�Md�x��C�jZ��\��J"�O��s��)��q�������F�*��J�����d
�	���I�~�ci�g�Dr|�nt����X�#4��E�"`�D9r�c	�(�h���D ��_��1t����v��1�����#[&�"��8��Z,�C�����#�N
&�rd�L����S���K����n�(���PK��=p'���	S��}�������6n����z�s����y�2fH�e�t�
���o6�x\�}E�!��$�'��"�k����X\�y��wHf������f���^�{�0�g�����>]j�����D��<�I����h%p�>�R}�7d8�&����` t������4��th���e�p�-�PM�p$����b ���&P�����S
��q���V�e~��2;�e���M)S$��$�����_��k���#p=m��Y��l�}����bc���S��-�1?������*��c�i�����SL��C.^v�	�-H\�I�k�n�R���Y�)CZ�h�������O'���=o�N0R�I�6O>�����0)}Nh��^����GLr��"�;5pe���{����t�7��H�T������(��
��,�%���p����"`�.��h	���W����Gh���E�"`�r����Q���yQ�@�=���������=_:g��e4m�"-+�;��������c�\�~Sh������up��n��=���Ec��+�FIe��7?��	<|��Z~���Jf�����j�����P-Q�t���
���y��W�X��B-?4�F|���H]P@�j�+�Q
�,gd_�3���yL�����+=-�����n�07�;��	�&i����T��8}!��p����� �L��[&N�;x��N���g
�m��$L��+���>V����S�2E�$�#���)0
���`���w?Q������+mS���b]��+��u6s���.4DD���-K���)&V��6��3d�w���>ULc�z�SZ�xFL�w��i�
U]L&�}�X��BS����SW�������6���1������P�\������7b�����b+!�1��X�f;�p���L�
m�'�Ws�\���)<�5pa�X�-���	��S���<����v����I������[�Z9j\����;j~Oa���08�ux����6��7^���@�A@����M }n	�������E�"�����X�
�`{"�����'?*L��0���K���_��� ��Fh&�"��5n������U�oO
��z+�n��C����Z��a�����45���m�R�n��F|����_h��o���\����o��u���z5+�N���o�7��oPh�����VL�N��w���oY��
�.8F�S�Z�����x#ReO�ysg�N/4�J�`M��GN��Z��H���6�i�����Co�k��7�k�w��_N����a[���4��d�/'���oR��&bK69
����LAL��j��
��O�u��]{���SW������^3��
W�����,o��:U`��"����nm��/�����H�����gQ���}�����XH�Vn4�/�@!~�3j��N��u7��8�#9��N�l����=�
D����o�<YRz�Y-���'Y�b�0������P����=���
�K�L��S']]�����q3�-�A3��~���_��_��L����Q��l2���4��%n��������6�.M*j���MUNk'�������d�S3��y��lR�a*U���9��s�K�ZOSy3����
y�����H��/{m��E �  �GK�&�>�n�h���Fk�J&��o�I+X� ��t"[y�@���K�	P[LP���E�S����F��u[i
\��]J��RH$?�X���`������v��,���ZLZ����������s5a�����[���1�G�sde���0�2a�����[o��+m1�v���z��u���s���f��_^�����)�����P�dI����L����'I�*9u~�� A���������Sv���5Y����)9���	����o]�����H������4	�$��P,�6`�>3gJ��,L��Z�����;]�tY�m�j@����D��b�`�%XC+�+�l��&�4e5�a�^�4���.M}�H�_[����fEM�h����9��f�^����a^���7����\f��h�$c�Mh��_�Z�b���6
��I�{����n6�c��q%hkY~�+<X\��N��j=@��;��,�i�H���x�q_W���<��/q�f�A�`,@���TrE��J<��yv
��%��+7��m���.g~��pgCY�r�8�g���	�<�[NLA�����+�0c.��}7t�w�yy���P�� !�?�XX��_�T\�d��Uc|���)���]#-�\�y�����{�T�����c�S���<J��x�~O��@�-���&-�����M5��3�L��x�.b��+��G=����vE
��
��y�l�[������y��q�PD�#u��pH%�wu���L9>Z7�t�%pHG�fF,���.�[jX�1���a�)��"�����xD@Nz,���`/�����(�1����M�	���bQZX<s��&o�j���c�i�OE,��:���N�>�Q�[���x��i����<�t����M_$F��i!�-?��iS������2���s�.�<O�cr��)���Rgy�#� ���.��>��=]�|E�tj��Z�?|�n��I���?7
W��6MJ��lh��DN�e����~��p��Y�	����w�2�J���>]�x7I	�=E�d������<����
c~b�p#���|�	�8�_����?��z��x��py�!��2���8��`_��S�j�3��`J��|�������`<���!��c'��uo�=�(����3���g��w�b�0��s���.����;O'y~�MG��4�8o"J��J��Z����-��uN�v���h�C��-�w�w��X�����-�Y�pm-�0t�D�@� ����T�j�k}�������M�X|F@Nz,��3d6c ��[[7�r0����`������z	���+�����E N�c~��m/f�X�;��{D|�wu�w�m���h	\E[�%p��Gm{�h����M�����3d������%�;[a
(&���chc���#^����HG@Nz,��=]�����x����)�7��&i������"(b���,[U��E�"K��K����E ���~�uXV7���`6�S�:�����a�.��#����.r|�n�,��@:�63��[��"e�/��+d��o�dIip��t�=�D�u���3�/q�8���)aN'����L��W���,���K�����7`�P����F������	��j�{�Sy��n��H�%[G��E�"��m)��E R��~��T��3���`��x�|���f��6n�M������@l���%pc�b�k	��,[��@ZNC��J)S$u����S�c�W�S�z�2��IM�������5[�4S$�4�q_&�k>M�X,�G@Nz,�{,m	�E���/���F&��5&q=Il�K����t��FB/�:Z,�� `����hK�X,����#������|W�%��]�D!W����������h����%p0�N�n����o|#�8�=4��7�j\�vC�&�
������k(�&Y,�D@Nz,�K ����@l�K�Z�X^�>����n�X";�GPg��Z,�  `�� �h��7z��,���U���C����ki6b9>Z7�"�K�F`��*�+�Y(�n�v���E�
9���46!���{)���VU���Ab,�@�"`����Z�0��E�"`D���FXlb� `	�����-���o�-��/b6|!���A.���2c�PN�]w�O��
��b�k�2,�A@Nz"��1 �ey��@�����[8�u��{I�)����e�l�Xb���c��=�"`�Dv�w��@�C�K����<]���&fY���<�5O��CA����k(0��{��o����%p�����YW��
���R�By����3�}Y2���n���g��m������HW��T�(�]T��BT�l1��1�O�JL$N�9OG���e�6���������i"�!]j*Q�eL�����/[U����h�����K�����{i��C��a�|9��{��y�S�Y\���� �;pD���>x���wx��%W����Q��Qv�����f���.\��:.�S����@�G�������������)�Q��JPY���1CZJ��^:q�,>z�v���~[���_�!����p�\T�R)��-�O�Z�r��Q���h��=�2jU+G,F�/_��K����[]�l�"��IO���J�k������MI�&NL�N���'N�������x,�5^��'��f�J�cZ�4�(U��b��xz��iZ�n��W�^��(��3��^^���������+W�y%y����L��k`�?�u^�v��c�:N�������T�����R*Q��������NoNuy�lQz�b)��{/�`?�s~_�U=����=�r"9�P��H�#[w��E�",��,$m9��E 2H��~8�����#��w��
���_�������|��?��&��]w�1^�<t����������0��b�!�}�)'�e�u�����5a�;�\���o��U�'��jge�'K������r�,����sO�:�k�gx��0��t��/�d��P����u���i-W���$��G}���(�n�����h�?{=���f[�p\���}[�`nz���bMg��u4g�_j��A:�hTC�m���f���s��kxP�����������
5{N��T�8=��� =�]�����YDx�y��E�Q�'����{�"�A>�7�v�;��ww�D�Q���2ERW����K��#�K����o���4h�DW��s��y�\�e�9��j�O
H����FY2��7����iR�Cn�����/'��}o��nn'����=Ayr����g�]F�f/"����������7��^��O��X�9��f�o�x�im��7�����m����x$:�1Y����T(N���	 [��^���J�!-��&M�G�����U�z�*jy���s�������8D�kU���Ta5�����2g1,��{6\������v���N$�O?�H�|��{i����m�>W��H�o$����E�"`&v�&��,��E�"�$�q?���{vy�r�����?��K?P�^z����=�:&q��n�W�
$`��5�FU]y�T��ga����n�c��v���]��B��yL�C������N�9'ul��J/�f�)<�o��:�Sb���,+��%��}����4�W{J�$���_M�5�����z5+R�zU]y�`5��y�x4��h	�h�]C�,�k�&�X�~�I-�R��_uX��� ^M����LM�r�i �����/��ht�{�E=���'�>�W�]�A��=Th�c-��E���C����_�I"<�g;�h�v�C�v�'��vO�D@_�x���IW<h�a��k/�����5o9O�Q���d��;a��z��f#��@@Nz�����}��_�bcL�����V��"�����6�&��qO�uw��	��5��h���N;u#��)k1	H�w�.�x_�9���]��Z~8��"[,x���j�h���4��[!�y/y��],B"���FH�jZ,�8C���q���E�"`�������jT����u��n X,2�;c6(���$j>V�G@���qnL��j������z%���C+����B�xh�W)�C��������:�]�V��i�#X��k�����V"�h�B�6�?xT�Ki�������{w����=E=>c�=Ir|�n�����X�+<�`#�x���AU���W�\f*R$O*Lf`�V��:v�F<X�~A8�y��&�a����5���id)0���	�S�o����������$0�����g�I-����<�������lF�O�s���}I�G���w|��!�q�N:��u0E�4����(?�������������<OY2�sqh��gS iS���(��g��*�|7�L���<6l���'	\�.D'�������`���3a��������yJ�$�+��q�8��O��y2e�q����������I"l�0F����5s���<~HA�&�U���Y&��l�q��	��T�X��i^G�>�M���WWZ����>�$�P�
��@4?��-����E $�1?$��B-�@�"���H���&p�iS��}���4�[aB�yv������Lp��oy(�`}Z����{�>�d��D�|W�X����v."7y�s/�L��	��9�6�g�����n��a�*�!��)0�����v�-"�:���i�#�������]�b}��K0C���Lloe�����a+^wW�DrM]M���-�M���-����=���~�
xM#���D���*�F��� ^h�� Mv=��&ka��&u
����&='|����'�d��1����3��e��_��*/jL���n���iW:0<p�W�c�Wr$k��%�h����+#�U�l��[I�D��U���^�>��/�>n�R��	���D�u��G�����OH�-�=Q�a�F�?���S��������1��	l��I,�kB���r�m.L�w���1�bS�$�������G�\0��a���-E~� ��������'��`&��������R�Vg���"l2�c���0�>�����=H��l�����)s��2����+�Z��fa�~�aV�dv�dF8�����f��%
��D���]b�n�XB���C�-�"`��%	m���ox�Sf,�M[v�o�dI�P�R���C��^]g��]4���{/O�l��w\���^��y��2����y�s
^c����x��x��q_��;����������P��j�Vm}�@�b-�d���j^k��g���a-k(m��V�,\e"���3�_���k�����^�����x[��m��`�� ����"]�p��I���w_��������[lY���uA����M�����6�7�'D�nB���l3������i�IT�>��~9�G��y�����wS�����U&��o������M]I 0��9�m�O���#?|��� �w���"MD��R��g��V�IZ��������Z�y�V."0��:�t�=����@����m���U�h��Y��C�b�{�����L�����H���F Z?
;�� E�|�g������.a��=Ud���|�����+�h�<��+�9��W�s6�G�Z�SMk������0��������s�qv�]�x��@#�l�O�L�q���k�th�B�U�X���Q_?4e�c�^�eD�o�>�������E ��1?���-�@x!����H�������?��e�P��@��d��������F�IL�����kki�4u�	��*�-F�R&��K�������Nk����M��${�LB�h;��	d���!���-�^�	��ulO���/4#X�-��}����5�K�ym�>w��2�*��)5=��r|������c	\���8G ;����V[��s����hi�D��:�����y����;����jy �A$K���o����2*~c�P��5[�O_�J�4�4l���&�-��	��2*~�}�����n5S)6;���j�[�
�O�IO4i��N��>b �Yx�k�:��[?N����b�	6�xl���}����/������<}L��9|��d��I��-c�#��[�!j�������s�������p������y/�2"�7��H�[g��E�"�1?.P���X,��@B�����~*�&��1�awk���J���������I!E�������O�$�KI���~-����Z���Z������
�}����T�z9
��+7��I��4y��g4_����)���[�e�h����%p��w
m���/����h~A����A��c���Y��F�k����,�j�i�Nl�����F�����iDeKv���fhf�M�,�X��&��������a���XW��9��&�Z�R���
,�o�����u��6�pj���}�E]�R��v��ih�+]i�>&:t�X3U�:�=�<�2�/����t�>T�(��t5��6l�/�|�*�T_?4e�c�^�eD�o4>�������E .�c~\�l�a�X���4�G�7�k�~*�4y������R�H�������e~
��A�:=���k�Y�g��mt��
�XL_���thE�����k�f���T�6y��!�/��ulO����`���S%�X������
&e��3�����JC ����G5nan;!�-��z��h	������[�g���j�
Y�=J���m�>��4����4��cph�g�!5�[��WW��N���*�Y(�����^'&S�����KV��m"b��v���\y<�l���/�K;l	\
�G��'���
�T{����_��ghi�D��z���������]��M��a����W���){)��	|����0����k�V���z�3�W������d���m���U��.�~h�����$����h|~#�l�-�@\ `���@�^�"`��	i���o��.��=>�z����"��M��q�W��l�.�S��w,#��_�I���K�w��������x��wpp��{��)"�"�(���`C@E���(� �����,�(((�* ����8�������M�6��w�M��<�$�L&3�I�d��yg�A�$��M/X�_�LL%�X�;L�JLJ��;��s��}���b��;2P7�����9B�^��n%?�>�Y���3��$�������u�x�X78l�2W�g�W�}�����5�\3*����hz�w*%�)��0o����_��WQ�&A�������v�����J��t��C�}q�h��J�
�=�����Huj^�?L?|	&���>p;5oRGEp�#���$�{���Q}�����`+�K+DC_~RI�s��������������d+�l��_N������5��<�L�1�[�O�L�_fD�
WY�������sI��K'����(3�d�����s���}���;�?<_U���&����-��N�_�%1
7w�%���k6l��K����6���,����/I8N�����������<P7�����s�JYi�P^�d������=�)�xW_1
����Z�"��i���n�eX�z�����&�����p���^�����a'�#�j���l��b�<�������[�V�P��n�P]x��W�g�\���t�G��k�P�B��`�_��*��C��G
����|�q���l���n�J
�g��_g/Q��l�����(
�����������	�F�y.s��'���C%��%����q677�}����k�� (��������tC����'�{/���7�N���_���@\>�?��yX���Y�\�>s�&�>�U����	T���~�`�\�BV�+�������������HCF|�y�����ui!�:��uu�����U[`!�R��T��5T�z*U\���}]�/l��S���S��j��Q�g�'!��5����G~���(#��*��M$���i���hL�_��8��p�4�N���In�Nm�������	S~S��l��w�xNI�_?>��{J����-����E�5w�*�f���`�L@�����d����$W�x�F`3��  �%`��3N �!�&�����|QY�/�����F��Z�pu�,X �;�g{��&�#�i�p93�Y�d1q���/�"U,_��rd��G��� �zo"��X��*�fV?v0}.\+���V�x���\we�m��z����;W.1������~�.z���V�}�����u�����v@5�����=lT�:]W��A�}w�d��e�VS��|���*���m�3f��FF:�#U�5{��w �����������/������1�g5���$��m�������cG����J~�^z��u��@ltj3�n�P����������`�L@�����d����$W�x�F`3��  �%`��3N �!�&�����|QY�]|��X�Zmdn�r%���/�K������i�����</o�r%�=�����"�9x�{<��
����L���J{����3��H{jx�����q_�DE���w��������c�%���>B�uzK_�\�4����_A���I�����������������~E%S��E������/
��������jL,���os���3�����@�,Z�C_B1.�����gW�@2q�#� ��=���\W���	"�D@��8I��Y�"��S��l��9��%����O?H���S�������W�!�P}k`O*�OI�=Wm �MfW�P��>��\"k���od�J YG6?�X�L  `n��N����A0����;[R�[�(���v6��;2�B��y�2��4T�����}�����m��|�NT�FE��-�7��o����72�;�>.���
�u�ga6�$�6��{o������7�C]�.�#\W47F����#�����r�2�\��M����
��|��T��
AQ��W�,XI����L�s�8 �������p�!\:���_L��k��>�����B\.}8x8�^��>JY�a�Y.������T��wZ�|�Rf��n�os��fT���I.h1bp/������r}"������:5*Q���w�O��
k����������m���4%^n���]�o#7�e�����e�~ �MfW�P��>��\"k���od�J YG6?�X�L  `n��N����A0�G�
��;����K�^�<*���*i���
��K���p�����&�AE2�H9�
��r��=��D�Z�����w�tf�c���u��^��n%;�|���T��5r�t���)��oQ�� DJ���i�
�k��d�n�G�m���.]Gk6l#�S�;4k\�{�v%z��9�����8�5�����HYv���p���pqq��|�����hj\�5oRGs�A"������d���@
�V���r��=��73)��i%^�Q�D���(m}���h�����!���L�d������Q���+U.'�={��D�)*Q�]S�4*��������K�~�A +	��'	�����-���M
(�#�/����������wohR�����
����{o��sH[goo�����z����{��_N���'��[n�N���?��v+�=y �M�W�P��>��\"k���od�J YC6?k8�,  `n��N����ApP���l�b�wC��X�d��k�lbT���O�Y����:d���`�'���[�Ve���^�������d�:{��'���B_z�a��+ZF�1��h��wyg�P�{o���{y����*�r#3����s��Xi/N���������H�[������Q_��\�.�#\�49\�4tT�l�����h��#�$�S��RI!V_YI�A_��G~!�����}7�����}2m�����W��p,Yx��~�����������|��5����{�C�%!6�^�U��n;��1�9b�?��*U(e86�h2��u���B��Frr ��p���x*s5���$:x��a����!>�	���	�<���\��F���Ki��:��D���������$���Vs��n�U/��y8tT����������}����6/����9��7Y@������sI�)r��z�FN��  �u`���5� v �6�����|=��-��z.�c�'���D��x��s���)N��PDT>��\{x_����8W��l�ljj��w�����'�`�8q��T�hAC���`"y��yrk�Jy�����;���z����wj�*�8_����c���e��^����ne�r����ZeuJ/��C�}�n�>B�uI�C�uICG@5��$�B)���o������z��}~].x�m�3������U��Pi� ���W��Q�C_�K����z�=i�"������tD����W�xc��^���A�Q�gie��o��As��I�%�F}�����ie������.�l���6�,l���|�q���@y�|�^	a5w�|�s�l������9���y�y%�8��Ib6�n �M���3��|Y��-y����,�����W��d?��J�Ne1sA=_�9<Q�=�^���^�v�������Q6��pP�9A@ |�h������!/u�V.�Ao�L]g�F9��{n��e��0���f����*��q�n����M=Gr~��lH���9���G���;�~.�Y[x���X��[�//Y�e�;���Aob4������>B�uj{��l��@T��Y�M��(/���
����,XA�����D����Zt�������++�3�/y�����;��K���r�?�4w��	x.�.�\��t����C?sQ���~���AO�����UY�C��wS�����H�z�2
����^V^��6rV�N����ok�Xs#��&�G!�������wm�4��75��?��}�-_�Y�����|�q���<y4l�G����K[�{Rx'�:c.-n��B�:U5��f_�����y}���G�4��7y��6�w����{�_��.x���	���|�Y�������g�L�K;���B�~�3����b�}�
�y.2������("�d��,C����-���G����Q�X���{�4�;�#�{�*X =������� ���H����Vm2$����G���y��2��+b��}4���Z��4<G,�_����Sy�����r��U?6�sdV{y
+VB}���~���T������)/���q����p]��p]��TMv��_I��[�P��w���������Xb��n�m����5�}q��������@[���K��	�w+n7L3�ym�*��[�\	��2mbR
�;v���Y���\A+��byTntt���E��a�����<?�}w�Dy�K��h��?jKO��+<?@�������"�?�Sz�_"�������i��������og�w�g�iU�`~��=�p-rJ��{Xk8��5*z���!#>�9��� ���K�S\��6�G�6v�V���*� �%>�?=��^���i���s��,^������/['l�.�_DZ�o�D����l���]�����x^�[[6��,�������s���,]:���m�g�{��fs���;���Y^v�s��kg�(�d%�����s��@�	�����?��M��{��Q������5�����%��x=������e�����Z����g�� 
�V�|q��i��q����.��_0M��V�W�+`��R�����r����;k�R������ ��V����6% Os%�����w?�i*>s>�������u�u���.�^)%3�K��^~c�"�s�����p]��p]��ZM6����R��o �<y��$1���d��TQ|�$��G�����:*D`9�l �x���k�|y�]��=w�{��m�g���0��b~��
� Z�����<����d�0�o����%�"��e$�l��k`O*R��'�G?�o$��^��a�E@��8Y����J��������������k�����d�k�R)9������@B �M�[X���<y�.Z�%1�9���q1�zz�Ne���K���l������8���Y��3��Z�R9��y���$��L#��ih�1�$�����\�[��Mw����2@v�
<'���N��r�},����>�8b�5I���/���ev�����Gs�I�WzF�������N��@�����h�/QHxX��I�b5d�;�~.[��%���������j����S:xC�G�nd}� ��i`@ =������,���g�����{h������Q���"�|@�?��0���C�<Zbv7��o�� "��!�l@�"���4�<��h���>�B�Sd���"����=	�)�RP?�]os���I����n�K7�H�!8����]�/������G]�n�OW%�+��Q�{:�$���W(��e<� ��n8��� �1�3�G�@ n��z�U���7F}E{�� �`���2�W�����?K��c<H�m�I������l���M����	t����z��R��3��os�*qn����KZ�K����!y���Kjjms\NH�������5-_���y���7x��A�?�V/�!�A@��@�
}�2F�o���h�@��wQ��5�C�O����]JBl�@���"��xJ�7tWR'M���)q���$���t�7��@�������^ sz
�S��2.�i!�z.g�@�uv��v *U+����:g(�w�L�M��d(�$ _z ���*���!��7k8�, � ��:7��p���&p�����+��}�%��&��e#����r���M?�Z���uC�G�.� ����QM��75����D����p<�������5[8
IA �	�����g�P��j��|����
�d
������������`���
(�����o�[[6�����TzI��=}��'��+�>B�u�U�%
�j�@��������E���R%�����3���Kh��t�RZ�i��A@��@�
}�2F�o���h��@h!mf������	@�5����%����q6��"������Q��h�X_|���l�����H�7���u'���u�q&p�B���`|�/~��RttNJ�m�%%���t��Y�U�q���a
�����_W43*ipm�..l��U�rp�9NhBv�
�@�E�sFQ���hb2:���������p��rD
��hFT@@ ��n�p����07�N�SF\��o�k�-[6�����#�<p��-�
���l~Hq"3���z{kr*�Q������@�P��/ $p�}W43*	 i!�/�����2�����|���
�����������U,_�nh\���)N��Q\��tRx����M��+6���{MQ������9��V����O�g-���'�������f��z�rT�x1w�Y�������y��lxmu�\���G���^N.�y��	��i��4��U��v�p�Yw���U���.NeK�2��j�S�SNQb�	����K���G����p�r��������S�E���8s�����CGi����|���������B	i!�;5iP���S�
�/�~�v�1a�^�����9�^��j�M�PT���&�&N+C����y�:���K;w�I��/U�2��d��R��t]��T�H�f���O�py<9�Dz.��E����c���my��ZQ��eJ�R���v9M����o���W�<��Y�'��S)5oR�jU�H\�����t�N���vhy���8��~ ��@�M�}����V�����F5A@�K\D���	���}ELK���n��:�����@�i�"Z�n��������M�<q����I�fkbp�g:S�\W�o�	.	����=�@�Je�u�FZgx!�������������P�G�&��_������|��k�!�/w����������s����e�6�J�x�K6�/"$�H�?j�nnk��r
������K���k� �IDAT;Sg��>L�i�c��Czi�%��5}^�M!��K;M�i���y�i�q9��C��r=]/D�BB<����+��g!��p1������;���nl�,^�}pt�R��gzn��U��.wy��xuc�����_1���Y&�o������p]��p]���&�����|���k@��=���o��( ���z[���~'�R��x��w����N�#G�������/>*F��U�6D�����'Nj#n�3�E�@{U~��/i�~����+����v����X�655���5��X9�L����_B|�eI����zo"��sP� `�l�URH�^{>�>��N���w ����B������I�|��t��������6W�������K!jf�����#btn1��������N5M��`��5�p}]���"W��J�g�>�p�,\A�s+]q7�?���q�nzw�d����E��A!���H��%M�%
�j�����4����k�&�}�Mq�^�}
���_�(1:��cz�U������D�z�YE(e��)'OS��%<i�+kE'��W;�_�y��zY�_��:u�����n��'�gT��|���QA�d��oQ96w���,I����w��`~�2�+�_g/�g����3�G��Eg6���|����m/����/6@�*�|����{��mo���=w����fa���<n�[4�G];�Q����kkW1�TNt��%!fN���wW�X��#a�������H�x���Xeq�;���9��;��l�����5�i����my���h��e����m�9-�PNo3�uF�`��C
�������p�����p]���&�����|���k@��=���o"[�����	1w��9ea�3�]e������he����<���|�uEGu-���K�����a�-�zv��s�Z���'��=A���������]��+�+������/[��6l�E��_�dUN�a�]�C��
�a�g?�����Q���&t_��J�_W��?k���Q�p�p����5��
>�8��Y�Q&��f�L�N�������Q�(?���T�V%���:R�Ng�]OB����|/ ���m75��w�d8n�����9���s����f�
t��M�t�"Zzv{����<���\}�,�n����g��������Z������t��s�/.���������a�<��Yv����!#>�Gi�z?�[q�R|4��i���,�r���<��m�������8%?3W&`��Sg��
�vi^$r����~��-����+����`���B:����KZ�K�0�/=p
h�'����Md��s�w�����
�-��B�Y`W�76�Va��_�x�������{����x�I�,�8��n���9s�V!V/�zT�zE1���=��������]��F��{��5r����y=���+��M<-�������/�n��������]){	��]� +�������d�% `�l�JH�����O	,���5�0e��yWN�>�[�V�]����6��Ud�%�������<_mz�R�RTC��k7���_��uy��\����P�y����W��$�����!����R���d�zb����K�ez����i���G�]��6����c�8l��U��VI!���H��%-�%
�j�����4����k�&�u_��$�,^H)���$Hg�L�M9y�^yc�6�+�C��#}<��2��?+7�'V�y��{�~���j��������V�f#p�Rh����E1��,�o{�y[3e��_�Q��%16@@G6_��Q�=�}`���^�]�%��-�aK(C��[���*Y��Y9+C�Eh�KO�Mm��%��
_��)�����M��������O�v�>#��	�@n�	"A@Gv_�  �# �#\'�B�ur��n  ��/=p���} `O���.�P���
�/?������1y��`���;e������|.�;��f��XO1�5��w'x�y��m�����\�<x�/�i��=���A�l�	�������y�
M��#�U�/\����<K��X!�o���"��-�������4y�}��u3���s�w�hbW��-�]Kw�M���qS47�2������G���?�a��,��:}V&���%�����}�;����x��	���%���N�m�Kv�/"$p)i!��������F5A@�K\D���	���}���������'�8%:�?
��f�������gt�(��iy.�����s�_t����fFcbrz������~����B���1�8e��8}��(�~��c�����}�78�L��f�4w�*��=�
�V��k�����k�j.��Hl���� !�O^x�*�+���j�,Z�t�gu�L��,��5�;�Y(��E?���v>6w�+6?:���yxkT)�������x������wR�F��t���}��1�~�"�>/�|���W#�\�_�����&�[V�z�Smn_% `���HH �J�>B�uI�C�uIC��  ���
"@��p����l[@��wZYp
6�	�����'|�}�?��s�6�_��V*�����3[7p9�G����~��N�e�6���W���$�S���;~���q��H��� >���P�1����5fm��G&	hi�Y��:���_-��Bg���i#����c�8Nh&���D�+��P)�N9���.���(�����t�qfn��d0l�Lb���k��A��p<i!�:���T�K�0�/=p
h�'����Md��n��:�o��o�������J\ f����FS���~���'F�v���K�Mk�����iy^�Nn6��2���m�:{	M�u�a��\�L��D����Y���K����{i����(��f���kh���,�qS��t�[������L��]<Oz�
�����H��Y:�Hc��n!�/:$���Y��H�>B������pg�#@@�A@��@�uF{��"���]����(W$����"}z�����=�����F���p��X������6w ����������	���JO�����G��_#~����^���i�i���Pm	W���0��t���W�S�����EI����|�4P�t�l�L��������M;�hb������^��~c���_|���\����JKkv�V�
T����?�L(�<���a����TO,\
����~��� �& �#\[7S�
7t,���@d�/=p#��PZ`�qK��5e���)�_��FO�I<
5������p��m>��.���������������;��2-/�e�F<�"�<
,�@�~��v��zo�W���0��t���3��Z��J�����~�s�gu##n��ww���S�4k��}�'7x4m��o�������$ybsk����������\�@��(d��+I`N�����s������p��J!,��DV  E@��@���fCaA@#��B�r���w�>G9s�P������d�%��F�n������*�I�
������?Y��Q��G'�c_�_J�L���.��_2@��$���p�w�y���%����R��������p'#�k}�2��*e���h:�b�r�L�
T��������v_f��Xq�WO��"�.�8/���	H���-��A�
Hd  q�K��k:0�@��|�=]WO�B��c�'����*��n��Z�r9���I�6��u�u�p�\W��=�N9��%ki��YJ��x���t��u���4p��r�B����T�����&to����W������p%	,�E�l�"����,^�^���P�	��K��eD��<T�j��}�"�fD�}k`O*R��<���%�-OC^�f���[���r�2�\����S�����
�C��Q�]KWn�O'��lC����J	���>N
 `k�>B��u3��ppC�9��D��7��
�&���AF��R����Y<|�>�r:IH�����z�����R�2���f��Ys�z�+�^S�����'^a��CF|A��<��pC������Ko7.o,���g���p����`�ZC~�wll.�+\��G��|��xO�!��[�� ���:g���P�
��J����[��R�W��K��y�u�F��Y}M�d��/����}p_�y?��VA)��E����(q\����*�/������������
��������K���
��y��n��=���J��is����=qp=(�F��a��S������pm�L�+���DN  �E@��@���vCiA�	���u�Q]���k4+��}�	����CG)&&���,J�������w>���
�V��V�lK��r������ @�Uy�U1� ��@J�Q7�����"6NhN�9�9'[(���[����F@�W� @���~Q#�_t[ ����g[J���r|(��s`���7Z�~[��-�- ��;~zO��V��V�lO�@*)�G�8�S
x*��8	 @�9����lE���o��O9h\�J���
��E��~ �m����n�m)�����	 @�X������hE��mE��(���_���=Z��[�=���p��N)��,b�$@�����s��"
X�E��> @`h���q�*�* �52�E�@�R��
�q���n/.�'@��b	8+zC���-� @��r~����hU@�oU���"�R~T���:��������pl��V�(`�1*�D�������U�(���_����H)?*����R������%�X�x�
�V��V�lK��r������ @�Uy�U1� ��@J�Q7�����"6NhN�9�9'[(���[����F@�W� @���~Q#�_t[ ����g[J���r|(��s`���7Z�~[��-�- ��;~zO��V��V�lO�@*)�G�8�S
x*��8	 @�9����lE���o��O9h\�J���
��E��~ �m����n�m)�����	 @�X������hE��mE��(���_���=Z��[�=���p��N)��,b�$@�����s��"
X�E��> @`h���q�*�* �52�E�@�R��
�q���n/.�'@��b	8+zC���-� @��r~����hU@�oU���"�R~T���:��������pl��V�(`�1*�D�������U�(���_����H)?*����R������%�X�x�
�V��V�lK��r������ @�Uy�U1� ��@J�Q7�����"6NhN�9�9'[(���[����F@�W� @���~Q#�_t[ ����g[J���r|(��s`���7Z�~[��-�- ��;~zO��V��V�lO�@*)�G�8�S
x*��8	 @�9����lE���o��O9h\�J���
��E��~ �m����n�m)�����	 @�X������hE��mE��(���_���=Z��[�=���p��N)��,b�$@�����s��"
X�E��> @`h���q�*�* �52�E�@�R��
�q���n/.�'@��b	8+zC���-� @��r~����hU@�oU���"�R~T����'��Lp�$@� @� @� PF�k�sU��R�p#�nKs�� @� @� @��"���������B�;��� Pd��"GG���~����U���Mc!@�@cy���-HS ���
�8�S
x�K��	 @` ���d|O���o�c���K@��KR;(���_�8�%�H)?*����R�;����,�X������o}� @�Jr~��i,h, �76�i
��p�O)�i.i�&@������=��X���� @ /9?/I� @��~9���t^ �����WJ��rrD(��s`���o�X��}�J��*	��U��� @��������)�R~T��s<�������pH���/`�?FzH������$�C��r�����^ �y����n�_)����	 @����E����/`����+�$ �W)��B����~c#[ ��@J�Q7������6jH�9p �(���[��!�������! ��#NzI�@�R��
�q~���/'G$@��"89:�F����[�������_�h����lA�@�)�G�8�S
x�K��	 @` ���d|O���o�c���K@��KR;(���_�8�%�H)?*����R�;����,�X������o}� @�Jr~��i,h, �76�i
��p�O)�i.i�&@������=��X���� @ /9?/I� @��~9���t^ �����WJ��rrD(��s`���o�X��}�J��*	��U��� @��������)�R~T��s<�������pH���/`�?FzH������$�C��r�����^ �y����n�_)����	 @����E����/`����+�$ �W)��B����~c#[ ��@J�Q7������6jH�9p �(���[��!�������! ��#NzI�@�R��
�q~���/'G$@��"89:�F����[�������_�h����lA�@�)�G�8�S
x�K��	 @` ���d|O���o�c���K@��KR;(���_�8�%�H)?*����R�;����,�X������o}� @�Jr~��i,h, �76�i
��p�O)�i.i�&@������=��X���� @ /9?/I� @��~9���t^ �����WJ��rrD(��s`���o�X��}�J��*	��U��� @��������)�R~T��s<�������pH���/`�?FzH������$�C��r�����^ �y����n�_)����	 @����E����/`����+�$ �W)��B����~c#[ ��@J�Q7������6jH�9p �(���[��!�������! ��#NzI�@�R��
�q~�5�s�1<,��"a��i�����oNm�Z��o���"��&L
#^~���6 @��j��X��������lE��*��U��1 @�yy�y+[ ��@J�Q7��2|�y�
���EXc��
�-f�u�����u�����O������D�a�M���,�\b��m�����������HO�L����c��X��}�J��*	��U��� @��������)�R~T��s�,������`Xd��k+����
#F�F���������L+v�-���k�0|�l����t_|iTx�����W�.�u�L����(�90��)������lI�����e��� @�5y�5/[ ��@J�Q7��2|��
��~X�m�aaT,�^y�������I�_pen��:�S��Q�����	?�����s/6u���TJ���J����15E������� @ gy?gP� P����n��e�a| �k��jW�~��K���S�.���{���E��n��������	��]/ �G���>���^����T^@��|�
�}��>> @�W �����^��/��B�+gf�%��}����{������]���?�{�j��N�8�3~�M}O�	��ph�@C��!�
 P9�2�44% �7�d#H)?*��	^��o��j��O��:����M�Y�������m���K�sg]��� @��>E?�����~�p�@��J����� @`&y&_ @�&�R~T��!/z�w��������	��!�}��ez��>���j���	]ruS����(�90��1������lI�����e��� @�5y�5/[ ��@J�Q7���|��v
;n�Ixi��p��_oj%�}��a�e����p��~��>6"@����~L/"FL�y��y+[ @��r~�#��hM@�o����#�R~T���z(>�������Xu���"����;�2��0j��p�����3m�[���y��+����a�l��~�l�����.���������3u�����O�9��=�����sr��r���n���������|��>�&���}�2��>��@��	�90=E#&���;��J��n���PwLtO@����# Pl����n��y<���;o��{�g��W��+��}���Gg������~��;��gXz�E�����f_?���p�_�jvs� @�@E�>V���R��-dXt�C" �	�F	 PXy����1�,�R~T���-���u�p��v
[m�^m
��W�>��'�3��^56,���a�e�
���j�{�9�[�����������g�/����6f6���;V[�V��8qr�����l;u��������v�~�*����������y�w��02<���}��<yJ������q����HO �s`zrFL���o�c������)i�!@�@1��b�A/(�@J�Q7�����.��=w��6�������~&�����%[8}�^�W�^t������q������s�n�
[z��?X�
������
�#��9���}E��X�C�yH@�/P0t���;���R �����h_.��������Z�"����v��@+ {F�i��VXn�0��	�3��Fxm��~7W����� ��@����	��~�@�J* ��4p�M��6��6��F�@�R��
�q:��C?���E�m��xK��O�z9jL����Zo��Om���������w�~Y|I�9�q��� @�
��
4� @��r~I��hS@�o�nT^ ������`>|�a������s��3j�s���^$g�x`Xy���'��|�������n�,�$@��{��� @�M��M8� @��r~	������?<� Pi����n���
����N>f�����{{������k�m�n���v��C�;��g�*�6�iChQ`���gsr�~s��
. �<@�G������A5G�@eR��
�q�6�[�k�p���_[_8����S�5��������Z���3��{a�L�*��D��I`��������~�@�J* ��4p�M��6��6��F�@�R��
�q:6����E�����'^��}��E��+������9]z����U���� ���`��9uC3�!`���f�T@�/i�t�m
��m�������p�tl�?��6a����-���9'L��z��d��	�����/��O��>6��
�3���r�90�nh��6��6��B���
��%
�n @�My�M8� Py����n���
��[o>�����1�m�Z5�n��p����|�9�O=��L�*��D��I`��������~�@�J* ��4p�M��6��6��F�@�R��
�q:6��\o�p���-��|��p�?oz�����������?��������i_��H|A�9	��S74C�@�ohv!@�@I���N�	 �����&����@J�Q7N��|����}L�e�~|�
���hz����a�������������i_��H|A�9	��S74C�@�ohv!@�@I���N�	 �����&����@J�Q7N�<~�	����l���p��_S�Mk�P��p�YG����n��]���\��>
�����r���C74A�@�ohv!@�@I���N�	 �����&����@J�Q7N�<���������8.�����;h�P��k�������nC����?�6kD�!�
 @�M�<��m�nR��$��	 P"9�D��U�  ����	*)�R~T��S8��g�O>����j�,&L��q�/��?9��i�M��{�'6k������|���*�H��@��Av���)`��	g7�P@�/a�t�����gW*-�R~T��S9��/��B!��rvk�����o�k���G��/��'O	��7OX~�%��[m�����E��#��]Z�}�U��;���	 @`�y��� �������= PV9����o�' ���f/�/�R~T���9��/�����O�]+��,���;q��0�<s�|U��}>����n�6�p��w�m�axi��p||�n3�/�~XXz�E�o�'\���6��m @ A�<��	�2��
X�]�wptT@��(�� @���~�C�T ����'a�>����Fk���8����}��[����G�7�����CO�����A?��l�p�~��?�~_�����g���:�{�-6]'|������|p�
|C��@��@�tN�����# @��r~�#�������YoG#@�<)�G�8/�2�s�1<,������s�1c_
�_�xsj��a��c����~�3y�[9��Z������s`z�FL���og��������������ysG$@�)�G�8'S
x9��^ @�@��;%�8��~�7�"�* �52�E�������U��_ �����kJ/��4�S�90OMm�����YoG#@�@7��n�;6:/ �w��	(�@J�Q7���^�%�� �)��NI;�����M�H���
��E��~ @`h���q�*�H)?*����R���<���p�SS[:+`�v��� �M9����M������7wD�!�R~T��s2���c	�%tJ�9�S��C �7S- @��r~Q#�_yh\�J�@�R��
�q����/O# @��<������
X���v4tS@����c @���~����r��p��L)��X�zI��p�����_����T�(���_����F@�W� P~����n��)���� @ O��<5�E���og��������������ysG$@�)�G�8'S
x9��^ @�@��;%�8��~�7�"�* �52�E�������U��_ �����kJ/��4�S�90OMm�����YoG#@�@7��n�;6:/ �w��	(�@J�Q7���^�%�� �)��NI;�����M�H���
��E��~ @`h���q�*�H)?*����R���<���p�SS[:+`�v��� �M9����M������7wD�!�R~T��s2���c	�%tJ�9�S��C �7S- @��r~Q#�_yh\�J�@�R��
�q����/O# @��<������
X���v4tS@����c @���~����r��p��L)��X�zI��p�����_����T�(���_����F@�W� P~����n��)���� @ O��<5�E���og��������������ysG$@�)�G�8'S
x9��^ @�@��;%�8��~�7�"�* �52�E�������U��_ �����kJ/��4�S�90OMm�����YoG#@�@7��n�;6:/ �w��	(�@J�Q7���^�%�� �)��NI;�����M�H���
��E��~ @`h���q�*�H)?*����R���<���p�SS[:+`�v��� �M9����M������7wD�!�R~T��s2���c	�%tJ�9�S��C �7S- @��r~Q#�_yh\�J�@�R��
�q����/O# @��<������
X���v4tS@����c @���~����r��p��L)��X�zI��p�����_����T�(���_����F@�W� P~����n��)���� @ O��<5�E���og��������������ysG$@�)�G�8'S
x9��^ @�@��;%�8��~�7�"�* �52�E�������U��_ �����kJ/��4�S�90OMm�����YoG#@�@7��n�;6:/ �w��	(�@J�Q7���^�%�� �)��NI;�����M�H���
��E��~ @`h���q�*�H)?*����R���<���p�SS[:+`�v��� �M9����M������7wD�!�R~T��s�'����zI� @� @� @ M�k�sU���C��[�yn� @� @� @�P��@
� @� @� @�epnY"�� @� @� @�T^@��!6@ @� @� @��"��[�H�' @� @� @��P��|�
� @� @� @���(��%R�I� @� @� @�@�p+b$@� @� @� @�,
�e��~ @� @� @� Py����	 @� @� @�(��nY"�� @� @� @�T^@��!6@ @� @� @��"��[�H�' @� @� @��P��|�
� @� @� @���(��%R�I� @� @� @�@�p+b$@� @� @� @�,
�e��~ @� @� @� Py����	 @� @� @�(��nY"�� @� @� @�T^@��!6@ @� @� @��"��[�H�' @� @� @��P��|�
� @� @� @���(��%R�I� @� @� @�@�p+b$@� @� @� @�,
�e��~ @� @� @� Py����	 @� @� @�(��nY"�� @� @� @�T^@��!6@ @� @� @��"��[�H�' @�k,s@IDAT @� @��P��|�
� @� @� @���(��%R�I� @� @� @�@�p+b$@� @� @� @�,
�e��~ @� @� @� Py����	 @� @� @�(��nY"�� @� @� @�T^@��!6@ @� @� @��"��[�H�' @� @� @��P��|�
� @� @� @���(��%R�I� @� @� @�@�p+b$@� @� @� @�,
�9Gj�Yf	K-�d�k��ly�����1��� @� @� @��)���c��
:��a���X����z+����������v~$@� @� @� @ -����b���]:�1��3����c_�z���ag|-��R��S����~?O�01�q����� @� @� @��)����q_|�����3����
g��=.���3m7��s��:>�3�<3����9����� @� @� @��
(����3����3�	�6Yx�6mZ8��3�S��i�������#>���c��f�b0�a�
�Xw�Z�y��.�@l���]!<6>_�_O<���#!�k3����;����a�%mf��6������0j���}�jg�������~���?�l�1�Yn�e����Zd��`�>��0j����K/�/��<����}�WV��p�
�K�����0y��m����U�[�g�0�\s��n�;<��S�n7���s���p���;�������������o��F�����Z|_�r4&N��J�n���5�Y=,���a�8w��c�0nl��q���������'Mp�F?�:����WZu���������js`d��������k;������z��a���
�-�~��������s��F���E� @� @���n|��������i�����i`���
�����]�l��;|������o�Q����q��;o�{������9��.�~w���UV[��s�oF�<:�w������a�����l[�����x�	g���X���&���w��Vd���o�n�������e����s�N;��0���������9�����5�����|��G��]���o��u�%����g�zV^m�p���n���|7<���z?����xMxmB����a��Z��F�o��a�X�_���S�G<p�����1�:����N�c��
��w������c���k������)��@`�?t��`����]a�zM�?�x+�����kocF����	 @� @� @��p���g�[P�:uj,��^{��~��=w�������;m=�W3�o����"��>�������V^�<��p��<`!,+Z�y��a����^���>sN��qj����G~����ro��Nv���p��k��f��g��q,���������������[vU��������g�_��7{?7z��[��?��>�=��G���~E���}�����]����_n�%\w��{>�����k��I��zC�[����=c'�?<���{��c���WVd���?
�>�\��Xp����
Y���Wv���.�&^E�|���"�����u���o�[n���C�q�s@ @� @� @�Q@7"����Bv����;���Kj����f?�p�[$u��3=_7+B��������S���-]+��5�\}�?.�V��_�Z����>�g�m����>�7s��q�6���yx��O�������cN;��������p�%���,ns@��Z���]�J������^�WD.o��X�=�K-������"���|%�2���Wk����_,���m3�f^���'��������r��^z��f�	�}x������n{�����\}]������l�����m��[,g�9�Y3��g��������8�>���fz�tv���/�����q^NK/�T���o��#��8��^o>�U��m����c���^u��;����~D����K,R�c��+�{����sN>��cX3�?~��}��Y���WF�	S�L	�,�����K�+���������~o�@��| @� @� @�4#����|���V��j�[����t���m���f�Y����?VY��8�
�7_����}g��o��a�}�)l���}�����^��>�ef,�e������k�E^�d�i���M�e�����O���������x+��}��>,�;G�5�^���s>w~.���3{?�����N���X���eWO�����Y�=��/�����
���s}WXy�������Hz�M���dh@���m�q�����=���?�Y�4q����s���z��hV4��
x�O��N����}�����
���o�-.�����^r�%j�d��>��sk���v������m���?�������"{��{�-������?����\����7 @� @� @��
�(������v+��oE���_����G��L��m��;ca0{^gvK��~�nx��h������Mv;��5�+��k^�d}k���]Q{��G��b�����[����7>��g���b�\_1*���?�|��7+��1��|��^�Z|n��a�
�������S�����{��S�����s�9c�|�X�����ll�8��3��8� +�s�!���������_�T���Ya����s����#��'��i���n�A����������Y���++���H��9�YQ9{-��r�������W��������++�no��f|��o~����zm�� @� @� ���@�wm�ix���������������fW�eE���?�1��Y7����'}:,��2��e�l���k����������x��K/�~�]�*���N��f
�3>g8{ov�)�O�3�V>�[��q�~sS� x�)G�~�d��������r��}n[������F�Y�=�d���v�z��z�oU}��4,�g;��<���&�/v���z�����W\������g&���p�w���zw�����.�i�go @� @� @�C)��u[���V���O����w.�^�g�.oA��bw�y�
����3]	������H�u�g�x|�O�����~a5rt�w��,�����'��I���'���_y^�j'�l3��>!,��B�c�������pk��v��X�m��G���Xd_�v�7��g�tn����p���UV�����g�~)����O��2p��W�fW���28�vnv���WvUr���=��/�&���{z>�����\������7o�v���������9{���g���9{��/}�����lT��&[n>��=z�x*^I��!� @� @� @�
�Qy0�������Gv����3"|���P�g�>l��fa��v����'�3z�o������<�3k�'���{�U�=��
�y����Qw�e���#{�P��ba��x�����)����J������[�
������5�Y=��~=?����c����_����������a�u������&�u����K-�T����A���;������������B�=��n�;���_�|�o�Y����[#_z�����&{�����g��Y���p��y�}6�� @� @� ���nDl��;c�(�G�x�]y�������v[n�yoSw�gu^����~n����Q���n�����Y�g����Mg��|������������f,�~b���o�Q+��t?�S�=��O����j����[���_F�8������=:>w����v��m��<�����=w
�����#gW?gWA����nV��ye���R�y��m�?�z�a����cx��'�w�zY��]>�����[�~��}��?����������u|�+��v����o�+<���u�N��!@� @� @�4P��B�p��n�xe��}|���w.�t��f��Z����	�m�N�^7�6^�����+�B8�R����?�-���_�|�^{��f�K#����N���y���Qw�-7���}o�^�r8��{?������W��W@�t�M�����K���u���n��<���pg���#�x4|��W�g��6���_���z|�
�
��Oo������z?�s����X����n�{���~����7YA����f6��&����O�'}*d����6�=�/	 @� @� @���pc�[-�fW�fW�N�������9�U��o���C�=0���J�����_���������*��j�����>~x��{>�]����eW�{����V����>7*�n���a�]��7������/�����V�3��C��D�e�r=����y������1��9�;�[	�����5V	�����9*|��z?����<�*��V^���L8��O��2��	������<���C���������`�d��3���"�.\�����Y�>+�g��� @� @���������8�W�c�.}:lQLl1Q�(�H(]
b"��� `PJ(��!�]r�����{N�0����z����:������w�Z �@Rp�^4n�K.���{�c�#Q�hx���W=�����3�S��T����5�k^��2g��2����~����B��m9���UWr����p�!��
�eP�����F4n����n�E�����n�����K�;�/�%�v��U�U����i��������9��>���u.�q��H���:W�������;���������/�N��\�F�L��Z���/�2�K��e��q��1}�����r � � � � � H���D����=rY��S����:��1!��\yrI!��N#{���`�%:�S���v���������nD��Q�?���0�'��@��������6t��]�wyN%W=Zi��j�*���:����������r#���}�K��k9�0���2g����{C���^/9S�j������Y��W���]f��+_~���+�>
p;7�6�Se�Cb����?����|�$���a3�|����s�����5+��{f�@}�"� � � � � �.\�M��^�T�]�/|x[�\)y����HM]/����� ���XF1��KOyF�3�L��������S���&W=���������<�i[��^K���H��rH{�f���j��M��9y���M�����y�����Oe��e�K��x	p�z��T�s���	_N����;��>:v��tz��s>�
��e���k�o����uep�w#�:��$J�B��f��<��x���h�?g��; � � � � � �@$�F)������r��\�s���KCvv�q�mR�����}]���C��C�������^y]N��+��-�t0��f���9<��[�y�?�~r��U�6,\�����~L����:�q��5�~,��
n�^����X���g��
�v�!����N-��k+�(T�a��@�.����n�d����%a_[����	��2a�	�'�
���������#G�[��kV�,>���R�l)��t]h]�� � � � � �D+@�k��	p����~��T3kh��W��e��=�n�W]_�c�v�3WN�:�Z�D��n�Nn��:����;�o���n$W��\�h���zM�H�����VY�b�0����5�7s������9�������7L��i��%\���f�qo�\�������s����-Z�e�?~���������4�����N6��v�I����:��>z��V:������}6@@@@@@�Hp�Tjp�#�#xu���&��v���9���=�H�l��6�L����4~��k�&W��\�h�"	p+_~�4{�~�g���W�?
WGR��PF4�>���SG$n�����w����c�N�a �9s&)k��.��4Sjd�Z��=�-Z��<��ig�hm����k+����4���-�g����}���Z5�F��'t��!&�������G�w�~���J�3��u��_g��AJ�+)�n�S'O���:_��:�����v��2����}6@@@@@@�Hp�Tjps���[��Rr�����`��������9��(\��<��~)X��sx����F�A���M��5���G�
�t
��>'�f/p>���%��;H�E������M/9q��uY�W�=
����<�e�l�g��m�k��vk%
&8���
��W�N�6����:���M[dP������9�V�6A�k�C���_���;��7�9����������Z�N7�q�&�����C��CG����T�������i�����k����^�J����u���GU��:,����Sf�����v���W��.\��s�?��s�
@@@@@@ ��J����$��u>u=�M�6�fT�o�����I��%���u%s��z�Ut���7>�U�W�����
^��mX$�^�/!��z�y��z|��%��|�-�����{$_��&`+(����	��A=��M�7���p�0Sa�c�������������������nw���y����5�����.t����Gu������]�>��/=7v�83���sY63Mt���@�)V��g*o�H���z���97�6��g���d1SO���������V�7o�l��EM�^�d1�c���{�%��q�y��9]'z��9f��������BE
J�����U5%�<v9v��tz��������u�9���o��u�w��-���y�;O.�pQ��u-�S��\��W���P@@@@@@�hp�X<���W5�/77�ItTc4���S2n�xf�x�?������:�6����G�#�W��P��<��}�#�5i���`M�kO��u������X����YA����m����5�W�����b���!h�W��~35����?�~3x[�������@�t���Q�Z!l���c���A�G��������Q�����KG�����m�B��Y-5�U���G[�u�I����;#d��1��M � � � � � ��y�%������/����3�r��x���2��Q�n����i���O;k*Y�H����i@����H�z�~{[���u~S�o�n�&z����n����t������:u�{�_�zZ���������a��]
�_���	^5����c)75j �6���up�w���k���w��s.���|!���-�y�� 9O��]o_��U'O�"�&��������7#^z�k��`����0�������Y��!�5����=�F
�IGhO���X��s��w�s��qM��� ��NQ>�����T�!@@@@@H����_���	|�? ot,�L��"k�������M�:�S�.r��`��L�2I�j��f��R�L��q����U�d�	'u*a{ti���{mih���{�1Q�����������f����o7�B�`2\�@�|�r���U/ur�#G����;��%���I��Nq�`�jU����?Z��:~��_Cnir�U��A��5�o�v��Y���h�"��|�sd��5�J�+JF�Z��*���W�)�?{od��Z�u\V�R���k=S��9~�����SvX����W�X��v_��������������2e�����w�>�=m����/kje��d4u�����Q�|�D��}��c����<���@E����^QY
���
��l��#���E��E-��������1@@@@@@ \���sK�K*F���k��Q���h��*2�6#(����~���O�;��H�-*j���pY�J�z��e3a����&F��#W��?����@F�����i@���c2���f]��z����}
s�Y��,Y�����d�y~5�N�b�}��j���c����G��6�N{��g���<�e�����n����@@@@@T�� �h�m��y�����p�5��\��#� � � � � � �@�	������]�4C�S�F��::v���������� � � � � � � �>p]�\�Xa��zk�Z��69��'e����d����KM�vP � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~���G�@@@@@@��n:�h>& � � � � � ��~�d����;O��("��eZ���d��}A�s@@@@@@��n2�{����r�
Y������o'�������� � � � � � � ��p}��P0�+YL�d��;��������kDCX)V�����9������.��<�A@@@@@@H��g��P������R�|��O���+���'�.K�,��G��3G�s���T�@�9� � � � � � ��T��t|����u�%g������+�Zv��O$zl�T(-�?����%�9��X\�����e����$�P~��'��:uJvn�e��);���5����������:2f��9jg����x�9x�`���s�2g�k^-���KV��is�s�nD��S'O�^�������{d����V��.k��ri�*f
���7��7�:xH�o�a����m���uG_kE����5��������-+������}eM�lF���,�/�������j�*r�.���|��E����\V�R)[������i��D��:P�rE�|�%�|�
Y���P�F|NG������h�"�3������<�;e��������B	r�����y����r���e������?���kOD�j����BJ�).���/�W���>�H�cj��a}�YSg���%{����*K��B�{��a����k=G�
y}�����\T���]�7_�y?u�_�q�Y�g~�������
�*�V?[�E��,X����}J�-)W�H�%��>�~��w�~�m[���U�e�����B�'�3�o�;�L��������j������hJ���T�wy�j8� � � � � p�p
l��n�:W���8�\��R�JJ��
C�h\
�nir����;T�r��S'M�_�*��Otm����m�Lt<���/�O������R���8�������g?�F��\�j��0n�	[��{�|�����;n����&
U4������wLp����;���m�l�����q�{L�\��l��
�v�!�nIt�|�r�T����!�����^;?}od�A��}��~I�3��2�� ��y��~�nhw���I�2%B�zZN��9����(p-k�����k�|�rN[��N��r������m
t�s�F�j��Yck������mO9r8��T?c�kj[��0?�x��N������
����;zL�N�.�&M��f;\��1���

�s��*������,��W�PQ��W�u�,YO5��&���J�^�Z.7��W�x�������2m���������g&��0�1�����O�� ��������}� � � � � ��^������*R���@���IAG��Z��]
��Q���*��D���:b��G���.KTG��|��3
s��2
��jp��X�;+������ ���=���y�;hF��j�W4T��$��Z�O�'��	?�|+�����%W��a���e�����C9�nm�P������������u��mty��d�������F�7����OS���?���W�5G���=�
��ju���n�\6��������h64Dl��!V}�����h�u������5?th(:�4�r��������>7�-�$|l~�����!��r��fp�si��[����M�m�kQY����~3������������<��
�44���(�1�Yw��.�(�=�p4�8�9rD���;��^�b�F������K���|gB��H�	c&���x��zf���|�J�f���H���6�R������� � � � � ��N��������V���^/��}/I��S��F]��l0����F���@���KA3�jA�������t:���_���?����*���c�Z#�v�������e�	��D���.��q����_~�wC����y�?f]���:4[�l���-R����Z�~v����*^|�<��C��R���[7o;3��<�Z@
���Sl���������I�l����u��h�������������6L�2C����>�5zr�E�V�w��j�l7y��S��_�(����������?��X����&�id�?n��,�"�=���e�����ac����?���@G5n����6�&����K33=��V6������*61����E��CG��?9��6^����3S8�EG��s
�5����y�q�/�������[�?�:M���(����_��l�*hp�S����S3(�`��%�|���m�����\bF�?d�`���36��h�}�����:=s�E��#�u�����l��-����S����C����]�.}:���Zw�.����N?����E�3_B^��5�l
���1�|%�f��o������<���=o� � � � � �@������t�u�Y
�b-���!w>���H\���������?���:�2��b�z�����]z��k���k����C�V�MQlhx���)����h����(P�+��S����wZ�z�uo0������Y��-=�QkX2�L�������iT������9������>�p�S�����������6�#����ug�^��Y�V������������z�m���m�����GS���O�n�	�5��)���[�tqy����o�O�������gc���S2��Q��}�nV�W
���}�4~��c�
������������.��sX����h-���Z���~�h���g�>���WG�~0�#������L�|s��<U�T�#���~�/(d\��k}��w�~��\?3���@?��������aA�k�X� ���W�AI����r��u�_r����}@@@@@�� �5����6�N�i��]��Y���~���7������4�Z6j�<
�.����r��mh�[��Z����)n��?�7mv���W�i���gzF������#����F,���+����F�d�bU�AX���$
�uJ�j�/w�N�Xf���9�
]��f��R�LC;������^�����������q�����GJ��n�r����[8u���W^�4@4@U4���LK~���=�;k���[+O�k�~2�3�u��5�����Mv���o��}�%5�v���GM�]���f����-��h�7�W�Q����W�^E4��up�.Xf����G���W���k-%��g��1u��5>���}�~��mu�b����_����b��3��N�:�2\TA�>��3b{��e�����/�����&"� � � � � �����#��}u-it�m��V�����o�*���y�G�`�?�-��s��wy��;�k�5mN��#:~.��_�j}�g[��*��^���S��2��������k;�n����P���?]�����k��Q/�n��fT�4��X6����WCT�B|���	�#����;�L�;j���/���5Yu�w�������O
h�����+�x�N�n�J��i�m'�VZ��v�g����e������Zv_�vJ����)�S6����������)����"	p��:2�E�'����g�F'Z��_?���@@@@@ npM�%5��c���{
)�x���i����i���������oE��
W�{���FU��I����S��X6��?���:��]>|�Y�p�\w��r������I��P%�v��J
g��f������]��HL�0H+��7^���6J�2%��z�1;r��s��H�WG0�T�vY�r��C{?�W�"���>��{���Sf��_Lp_��k���B�n�?|^�`�|�����Z��z�u�uC����n��X�;)��3?
qOu�?�H�g���i���q�{�R3�.:���������~\; � � � � �@�
���KJ�[���r���y���l3��0Ox�� ����A����]���>���s�E�q���_G�����d���j�NY�� u=Y���}��%S����]�j������M�������=�����>-���s�=�|���@�E�b���o���_����w�Y��35q93��]�-ZnM��_89\��Q}O]�u�/�'Z{�nO��.ov�l��9��g~<���#-�/�D���p��k�n�c�Ku��8�z�+����8������!�]�^[G���V{WtM`��LjI��h�"�b��=M������s,���xf��	���J�:>�|�@�6��'�u��@@@@@�Z��t_��?�'�
ou���C��`���m��(L����V����e��U��~����������&<�#�}������.���:�._��F~�m��k�����vy��Y�f�������H�����[�4t����e���-;<+�Ks�ER�sq�
{�M�!�DZ��3g��\��u����
�����S�n�����I��\�s9����$���z"9\][��m���~C
gO�#+��m��>�5�Y�����<��2���1#�#-%J���?�\~��)i����@9��:��e��<���7R������u�\y}=g��������7����e���C�v��I3��h�k���J%3��!�T����l��3��[�&��d�z��i}��;{�����\���7e@@@@@ �p
y,n�����w���0
��.$md��R���L=|���8X��]�|���N�y���`�Z������3'5�Z�x�|���/�5o_���	:,=����������[:�/]�,h}�&�v����7k~���f�����J�j��|�r���X�S5w����6k��e���e����n�����\��S2�����O���h��E���)[���Kr�ZW��\(?�L2f�`W�y��o��������_�D���vm������U_#-�sd��:z.���d��k�����!c�����%�H��5$[����u*��z����7��L�{v
�?��6������z��FZ�N-����d_?:�?s;����_M�(��zf��	��j��o;��m�A=���O����.�k���P�r@@@@@ ep�s�n����mwy��m[�[�Kr���cP��:f
�$�m�=rT��H~1S����+������]��f���_�!����*?����<��#f}���:����E����S���w��o=U�H�^�tsBJ=������U�=��������� �����!P`��FG�j���?�Z�g]�������?�/?��)a�]pQi�`#I(�?�����Q����������n���k�������H7z���L�}vt��os=�)�F�f���������M[�C��S��{~���U=e�/�kb��5���7z�5]v��{���J���������7�Gz{��R������{��3�����L���%��nf^X����wND���<�[�@@@@@���h��&||�����}�vbF�%uMFw���uz�f
�j�/���	�.q��4#f�L�9`X�?�k��v�:��E���j(����R�P����{�)lM����/<��s���}���7bi����!�������;P�:�Ug�^=��������]��X��M�����O?(W��i�����'����p�
2�5���u�\Q�rOP���oC�����n�Z����Nf��h��C����������w�^{�L��L.�r�e��Q��%�Bl���%�T���:@t�h��H�.ot�w#~��K���c~�<������:i�����0�w���oXT�[�0�i�f��� -��pJ=3���h�6]^�BE
9��c=����0�F����q
@@@@@ p
v4�}��#��8;%����S�� d��2S�*\P���6Q�� 'u:�
f�����[���T�����?};�}(�?����T�c��5��l���_"�������?���&dV^|�y)Z��uZC�>���=�]�'tZY�O���g����4���W�|������Z����[����������o��i����s#���E���p����g���:����v����G�F�gZ��|���O;�kV���� g�����������Uw�i�#-�)u5��i�O�<�T�C�u�{����b3g��5����L'������}�4~��ueM�����v'��\�^���PE-��x�k���V�"<q�s���;������FJ=3����(�6��)�G:=�]tZ�}��J�I�.���W@@@@@������&��0DC-���~�����J��m��]�S�eM����_�tqk��d��%�S���}�S��I�IDAT3���HB�?�O����L�ky��3����k�������S��]���v�4~�L6����P0�hx������I
�t4����������I-����n�:{���j;u��3/�0�������F����L{l��L�l��5�P�Y�����������s�{�b}���b��
��`�?2��y����]�<x��\�9������]��$���]������(�.�� ��T��d��IJ�)aM�^�T1���E�E�%������u�����u��u��.7��?�Q�����?��^����<��CV�dJ=3���`�u�(y�����v9v��tj����^��'���S9; � � � � ��3\CM����_�������[<i�����w�{�����uH���mz���/���fd�����3�s�?hJ���`F?����{��q��q�iF�>t�s���������md��r��	k����6��<�75j �6��9��uZ���H��}RS��m�Q��4m��i�H����+���;���&Y��������.[�>d�>������o>� 3~�����k�j�i����Goj�Z�W5�/�4i����s��
8��
�w��������9rf���h~�������8��Lk�(Nwxl��kJ������:x~L�S��^�&�&'�6%��wB4n��j������B�x�>������O���a@@@@�[\�u�������45��^������g�g?�7�z��T�[��v��)2i�g?9���y�G���/p��u#P8�og�k���)�u4�]v�]�q���1�7�2�8���&�]k����h��}R[��m�?�������.)��{]n���������FyH�����@B(a�}{�[Ss�Sn���f��U^���3Bv���e�����c��}��)k�����'C>�w���)_�
]���Z'��>���^#
p�b�|���L�<�L>3�4����t��Nr������fJ�)fj����p���Z< �/����U+V�����X6R���'D����R�����x���)3��/&�����~\; � � � � �@�
���K�n��$�`�5�*������K.���|��q����%���_��:>kWk�A�i�?�~���f$�����_�����iQ������m:�`M�k���_~�oF��w=��^k)E�v��h��o8���������k�j��.���}���\]���W�v�,��j���f��AJ�+)�n��1k����uu���ouN���@F��������O;O��a��^���������_�M���������]��-��C_}�!��r%��l3�����b�	p����A�����c��
����*�>��@�9|Dz�u��5\����V(c�����9�\�:��n]{|����p�e�g��}�g���xf��	���75?�q���������7���~�pO4�@@@@@�� �5��Z\���}^6�5�����2�W�oFjp������n����K�xK6�����G����+�V����8��4�i��?RX�V
]��og�W�����r�cw�����O[#��Z��7tj��fZ�������x~�����[����6�Fj_]�
��f��<��v�����g6Rc��M+`��}��3�5kV�%��A3Z��-��O3��)3����7X�pS������2a�D�:y��k��CM�;��M�4=`0��^����ko������L�<�S���T��R��r��l~���D��T�{���P{�j�N��i�f��nD�f��AZvx��#���V��o~��Ww4�lm~���V]��CG�?��BC��f�a
�u���M��\u�t��}�kW���v�����l\����y}�tIE���R�DQ���nmz�N�n�s���������2k�7J���CG�N��/��	p�B�#� � � � ��)@�k�-�������_�LY�c�Nk4���;d��=�T��Fd�|y�)�k]U�3�����?�<����5m�,���sM���=ujZ-������rF<�h>][�?j1P=���f�Kv��{�5��P��.��z�����n��U������w@f�u^�l�*[7o��L���a�2�E��t���-���Do�J,�������j�����K���Xu4�]�����Y��v��m�����y���t�`�����2!r�r9d���fr
��ECG-\��w�KV��k�V���K
����K�KkT���z�����yu��@E��:='�^v9������g�n���y�n��B��fsus��_���}���6���J�)!������cc>+sf������tzi����a������/E��Z�YO0�v�����l�&zt}V}���f$�����,��|	�������]���W/_-[���������p����J���z���o{��s�����Y�7k����3�/���g2��sY��c�p�����;#�{���c�.w��6 � � � � �@� �5��Z\}DtD��,�����O��j�������>n{�Y;�'���mw�"����\������~r�#�x���[�����af�����4��3�WC���[zB��H�)���v�2f�(w?��3��>�UCf����%�����kjp�}��u�\y}=�tJ,n�z��i��N�l�<uJ>1a���+��!�#�4K4�1������n0u&��*75j`Fb_�������N�{4��X\���&7�U
�t�;r������?����zrs��<��sc�
�����(���{]c��b����T�T^x�>���������{��)�-����'x�8��~�L7��2���aLR�����4�� � � � � ��\���\!v��WZ�Qf�e�e������cd���-5���&6��po��ru`����8�O�P���������.����x�1���u��-����97j��!�K�/�O[�v�:k*e������u��	�2�FU��q�/]�L��:
�]�\]K�w�sH�h��Z#-/t|N��,j]~��	��m���e������>����l���>'�{t[���)�uMT�|��HY��b{7�W
���V�cF�GZt���O�U�d���
���{��^���=m���m�p/�}W���
Os���������	_N���w��v�t�z^�8��i@�K=��G/vz�Z��>�k9�2#R
@�o~�D�n�:q��#G�E��n����P��:��s����s��
\��i�K�.�-��~�?����t���X�k5��z���-�h�X�W8��<"D.B@@@@HQ\��j��N������68h�m�d��UZ�~�	�4����K�U���j�
�)A+^r�$L���d�w9~��	�VYSt�
��e�*M���U�h����#���m;�G3�VG��*[R��^�cF�~7�{3Rlf�[C�k�Pc��u��d������������f�����X���kF���&�)5�W�����Wa���rH5�U7������zu$��]{������Q��p%P��#�pj{u�����=�m��}�2������L��Q~���������ok�l�����4�L��S��K.35��'Z�HY0g�h�i��:z��y7��(C�8=u$�ihY�<�U��l�RP
�����:�~��
&$^"�L�^�4�{1St�:��w��������5��FX��������o����Y��u�;����������~��+���^Hp�
���'�uxu:�H����#]�V��}���A�J�OC��f
b��(S�t���������������NZG�V7���k�������i���N]�|�r�g���E�)���-:Y`���U���i���v���N�g&�w�>'���7k��<���{&X�n�n�������.���� � � � � �zpM_<���&���"��m�v����$����Z��N�4��!�>���G&��mC&3*8�P5�{��`����+�������z��H�`��G�i��#C5��0Y�M��B#-��W�� *�Y��o}�H���.�6�(L
�b)x:p(�[C��t�B	�!cFS�Ak�e�Z�!+rRH��D����<K�^3�7������G�q��F����9�3c������<bzf��qo��UCt]k8O�<���E��Nn�@�����y��?��}�ah�����oJ<3���k@@@@@��������m��y����6���6
 � � � � � � � �=#rU��rS�f�c�Q���T�:���/�K���@@@@@@����O�53/�^�Z�V��M����n��U��_j���uR � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6p�f���@@@@@@@ p���h2 � � � � � ��M����|*@@@@@@�C�8�4�� � � � � � �iS�7m�+�
@@@@@@�P�7;�&#� � � � � � �@� �M����B@@@@@@�8 ���N�� � � � � � � �6���[�LU�GIEND�B`�
#69Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#68)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Mon, Dec 22, 2025 at 9:57 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Alexander,

Thanks for your feedback!

I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
the mode parameter. Should we allow this?

I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.

Flush mode on the primary seems useful when synchronous_commit is set
to off [1]. In that mode, a transaction on the primary may return
success before its WAL is durably flushed to disk, trading durability
for lower latency. A “wait for primary flush” operation provides an
explicit durability barrier for cases where applications or tools
occasionally need stronger guarantees.

[1] https://postgresqlco.nf/doc/en/param/synchronous_commit/

If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?
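
For illustration, the unified 'flush' mode described above can be
sketched as follows. This is a minimal standalone sketch, not code from
the patch: SketchWaitType, SKETCH_WAIT_INVALID and
sketch_resolve_flush_mode() are hypothetical names, and the in_recovery
argument stands in for the result of RecoveryInProgress() sampled when
the command starts.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative stand-in for the wait types named in this thread.
 * SKETCH_WAIT_INVALID is a hypothetical sentinel added for the example.
 */
typedef enum
{
    SKETCH_WAIT_STANDBY_FLUSH,  /* models WAIT_LSN_TYPE_STANDBY_FLUSH */
    SKETCH_WAIT_PRIMARY_FLUSH,  /* models WAIT_LSN_TYPE_PRIMARY_FLUSH */
    SKETCH_WAIT_INVALID
} SketchWaitType;

/*
 * Map a unified 'flush' mode to a wait type.  in_recovery stands in for
 * the recovery state observed once, when the command starts.
 */
SketchWaitType
sketch_resolve_flush_mode(const char *mode, bool in_recovery)
{
    if (strcmp(mode, "flush") != 0)
        return SKETCH_WAIT_INVALID;

    return in_recovery ? SKETCH_WAIT_STANDBY_FLUSH
                       : SKETCH_WAIT_PRIMARY_FLUSH;
}
```

Note that the wait type is fixed at resolution time in this sketch, so
a promotion that happens afterwards leaves the session holding a
standby wait type on a server that is now a primary. That is exactly
the case the question above is about.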

Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs, because either the requested LSN type does not exist
on the primary, or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command and the wake-up calls for flush on the primary are
introduced, the question becomes whether we should still abort the
wait on promotion, or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we had better stop waiting and report the error. In this case, I
am inclined to break the unified flush mode into something like
primary_flush/standby_flush modes, mapped to
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.

After further consideration, it also seems reasonable to use a single,
unified flush mode that works on both primary and standby servers,
provided its semantics are clearly documented to avoid the potential
confusion on failure. I don’t have a strong preference between these
two and would be interested in your thoughts.

If a standby is promoted while a session is waiting, the command
should abort and return an error (or report “not in recovery” when
using NO_THROW). At that point, the meaning of the LSN being waited
for may have changed due to the timeline switch and the transition
from standby to primary. An LSN such as 0/5000000 on TLI 2 can
represent entirely different WAL content from 0/5000000 on TLI 1.
Allowing the wait to silently continue across promotion risks giving
users a false sense of safety, for example by interpreting “wait
completed” as “the original data is now durable,” which would no
longer be true.
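
The timeline argument above can be made concrete with a small sketch.
It is a simplified standalone model, not the patch's logic: the
SKETCH_LSN macro, SketchWaitOutcome and sketch_check_wait() are
hypothetical names, and it assumes the waiter remembers the timeline ID
it started on and treats any timeline change as a reason to stop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a 64-bit LSN from the two halves of the textual "hi/lo" form. */
#define SKETCH_LSN(hi, lo) (((uint64_t) (hi) << 32) | (uint64_t) (lo))

typedef enum
{
    SKETCH_WAIT_PENDING,          /* target not reached yet, keep waiting */
    SKETCH_WAIT_REACHED,          /* target reached on the original timeline */
    SKETCH_WAIT_TIMELINE_CHANGED  /* same LSN may now mean different WAL */
} SketchWaitOutcome;

/*
 * Decide what a waiter should do given the timeline it started on and
 * the server's current timeline and position.  The LSN comparison is
 * only trusted while the timeline is unchanged.
 */
SketchWaitOutcome
sketch_check_wait(uint32_t start_tli, uint64_t target_lsn,
                  uint32_t current_tli, uint64_t current_lsn)
{
    if (current_tli != start_tli)
        return SKETCH_WAIT_TIMELINE_CHANGED;

    return (current_lsn >= target_lsn) ? SKETCH_WAIT_REACHED
                                       : SKETCH_WAIT_PENDING;
}
```

With this model, a wait for 0/5000000 started on TLI 1 is not reported
as reached when the server is at 0/6000000 on TLI 2, even though the
numeric comparison alone would succeed.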

Agreed, but there is still a risk that promotion happens after the
user sends the query but before we start to wait. In this case we
would still silently start to wait on the primary, while the user
probably meant to wait on the replica. Would it perhaps be safer to
have separate user-visible modes for waiting on the primary and on
the replica?
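
To illustrate why separate modes close that window, here is a minimal
standalone sketch. It assumes the primary_flush/standby_flush mode
names floated earlier in the thread; SketchWaitMode, SketchStartResult
and sketch_check_role_at_start() are hypothetical names, and
in_recovery again stands in for the recovery state observed when the
wait actually begins.

```c
#include <stdbool.h>

/* Hypothetical user-visible modes, one per server role. */
typedef enum
{
    SKETCH_MODE_STANDBY_FLUSH,
    SKETCH_MODE_PRIMARY_FLUSH
} SketchWaitMode;

typedef enum
{
    SKETCH_START_OK,
    SKETCH_START_ERR_NOT_IN_RECOVERY,  /* standby mode used on a primary */
    SKETCH_START_ERR_IN_RECOVERY       /* primary mode used on a standby */
} SketchStartResult;

/*
 * Check the requested mode against the server role at the moment the
 * wait begins.  Because the user named the role explicitly, a
 * promotion between sending the query and starting the wait becomes
 * an error instead of a silent wait on the wrong kind of server.
 */
SketchStartResult
sketch_check_role_at_start(SketchWaitMode mode, bool in_recovery)
{
    if (mode == SKETCH_MODE_STANDBY_FLUSH && !in_recovery)
        return SKETCH_START_ERR_NOT_IN_RECOVERY;
    if (mode == SKETCH_MODE_PRIMARY_FLUSH && in_recovery)
        return SKETCH_START_ERR_IN_RECOVERY;

    return SKETCH_START_OK;
}
```

A unified 'flush' mode cannot make this check, since it carries no
statement of which role the user expected.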

------
Regards,
Alexander Korotkov
Supabase

#70Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#69)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

On Thu, Dec 25, 2025 at 7:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Mon, Dec 22, 2025 at 9:57 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Alexander,

Thanks for your feedback!

I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
the mode parameter. Should we allow this?

I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.

Flush mode on the primary seems useful when synchronous_commit is set
to off [1]. In that mode, a transaction on the primary may return
success before its WAL is durably flushed to disk, trading durability
for lower latency. A “wait for primary flush” operation provides an
explicit durability barrier for cases where applications or tools
occasionally need stronger guarantees.

[1] https://postgresqlco.nf/doc/en/param/synchronous_commit/

If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?

Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs because either the requested LSN type does not exist
on the primary, or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command and the wake-up calls for flush on the primary are
introduced, the question becomes whether we should still abort the
wait on promotion, or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we should stop the wait and report the error. In this case, I am
inclined to split the unified flush mode into something like
primary_flush/standby_flush modes, backed by
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.

After further consideration, it also seems reasonable to use a single,
unified flush mode that works on both primary and standby servers,
provided its semantics are clearly documented to avoid the potential
confusion on failure. I don’t have a strong preference between these
two and would be interested in your thoughts.

If a standby is promoted while a session is waiting, the command
should abort and return an error (or report “not in recovery” when
using NO_THROW). At that point, the meaning of the LSN being waited
for may have changed due to the timeline switch and the transition
from standby to primary. An LSN such as 0/5000000 on TLI 2 can
represent entirely different WAL content from 0/5000000 on TLI 1.
Allowing the wait to silently continue across promotion risks giving
users a false sense of safety—for example, interpreting “wait
completed” as “the original data is now durable,” which would no
longer be true.

Agreed, but there is still a risk that promotion happens after the user
sends the query but before we start to wait. In this case we would still
silently start to wait on the primary, while the user probably meant to
wait on the replica. Perhaps it would be safer to have separate
user-visible modes for waiting on the primary and on the replica?

Thanks for your thoughts. You're right about the race condition. If
promotion happens between query submission and execution, a unified
'flush' mode could silently switch semantics without the user knowing.
Separate modes like 'standby_flush' and 'primary_flush' would make
user intent explicit and catch this case with an error, which is
safer. Do these two terms look reasonable to you, or would you suggest
better names? If they look ok, I plan to update the implementation to
use these two modes.

--
Best,
Xuneng

#71Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#70)
Re: Implement waiting for wal lsn replay: reloaded

On Thu, Dec 25, 2025 at 2:52 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Thanks for your thoughts. You're right about the race condition. If
promotion happens between query submission and execution, a unified
'flush' mode could silently switch semantics without the user knowing.
Separate modes like 'standby_flush' and 'primary_flush' would make
user intent explicit and catch this case with an error, which is
safer. Do these two terms look reasonable to you, or would you suggest
better names? If they look ok, I plan to update the implementation to
use these two modes.

Thank you, Xuneng. 'standby_flush' and 'primary_flush' look good to
me. Please go ahead. I think we should name the other modes
'standby_write' and 'standby_replay' for the sake of consistency.

------
Regards,
Alexander Korotkov
Supabase

#72Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#71)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Fri, Dec 26, 2025 at 12:34 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Thank you, Xuneng. 'standby_flush' and 'primary_flush' look good to
me. Please go ahead. I think we should name the other modes
'standby_write' and 'standby_replay' for the sake of consistency.

Thanks. Yeah, renaming existing modes to 'standby_write' and
'standby_replay' also makes sense to me.
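
To make the agreed naming concrete, here is a minimal standalone C sketch
(not code from the patch set) of how the four mode strings could map to
wait types and be checked against the server's recovery state. The
WaitMode enum, the wait_modes table, and both functions are hypothetical
names used only for illustration. Rejecting 'primary_flush' on a standby
is an assumption; the thread only spells out the case of a standby mode
being used after promotion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's WaitLSNType; illustration only. */
typedef enum
{
	WAIT_MODE_STANDBY_REPLAY,
	WAIT_MODE_STANDBY_WRITE,
	WAIT_MODE_STANDBY_FLUSH,
	WAIT_MODE_PRIMARY_FLUSH,
	WAIT_MODE_INVALID
} WaitMode;

typedef struct
{
	const char *name;			/* value accepted by the MODE option */
	WaitMode	mode;
	bool		needs_recovery; /* true: only valid while in recovery */
} WaitModeEntry;

static const WaitModeEntry wait_modes[] = {
	{"standby_replay", WAIT_MODE_STANDBY_REPLAY, true},
	{"standby_write", WAIT_MODE_STANDBY_WRITE, true},
	{"standby_flush", WAIT_MODE_STANDBY_FLUSH, true},
	/* Assumption: primary_flush is rejected while in recovery. */
	{"primary_flush", WAIT_MODE_PRIMARY_FLUSH, false},
};

#define N_WAIT_MODES (sizeof(wait_modes) / sizeof(wait_modes[0]))

/* Map a MODE string to a WaitMode; WAIT_MODE_INVALID if unrecognized. */
WaitMode
parse_wait_mode(const char *mode_str)
{
	for (size_t i = 0; i < N_WAIT_MODES; i++)
	{
		if (strcmp(mode_str, wait_modes[i].name) == 0)
			return wait_modes[i].mode;
	}
	return WAIT_MODE_INVALID;
}

/* Return true if the mode matches the server's current recovery state. */
bool
wait_mode_allowed(WaitMode mode, bool in_recovery)
{
	for (size_t i = 0; i < N_WAIT_MODES; i++)
	{
		if (wait_modes[i].mode == mode)
			return wait_modes[i].needs_recovery == in_recovery;
	}
	return false;
}
```

With this shape, a 'standby_flush' request that arrives just after
promotion is rejected instead of silently turning into a wait on the
primary, and the old unified 'flush' spelling is simply unrecognized.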

--
Best,
Xuneng

#73Chao Li
li.evan.chao@gmail.com
In reply to: Xuneng Zhou (#65)
Re: Implement waiting for wal lsn replay: reloaded

On Dec 19, 2025, at 10:49, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.

This is exactly what I meant. Please, go ahead.

Here is the updated patch set (v7). Please check.

--
Best,
Xuneng
<v7-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch><v7-0004-Use-WAIT-FOR-LSN-in.patch><v7-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch><v7-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch>

Hi Xuneng,

A solid patch! Just a few small comments:

1 - 0001
```
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
```

Since you add pg_unreachable() in the new function GetCurrentLSNForWaitType(), I wonder whether we should just use an Assert() instead. Every existing related function already does such an assert; for example, addLSNWaiter() does “Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT);”. I guess we can just follow the current mechanism to verify lsnType. So, for GetCurrentLSNForWaitType(), we can just add a default clause with Assert(false).

2 - 0002
```
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+								"MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

3 - 0002
```
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
```

currentLSN is only used in the if (PromoteIsTriggered()) branch, so it can be declared inside that branch.

3 - 0002
```
+	/*
+	 * If we wrote an LSN that someone was waiting for then walk over the
+	 * shared memory array and set latches to notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
```

Do we need to mention “walk over the shared memory array and set latches” in the comment? That logic belongs to WaitLSNWakeup(). If the wake-up logic changes in the future, this comment would become stale. So I think we only need to mention “notify the waiters”.

4 - 0003
```
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.  mode takes a string value ('replay', 'write', 'flush'),
+	 * timeout takes a string value, no_throw takes no value.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
```

The new comment drops the point that “We don’t offer completions for these values” (timeout and no_throw). To be explicit, I feel we should retain that sentence.

5 - 0004
```
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
```

I wonder why the pg_is_in_recovery() call is pulled up to an earlier place and made unconditional?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#74Xuneng Zhou
xunengzhou@gmail.com
In reply to: Chao Li (#73)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi Chao,

Thanks a lot for your review!

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

Hi Xuneng,

A solid patch! Just a few small comments:

1 - 0001
```
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+       switch (lsnType)
+       {
+               case WAIT_LSN_TYPE_STANDBY_REPLAY:
+                       return GetXLogReplayRecPtr(NULL);
+
+               case WAIT_LSN_TYPE_STANDBY_WRITE:
+                       return GetWalRcvWriteRecPtr();
+
+               case WAIT_LSN_TYPE_STANDBY_FLUSH:
+                       return GetWalRcvFlushRecPtr(NULL, NULL);
+
+               case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+                       return GetFlushRecPtr(NULL);
+       }
+
+       elog(ERROR, "invalid LSN wait type: %d", lsnType);
+       pg_unreachable();
+}
```

Since you add pg_unreachable() in the new function GetCurrentLSNForWaitType(), I wonder whether we should just use an Assert() instead. Every existing related function already does such an assert; for example, addLSNWaiter() does “Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT);”. I guess we can just follow the current mechanism to verify lsnType. So, for GetCurrentLSNForWaitType(), we can just add a default clause with Assert(false).

My take is that Assert(false) alone might not be enough here, since
assertions vanish in non-assert builds. An unexpected lsnType is a
real bug even in production, so keeping a hard error plus
pg_unreachable() seems to be a safer pattern. It also acts as a
guardrail for future extensions — if new wait types are added without
updating this code, we’ll fail loudly rather than silently returning
an incorrect LSN. Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT) was added
to the top of the function.
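
A small standalone C sketch of the trade-off, with hypothetical names
throughout (SketchAssert, report_invalid_type, and current_lsn_for_type
are not from the patch): the assertion is compiled out unless
USE_ASSERT_CHECKING is defined, so in a production-style build only the
code after the switch still catches an out-of-range value. In PostgreSQL
the fallback would be elog(ERROR, ...), which does not return; here it
only counts the error so that the behavior can be observed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

#define INVALID_LSN ((Lsn) 0)
#define WAIT_TYPE_COUNT 3

typedef enum
{
	WAIT_TYPE_REPLAY,
	WAIT_TYPE_WRITE,
	WAIT_TYPE_FLUSH
} WaitType;

/* Debug-only check, mirroring how Assert() vanishes in non-assert builds. */
#ifdef USE_ASSERT_CHECKING
#define SketchAssert(cond) assert(cond)
#else
#define SketchAssert(cond) ((void) 0)
#endif

/* Hypothetical error hook; counts errors instead of aborting the query. */
static int	invalid_type_errors = 0;

static void
report_invalid_type(int type)
{
	(void) type;
	invalid_type_errors++;
}

/* Return the current position for the given wait type. */
Lsn
current_lsn_for_type(WaitType type, const Lsn *positions)
{
	SketchAssert((int) type >= 0 && (int) type < WAIT_TYPE_COUNT);

	switch (type)
	{
		case WAIT_TYPE_REPLAY:
			return positions[WAIT_TYPE_REPLAY];
		case WAIT_TYPE_WRITE:
			return positions[WAIT_TYPE_WRITE];
		case WAIT_TYPE_FLUSH:
			return positions[WAIT_TYPE_FLUSH];
	}

	/* Reached only for a value outside the enum: fail loudly in any build. */
	report_invalid_type((int) type);
	return INVALID_LSN;
}
```

Leaving out a default clause also lets the compiler warn when a new enum
value is added without a matching case.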

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

3 - 0002
```
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				if (PromoteIsTriggered())
 				{
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
```

currentLSN is only used in the if (PromoteIsTriggered()) branch, so it can be declared inside that branch.

+1.

3 - 0002
```
+       /*
+        * If we wrote an LSN that someone was waiting for then walk over the
+        * shared memory array and set latches to notify the waiters.
+        */
+       if (waitLSNState &&
+               (LogstreamResult.Write >=
+                pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+               WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
```

Do we need to mention “walk over the shared memory array and set latches” in the comment? That logic belongs to WaitLSNWakeup(). If the wake-up logic changes in the future, this comment would become stale. So I think we only need to mention “notify the waiters”.

This makes sense to me. The change is incorporated into v8.
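
For reference, the fast-path check that this comment sits on can be
sketched in standalone C11 as below. This is an illustration, not the
walreceiver code: WakeState, wake_waiters, and notify_if_reached are
hypothetical names, and treating UINT64_MAX as "no waiters" is an
assumption of the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAKE_TYPE_COUNT 3

enum
{
	WAKE_TYPE_REPLAY,
	WAKE_TYPE_WRITE,
	WAKE_TYPE_FLUSH
};

typedef struct
{
	/* Cached least-awaited LSN per type; UINT64_MAX means no waiters. */
	_Atomic uint64_t min_waited_lsn[WAKE_TYPE_COUNT];
	/* Instrumentation for the sketch: how often the slow path ran. */
	int			wakeup_calls[WAKE_TYPE_COUNT];
} WakeState;

/*
 * Slow path.  The real implementation decides how waiters are found and
 * notified; the caller does not need to know those details.
 */
static void
wake_waiters(WakeState *state, int type, uint64_t reached)
{
	(void) reached;
	state->wakeup_calls[type]++;
}

/*
 * Notify waiters of the given type if the reached LSN covers the least
 * awaited one.  Returns true if the slow path was taken.
 */
bool
notify_if_reached(WakeState *state, int type, uint64_t reached)
{
	if (state == NULL)
		return false;			/* shared state not set up yet */

	/* Fast path: one atomic read, no heap access, when nobody is due. */
	if (reached < atomic_load(&state->min_waited_lsn[type]))
		return false;

	wake_waiters(state, type, reached);
	return true;
}
```

Keeping the caller's comment at the level of "notify the waiters" matches
this split: the caller only owns the cheap comparison.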

4 - 0003
```
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.  mode takes a string value ('replay', 'write', 'flush'),
+	 * timeout takes a string value, no_throw takes no value.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
```

The new comment drops the point that “We don’t offer completions for these values” (timeout and no_throw). To be explicit, I feel we should retain that sentence.

The sentence is retained.

5 - 0004
```
+	my $isrecovery =
+	  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+	chomp($isrecovery);
 	croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
 	}
 	if (!defined($target_lsn))
 	{
-		my $isrecovery =
-		  $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-		chomp($isrecovery);
```

I wonder why the pg_is_in_recovery() call is pulled up to an earlier place and made unconditional?

This seems unnecessary. I also realized that my earlier approach in
patch 4 may have been semantically incorrect: it could end up waiting
for the LSN to replay/write/flush on the node itself, rather than on
the downstream standby, which defeats the purpose of
wait_for_catchup(). The updated patch 4 attempts to address this by
running WAIT FOR LSN on the standby itself.

Support for primary-flush waiting and the renaming of the existing
modes have also been incorporated in v8, following Alexander’s
feedback. The major updates are in patches 2 and 4. Please check.

--
Best,
Xuneng

Attachments:

v8-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From 668956d0d0794c489167912d54d4c9c7bb237754 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v8 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE option.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b7ef589fc0..fdb92deac57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..2d81bb1a9a7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..5f4ff50cf38 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..e62054585cb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
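
For readers skimming the diff above, the two small patterns it relies on can be sketched in isolation: a designated-initializer table indexed by wait type (guarded by a static assertion so it cannot drift from the enum), and the ">= cached minimum" fast-path test done before waking waiters. This is only an illustration under stated assumptions: every Demo*/demo_* name below is hypothetical and stands in for the patch's WaitLSNType, WaitLSNWaitEvents, WaitLSNTypeRequiresRecovery() and the minWaitedLSN check; it does not touch shared memory or atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's WaitLSNType values. */
typedef enum DemoWaitType
{
	DEMO_STANDBY_REPLAY,
	DEMO_STANDBY_WRITE,
	DEMO_STANDBY_FLUSH,
	DEMO_PRIMARY_FLUSH
} DemoWaitType;

#define DEMO_WAIT_TYPE_COUNT (DEMO_PRIMARY_FLUSH + 1)

/* Hypothetical stand-ins for the three wait events reported by the patch. */
typedef enum DemoWaitEvent
{
	DEMO_EVENT_WAL_REPLAY,
	DEMO_EVENT_WAL_WRITE,
	DEMO_EVENT_WAL_FLUSH
} DemoWaitEvent;

/* One event per wait type; both flush types share one event, as in the patch. */
static const DemoWaitEvent demo_wait_events[] = {
	[DEMO_STANDBY_REPLAY] = DEMO_EVENT_WAL_REPLAY,
	[DEMO_STANDBY_WRITE] = DEMO_EVENT_WAL_WRITE,
	[DEMO_STANDBY_FLUSH] = DEMO_EVENT_WAL_FLUSH,
	[DEMO_PRIMARY_FLUSH] = DEMO_EVENT_WAL_FLUSH,
};

/* Fails to compile if a wait type is added without extending the table. */
static_assert(sizeof(demo_wait_events) / sizeof(demo_wait_events[0]) ==
			  DEMO_WAIT_TYPE_COUNT,
			  "demo_wait_events must match DemoWaitType");

/* Look up the event to report for a given wait type. */
DemoWaitEvent
demo_wait_event_for(DemoWaitType t)
{
	return demo_wait_events[t];
}

/* Standby wait types are only meaningful while recovery is in progress. */
bool
demo_requires_recovery(DemoWaitType t)
{
	return t == DEMO_STANDBY_REPLAY ||
		t == DEMO_STANDBY_WRITE ||
		t == DEMO_STANDBY_FLUSH;
}

/*
 * Fast path: waiters need attention only once the current LSN has reached
 * the smallest LSN anyone is waiting for (the cached minimum).
 */
bool
demo_should_wake(uint64_t current_lsn, uint64_t min_waited_lsn)
{
	return current_lsn >= min_waited_lsn;
}
```

In this sketch, demo_should_wake(100, 200) is false (nobody is waiting for an LSN that low yet), while demo_should_wake(200, 200) is true, which is the point at which the real code would go on to inspect the heap for that wait type.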

v8-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From de1e5024a8bc757ab2b10daa3c63ab67fbfee0a1 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v8 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option of the
WAIT FOR LSN command. Inside the WITH ( ... ) option list, completion
now offers mode alongside timeout and no_throw. The MODE option selects
which LSN to wait for: replay, write, or flush on a standby, or flush
on a primary.

When MODE is specified, completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 75a101c6ab5..62d87561169 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5371,25 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * completed parenthesized option list as one word, so the test below is
+	 *
+	 * 'mode' takes a string value ('standby_replay', 'standby_write',
+	 * 'standby_flush', 'primary_flush'). 'timeout' takes a string value, and
+	 * 'no_throw' takes no value. We do not offer completions for the *values*
+	 * of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0
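
As a rough illustration of what offering the four mode values amounts to, here is a minimal prefix-matching sketch. It is not psql's completion machinery: demo_wait_modes and demo_complete_mode are hypothetical names, the match is case-sensitive, and the quoting that the patch adds around each value is left out for brevity.

```c
#include <stddef.h>
#include <string.h>

/* The four mode values listed in the commit message above. */
static const char *const demo_wait_modes[] = {
	"standby_replay",
	"standby_write",
	"standby_flush",
	"primary_flush",
};

#define DEMO_WAIT_MODE_COUNT \
	(sizeof(demo_wait_modes) / sizeof(demo_wait_modes[0]))

/*
 * Copy into out[] every mode name that starts with prefix, up to out_cap
 * entries, and return how many were stored.  An empty prefix matches all.
 */
size_t
demo_complete_mode(const char *prefix, const char **out, size_t out_cap)
{
	size_t		prefix_len = strlen(prefix);
	size_t		n = 0;

	for (size_t i = 0; i < DEMO_WAIT_MODE_COUNT && n < out_cap; i++)
	{
		if (strncmp(demo_wait_modes[i], prefix, prefix_len) == 0)
			out[n++] = demo_wait_modes[i];
	}
	return n;
}
```

With this sketch, the prefix "standby_" yields the three standby modes and "primary" yields only primary_flush.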

v8-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patch (application/octet-stream)
From e693ce76edb3bd93078dca152e0b551b871db859 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v8 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..51e5324bff3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v8-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 73513f3aacddb8e9d8762215a6c2000fc21f1589 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v8 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN
- 'standby_write': Wait for WAL to be written (received) to the specified LSN
- 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN
- 'primary_flush': Wait for WAL to be flushed to disk on the primary server

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the parenthesized option-list pattern used by
COPY and EXPLAIN, with the mode given as a string value.

Modes are explicitly named to distinguish between primary and standby operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server)
- Primary mode ('primary_flush') can only be used on a primary server

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

Also includes:
- Documentation updates for the new syntax and mode descriptions
- Test coverage for all four modes including error cases and concurrent waiters
- Wakeup logic in walreceiver for standby write/flush waiters
- Wakeup logic in XLogFlush() and XLogBackgroundFlush() for primary flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             |  96 +++++-
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 5 files changed, 673 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <literal>NO_THROW</literal> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written on a standby
+            server, but not necessarily flushed to disk. This is faster than
+            <literal>standby_flush</literal> but provides weaker durability
+            guarantees since the data may still be in operating system
+            buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will return
+   <literal>not in recovery</literal> (or throw an error if
+   <literal>NO_THROW</literal> is not specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdb92deac57..da96b627228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..a85c3b0de98 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,18 +29,39 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* Mode name: "standby_replay",
+								 * "standby_write", "standby_flush",
+								 * "primary_flush" */
+	const char *verb;			/* Past participle: "replayed", "written",
+								 * "flushed" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"standby_replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"standby_write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"standby_flush", "flushed"},
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = {"primary_flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +69,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for WAIT option \"MODE\": \"%s\"",
+								mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +154,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +187,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +216,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +235,27 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+
 				if (PromoteIsTriggered())
 				{
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..404d348da37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..e41aad45e28 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "MODE": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We make 5
+# background psql sessions, each waiting for a corresponding insertion.  When
+# waiting is finished, a stored function logs whether as many rows are
+# visible as there should be.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

#75Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#74)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Sat, Dec 27, 2025 at 12:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Chao,

Thanks a lot for your review!

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 19, 2025, at 10:49, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.

This is exactly what I meant. Please, go ahead.

Here is the updated patch set (v7). Please check.

--
Best,
Xuneng
<v7-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch><v7-0004-Use-WAIT-FOR-LSN-in.patch><v7-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch><v7-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch>

Hi Xuneng,

A solid patch! Just a few small comments:

1 - 0001
```
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+       switch (lsnType)
+       {
+               case WAIT_LSN_TYPE_STANDBY_REPLAY:
+                       return GetXLogReplayRecPtr(NULL);
+
+               case WAIT_LSN_TYPE_STANDBY_WRITE:
+                       return GetWalRcvWriteRecPtr();
+
+               case WAIT_LSN_TYPE_STANDBY_FLUSH:
+                       return GetWalRcvFlushRecPtr(NULL, NULL);
+
+               case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+                       return GetFlushRecPtr(NULL);
+       }
+
+       elog(ERROR, "invalid LSN wait type: %d", lsnType);
+       pg_unreachable();
+}
```

As you add pg_unreachable() in the new function GetCurrentLSNForWaitType(), I wonder if we should just use an Assert(). Every existing related function already does such an assert; for example, addLSNWaiter() does “Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT);”. I guess we can just follow the current mechanism to verify lsnType. So, for GetCurrentLSNForWaitType(), we can just add a default clause with Assert(false).

My take is that Assert(false) alone might not be enough here, since
assertions vanish in non-assert builds. An unexpected lsnType is a
real bug even in production, so keeping a hard error plus
pg_unreachable() seems to be a safer pattern. It also acts as a
guardrail for future extensions — if new wait types are added without
updating this code, we’ll fail loudly rather than silently returning
an incorrect LSN. Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT) was added
to the top of the function.
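
For illustration, the pattern described above might be sketched as below. This is a minimal standalone sketch, not the patch code: plain assert() and abort() stand in for Assert() and elog(ERROR), the enum and WAIT_LSN_TYPE_COUNT are simplified stand-ins, and current_lsn_for_wait_type() and fake_positions are hypothetical names with made-up values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for the PostgreSQL types discussed above. */
typedef uint64_t XLogRecPtr;

typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_STANDBY_REPLAY = 0,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH
} WaitLSNType;

/* Assumption: the count is a macro, so the switch needs no case for it. */
#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)

/* Made-up positions, for illustration only. */
static const XLogRecPtr fake_positions[WAIT_LSN_TYPE_COUNT] = {100, 300, 200, 400};

static XLogRecPtr
current_lsn_for_wait_type(WaitLSNType lsnType)
{
	/* Assert-enabled builds: catch an out-of-range value early. */
	assert((int) lsnType >= 0 && (int) lsnType < WAIT_LSN_TYPE_COUNT);

	/* No default case, so compilers can warn about an unhandled enum value. */
	switch (lsnType)
	{
		case WAIT_LSN_TYPE_STANDBY_REPLAY:
		case WAIT_LSN_TYPE_STANDBY_WRITE:
		case WAIT_LSN_TYPE_STANDBY_FLUSH:
		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
			return fake_positions[lsnType];
	}

	/* Builds without assertions: still fail loudly instead of returning a bogus LSN. */
	fprintf(stderr, "invalid LSN wait type: %d\n", (int) lsnType);
	abort();
}
```

With NDEBUG defined the assert() compiles away, but the code after the switch still stops execution for an unexpected value, which is the guardrail argued for above.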

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

3 - 0002
```
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+                               const           WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+                               XLogRecPtr      currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
-                                                       errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+                                                       errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
-                                                                         LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+                                                                         desc->verb,
+                                                                         desc->noun,
+                                                                         LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
-                                                       errhint("Waiting for the replay LSN can only be executed during recovery."));
+                                                       errhint("Waiting for the %s LSN can only be executed during recovery.",
+                                                                       desc->noun));
}
```

currentLSN is only used in the if clause, thus it can be defined inside the if clause.

+ 1.

3 - 0002
```
+       /*
+        * If we wrote an LSN that someone was waiting for then walk over the
+        * shared memory array and set latches to notify the waiters.
+        */
+       if (waitLSNState &&
+               (LogstreamResult.Write >=
+                pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+               WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
```

Do we need to mention “walk over the shared memory array and set latches” in the comment? That logic belongs to WaitLSNWakeup(). If the wakeup logic changes in the future, this comment would become stale. So I think we only need to mention “notify the waiters”.

That makes sense to me. These changes are incorporated into v8.
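
As a side note, the fast-path check in the hunk above can be sketched in isolation with C11 atomics. DemoWaitState, demo_wakeup() and demo_notify_if_reached() are hypothetical names, and the wakeup body is a placeholder rather than what WaitLSNWakeup() actually does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one slot of the shared wait state. */
typedef struct DemoWaitState
{
	_Atomic uint64_t min_waited_lsn;	/* UINT64_MAX means "no waiters" */
	int			wakeups;		/* counts simulated wakeup calls */
} DemoWaitState;

/* Placeholder for the real wakeup routine: record the call, clear the cache. */
static void
demo_wakeup(DemoWaitState *st, uint64_t current_lsn)
{
	assert(current_lsn >= atomic_load(&st->min_waited_lsn));
	st->wakeups++;
	atomic_store(&st->min_waited_lsn, UINT64_MAX);
}

/*
 * Cheap check on the writer's hot path: do the expensive wakeup work only
 * if the LSN just written has reached the smallest LSN anyone waits for.
 */
static bool
demo_notify_if_reached(DemoWaitState *st, uint64_t written_lsn)
{
	if (st == NULL)
		return false;			/* wait state not initialized */
	if (written_lsn < atomic_load(&st->min_waited_lsn))
		return false;			/* nobody is waiting for this LSN yet */

	demo_wakeup(st, written_lsn);
	return true;
}
```

The caller only knows that waiters get notified; how that happens stays inside the wakeup routine, which is the point of shortening the comment.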

4 - 0003
```
+       /*
+        * Handle parenthesized option list.  This fires when we're in an
+        * unfinished parenthesized option list.  get_previous_words treats a
+        * completed parenthesized option list as one word, so the above test is
+        * correct.  mode takes a string value ('replay', 'write', 'flush'),
+        * timeout takes a string value, no_throw takes no value.
+        */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
-               /*
-                * This fires if we're in an unfinished parenthesized option list.
-                * get_previous_words treats a completed parenthesized option list as
-                * one word, so the above test is correct.
-                */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-                       COMPLETE_WITH("timeout", "no_throw");
-
-               /*
-                * timeout takes a string value, no_throw takes no value. We don't
-                * offer completions for these values.
-                */
```

The new comment has lost the meaning of “We don’t offer completions for these values (timeout and no_throw)”. To be explicit, I feel we should retain that sentence.

The sentence is retained.

5 - 0004
```
+       my $isrecovery =
+         $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+       chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-               my $isrecovery =
-                 $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-               chomp($isrecovery);
```

I wonder why the pg_is_in_recovery() call was pulled up to an earlier place and made unconditional?

This seems unnecessary. I also realized that my earlier approach in
patch 4 may have been semantically incorrect: it could end up waiting
for the LSN to be replayed/written/flushed on the node itself, rather
than on the downstream standby, which defeats the purpose of
wait_for_catchup(). Patch 4 now attempts to address this by running
WAIT FOR LSN on the standby itself.

Support for primary-flush waiting and the refactoring of existing
modes have also been incorporated in v8, following Alexander’s
feedback. The major updates are in patches 2 and 4. Please check.

Added WaitLSNTypeDesc to typedefs.list in v9 patch 2.

--
Best,
Xuneng

Attachments:

v9-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From 668956d0d0794c489167912d54d4c9c7bb237754 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v9 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b7ef589fc0..fdb92deac57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..2d81bb1a9a7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..5f4ff50cf38 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..e62054585cb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0

v9-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From dd0d980c031271eeada1c6f8dfc40ff7d58ced09 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v9 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords. The MODE option controls whether
the wait is evaluated from the standby or primary perspective.

When MODE is specified, completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 75a101c6ab5..62d87561169 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5371,25 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.
+	 *
+	 * 'mode' takes a string value ('standby_replay', 'standby_write',
+	 * 'standby_flush', 'primary_flush'). 'timeout' takes a string value, and
+	 * 'no_throw' takes no value. We do not offer completions for the *values*
+	 * of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v9-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From aeb6a6a80e04cd591979181f23afb3584ef85f4d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v9 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN
- 'standby_write': Wait for WAL to be written (received) to the specified LSN
- 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN
- 'primary_flush': Wait for WAL to be flushed to disk on the primary server

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server)
- Primary mode ('primary_flush') can only be used on a primary server

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

Also includes:
- Documentation updates for the new syntax and mode descriptions
- Test coverage for all four modes including error cases and concurrent waiters
- Wakeup logic in walreceiver for standby write/flush waiters
- Wakeup logic in WAL writer for primary flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             |  96 +++++-
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 src/tools/pgindent/typedefs.list        |   1 +
 6 files changed, 674 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <parameter>NO_THROW</parameter> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written to disk on a
+            standby server, but not yet flushed. This is faster than
+            <literal>standby_flush</literal> but provides weaker durability
+            guarantees since the data may still be in operating system
+            buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will return
+   <literal>not in recovery</literal> (or throw an error if
+   <literal>NO_THROW</literal> is not specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>LSN</acronym> of
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>LSN</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   the <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdb92deac57..da96b627228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..a85c3b0de98 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,18 +29,39 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* Mode name: "standby_replay",
+								 * "standby_write", "standby_flush",
+								 * "primary_flush" */
+	const char *verb;			/* Past participle: "replayed", "written",
+								 * "flushed" */
+}			WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"standby_replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"standby_write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"standby_flush", "flushed"},
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = {"primary_flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +69,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for WAIT option \"MODE\": \"%s\"",
+								mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +154,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +187,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +216,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +235,27 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const		WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+
 				if (PromoteIsTriggered())
 				{
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..404d348da37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..e41aad45e28 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "MODE": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We start 5
+# background psql sessions, each waiting for a corresponding insertion.  When
+# the wait finishes, a stored function logs whether as many rows are visible
+# as expected.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..ab7149c5e62 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3305,6 +3305,7 @@ WaitLSNProcInfo
 WaitLSNResult
 WaitLSNState
 WaitLSNType
+WaitLSNTypeDesc
 WaitPMResult
 WaitStmt
 WalCloseMethod
-- 
2.51.0
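
For readers skimming the patch above: the same guard is added in XLogFlush(), XLogBackgroundFlush(), XLogWalRcvWrite() and XLogWalRcvFlush(). Each one compares the LSN just reached against the per-type minimum waited LSN and calls WaitLSNWakeup() only if some waiter could be satisfied. Below is a minimal standalone sketch of that fast-path check. It is not the PostgreSQL code: the Sketch* names are hypothetical, C11 atomics stand in for pg_atomic_*, and the "no waiters means maximum value" convention is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's WaitLSNType values. */
typedef enum
{
	SKETCH_STANDBY_REPLAY,
	SKETCH_STANDBY_WRITE,
	SKETCH_STANDBY_FLUSH,
	SKETCH_PRIMARY_FLUSH,
	SKETCH_NUM_TYPES
} SketchWaitType;

/* Simplified model of the shared state: one minimum waited LSN per type. */
typedef struct
{
	_Atomic uint64_t min_waited_lsn[SKETCH_NUM_TYPES];
} SketchWaitState;

/*
 * Assumption of this sketch: with no waiters registered, the minimum is
 * set to the largest possible value, so no LSN can ever reach it.
 */
static void
sketch_init(SketchWaitState *state)
{
	for (int i = 0; i < SKETCH_NUM_TYPES; i++)
		atomic_store(&state->min_waited_lsn[i], UINT64_MAX);
}

/*
 * Return true when the LSN just reached could satisfy at least one waiter
 * of the given type.  A NULL state models shared memory that has not been
 * set up yet, mirroring the "waitLSNState &&" test in the patch.
 */
static bool
sketch_should_wake(SketchWaitState *state, SketchWaitType type,
				   uint64_t current_lsn)
{
	assert((int) type >= 0 && type < SKETCH_NUM_TYPES);

	if (state == NULL)
		return false;

	return current_lsn >= atomic_load(&state->min_waited_lsn[type]);
}
```

The point of the check is that the common case (nobody is waiting) costs one atomic read on the flush or write path, and the heavier wakeup routine is only entered when it can actually release a waiter.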

v9-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patch (application/octet-stream)
From c35ad3714611db108f1726870d04e6f8b39edc77 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v9 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)
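
The three conditions in the commit message above can be read as a small decision function. The sketch below restates them in C only to make the fallback rule explicit; the real logic is Perl in Cluster.pm. The function name is hypothetical, and the mapping from catchup mode to the standby_* mode names is an assumption based on the mode names introduced in patch 3/4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the WAIT FOR LSN mode to use for a given wait_for_catchup() mode,
 * or NULL when the caller should fall back to polling pg_stat_replication
 * on the upstream.
 *
 * standby_is_node:     standby was passed as a Cluster object, not a name
 * standby_in_recovery: standby has not been promoted
 */
static const char *
sketch_wait_mode_for_catchup(bool standby_is_node, bool standby_in_recovery,
							 const char *mode)
{
	assert(mode != NULL);

	/* Without a node object we cannot connect to the standby at all. */
	if (!standby_is_node || !standby_in_recovery)
		return NULL;

	if (strcmp(mode, "replay") == 0)
		return "standby_replay";
	if (strcmp(mode, "write") == 0)
		return "standby_write";
	if (strcmp(mode, "flush") == 0)
		return "standby_flush";

	/* 'sent' has no counterpart on the standby side: keep polling. */
	return NULL;
}
```

Read this way, 'sent' is the only catchup mode that always polls, since it describes the upstream's view of the connection rather than anything the standby can wait on locally.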

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..51e5324bff3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#76Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#75)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 30, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sat, Dec 27, 2025 at 12:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Chao,

Thanks a lot for your review!

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 19, 2025, at 10:49, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.

Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?

Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:

-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');

-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');

-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');

If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.
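
For illustration, the defaulting rule described above could be sketched in
standalone C roughly as follows. This is only a sketch, not code from the
patch: DemoWaitMode and demo_parse_wait_mode() are hypothetical names, the
mode strings are the ones proposed in this message, and exact-case matching
with strcmp() is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum for this sketch; the patch has its own wait-type enum. */
typedef enum DemoWaitMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoWaitMode;

/*
 * Translate the MODE option value into an enum.  A NULL mode_str means the
 * option was not given, so fall back to replay.  Returns false for an
 * unrecognized value and leaves *mode untouched.
 */
bool
demo_parse_wait_mode(const char *mode_str, DemoWaitMode *mode)
{
	assert(mode != NULL);

	if (mode_str == NULL || strcmp(mode_str, "replay") == 0)
		*mode = DEMO_MODE_REPLAY;
	else if (strcmp(mode_str, "write") == 0)
		*mode = DEMO_MODE_WRITE;
	else if (strcmp(mode_str, "flush") == 0)
		*mode = DEMO_MODE_FLUSH;
	else
		return false;

	return true;
}
```

A caller would pass NULL when the WITH clause carries no MODE option and
would report an error to the user when the function returns false.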

This is exactly what I meant. Please, go ahead.

Here is the updated patch set (v7). Please check.

--
Best,
Xuneng
<v7-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch><v7-0004-Use-WAIT-FOR-LSN-in.patch><v7-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch><v7-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch>

Hi Xuneng,

A solid patch! Just a few small comments:

1 - 0001
```
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+       switch (lsnType)
+       {
+               case WAIT_LSN_TYPE_STANDBY_REPLAY:
+                       return GetXLogReplayRecPtr(NULL);
+
+               case WAIT_LSN_TYPE_STANDBY_WRITE:
+                       return GetWalRcvWriteRecPtr();
+
+               case WAIT_LSN_TYPE_STANDBY_FLUSH:
+                       return GetWalRcvFlushRecPtr(NULL, NULL);
+
+               case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+                       return GetFlushRecPtr(NULL);
+       }
+
+       elog(ERROR, "invalid LSN wait type: %d", lsnType);
+       pg_unreachable();
+}
```

Since you add pg_unreachable() in the new function GetCurrentLSNForWaitType(), I wonder whether we should just use an Assert() instead. Every existing related function already does such an assert; for example, addLSNWaiter() does “Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT);”. I guess we can just follow the current mechanism to verify lsnType. So, for GetCurrentLSNForWaitType(), we can just add a default clause with Assert(false).

My take is that Assert(false) alone might not be enough here, since
assertions vanish in non-assert builds. An unexpected lsnType is a
real bug even in production, so keeping a hard error plus
pg_unreachable() seems to be a safer pattern. It also acts as a
guardrail for future extensions — if new wait types are added without
updating this code, we’ll fail loudly rather than silently returning
an incorrect LSN. Assert(i >= 0 && i < WAIT_LSN_TYPE_COUNT) was added
to the top of the function.
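
To make the trade-off concrete, here is a standalone C sketch of the pattern
discussed above: a debug-only range assertion, a switch without a default
clause, and a hard error after the switch. It is not the patch code. The
demo_lsn table and demo_current_lsn() are hypothetical stand-ins for the
real LSN getters, fprintf() plus abort() stands in for elog(ERROR) followed
by pg_unreachable(), and defining WAIT_LSN_TYPE_COUNT as a macro is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;

typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH
} WaitLSNType;

/* Assumption: the count is kept outside the enum so the switch stays exhaustive. */
#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)

/* Hypothetical positions, standing in for GetXLogReplayRecPtr() and friends. */
static const XLogRecPtr demo_lsn[WAIT_LSN_TYPE_COUNT] = {100, 300, 200, 400};

XLogRecPtr
demo_current_lsn(WaitLSNType lsnType)
{
	/* Debug-build check; compiled out under NDEBUG, much like Assert(). */
	assert((int) lsnType >= 0 && (int) lsnType < WAIT_LSN_TYPE_COUNT);

	/* No default clause, so -Wswitch can flag a newly added enum value. */
	switch (lsnType)
	{
		case WAIT_LSN_TYPE_STANDBY_REPLAY:
			return demo_lsn[WAIT_LSN_TYPE_STANDBY_REPLAY];
		case WAIT_LSN_TYPE_STANDBY_WRITE:
			return demo_lsn[WAIT_LSN_TYPE_STANDBY_WRITE];
		case WAIT_LSN_TYPE_STANDBY_FLUSH:
			return demo_lsn[WAIT_LSN_TYPE_STANDBY_FLUSH];
		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
			return demo_lsn[WAIT_LSN_TYPE_PRIMARY_FLUSH];
	}

	/* Only reached for an unhandled value; fail loudly in every build. */
	fprintf(stderr, "invalid LSN wait type: %d\n", (int) lsnType);
	abort();
}
```

With only the assertion, a non-assert build would fall off the end of the
switch for an unhandled value; the trailing hard error keeps that case
visible in production builds as well.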

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

3 - 0002
```
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+                               const           WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+                               XLogRecPtr      currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
-                                                       errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+                                                       errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
-                                                                         LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+                                                                         desc->verb,
+                                                                         desc->noun,
+                                                                         LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
-                                                       errhint("Waiting for the replay LSN can only be executed during recovery."));
+                                                       errhint("Waiting for the %s LSN can only be executed during recovery.",
+                                                                       desc->noun));
}
```

currentLSN is only used in the if clause, thus it can be defined inside the if clause.

+ 1.

3 - 0002
```
+       /*
+        * If we wrote an LSN that someone was waiting for then walk over the
+        * shared memory array and set latches to notify the waiters.
+        */
+       if (waitLSNState &&
+               (LogstreamResult.Write >=
+                pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+               WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
```

Do we need to mention "walk over the shared memory array and set latches” in the comment? That logic belongs to WaitLSNWakeup(). If the wakeup logic changes in the future, this comment would become stale. So I think we only need to mention “notify the waiters”.

It makes sense to me. They are incorporated into v8.
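
As a side note, the fast-path check in the snippet above can be sketched in
standalone C11 as below. This is a simplified, single-waiter model and not
the walreceiver code: DemoWaitState and the demo_* functions are
hypothetical, and using UINT64_MAX as the "nobody is waiting" value is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical single-waiter stand-in for the shared wait state. */
typedef struct DemoWaitState
{
	_Atomic uint64_t minWaitedLSN;	/* UINT64_MAX while nobody waits */
	int			wakeups;			/* how many times the slow path ran */
} DemoWaitState;

void
demo_wait_state_init(DemoWaitState *state)
{
	assert(state != NULL);
	atomic_init(&state->minWaitedLSN, UINT64_MAX);
	state->wakeups = 0;
}

/* Register one waiter by publishing the LSN it is waiting for. */
void
demo_add_waiter(DemoWaitState *state, XLogRecPtr target)
{
	atomic_store(&state->minWaitedLSN, target);
}

/* Slow path stand-in: count the wakeup and clear the published target. */
static void
demo_wakeup(DemoWaitState *state)
{
	state->wakeups++;
	atomic_store(&state->minWaitedLSN, UINT64_MAX);
}

/*
 * Called after WAL up to 'written' has been written.  One atomic read
 * decides whether the slow path is needed at all.
 */
bool
demo_notify_written(DemoWaitState *state, XLogRecPtr written)
{
	if (written >= atomic_load(&state->minWaitedLSN))
	{
		demo_wakeup(state);
		return true;
	}
	return false;
}
```

The caller only needs to know that waiters get notified; how the slow path
finds and wakes them stays inside the wakeup function, which is the point
of the shortened comment.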

4 - 0003
```
+       /*
+        * Handle parenthesized option list.  This fires when we're in an
+        * unfinished parenthesized option list.  get_previous_words treats a
+        * completed parenthesized option list as one word, so the above test is
+        * correct.  mode takes a string value ('replay', 'write', 'flush'),
+        * timeout takes a string value, no_throw takes no value.
+        */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
-               /*
-                * This fires if we're in an unfinished parenthesized option list.
-                * get_previous_words treats a completed parenthesized option list as
-                * one word, so the above test is correct.
-                */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-                       COMPLETE_WITH("timeout", "no_throw");
-
-               /*
-                * timeout takes a string value, no_throw takes no value. We don't
-                * offer completions for these values.
-                */
```

The new comment drops the sentence “We don’t offer completions for these values” (for timeout and no_throw). To be explicit, I feel we should retain it.

The sentence is retained.

5 - 0004
```
+       my $isrecovery =
+         $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+       chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-               my $isrecovery =
-                 $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
-               chomp($isrecovery);
```

I wonder why pg_is_in_recovery() is pulled up to an earlier place and called unconditionally?

This seems unnecessary. I also realized that my earlier approach in
patch 4 may have been semantically incorrect — it could end up waiting
for the LSN to replay/write/flush on the node itself, rather than on
the downstream standby, which defeats the purpose of
wait_for_catchup(). Patch 4 attempts to address this by running WAIT
FOR LSN on the standby itself.

Support for primary-flush waiting and the refactoring of existing
modes have also been incorporated in v8 following Alexander’s
feedback. The major updates are in patches 2 and 4. Please check.

Added WaitLSNTypeDesc to typedefs.list in v9 patch 2.

Run pgindent using the updated typedefs.list.

--
Best,
Xuneng

Attachments:

v10-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait.patch (application/octet-stream)
From ad54cf3b65274450dfde3e4ea2898ac3a7352a12 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v10 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..51e5324bff3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

v10-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From afc09c6893df66fa0348fff92b5999c9873d3812 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v10 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords. The MODE option controls whether
the wait is evaluated from the standby or primary perspective.

When MODE is specified, completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 75a101c6ab5..62d87561169 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5371,25 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.
+	 *
+	 * 'mode' takes a string value ('standby_replay', 'standby_write',
+	 * 'standby_flush', 'primary_flush'). 'timeout' takes a string value, and
+	 * 'no_throw' takes no value. We do not offer completions for the *values*
+	 * of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v10-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 091eed383f1fc9503f423a9253634b4d477817e5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v10 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN
- 'standby_write': Wait for WAL to be written (received) to the specified LSN
- 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN
- 'primary_flush': Wait for WAL to be flushed to disk on the primary server

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server)
- Primary mode ('primary_flush') can only be used on a primary server

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

Also includes:
- Documentation updates for the new syntax and mode descriptions
- Test coverage for all four modes including error cases and concurrent waiters
- Wakeup logic in walreceiver for standby write/flush waiters
- Wakeup logic in WAL writer for primary flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             |  96 +++++-
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 src/tools/pgindent/typedefs.list        |   1 +
 6 files changed, 674 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <parameter>NO_THROW</parameter> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written to disk on a
+            standby server, but not yet flushed. This is faster than
+            <literal>standby_flush</literal> but provides weaker durability
+            guarantees since the data may still be in operating system
+            buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will throw an error
+   (or return <literal>not in recovery</literal> if
+   <literal>NO_THROW</literal> is specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>LSN</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdb92deac57..da96b627228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..01df18140bd 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,18 +29,39 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* Mode name: "standby_replay",
+								 * "standby_write", "standby_flush",
+								 * "primary_flush" */
+	const char *verb;			/* Past participle: "replayed", "written",
+								 * "flushed" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"standby_replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"standby_write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"standby_flush", "flushed"},
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = {"primary_flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +69,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for WAIT option \"MODE\": \"%s\"",
+								mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +154,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +187,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +216,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +235,27 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+
 				if (PromoteIsTriggered())
 				{
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..404d348da37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..e41aad45e28 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "MODE": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We make 5
+# background psql sessions each waiting for a corresponding insertion.  When
+# waiting is finished, a stored function logs whether as many rows are
+# visible as expected.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..ab7149c5e62 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3305,6 +3305,7 @@ WaitLSNProcInfo
 WaitLSNResult
 WaitLSNState
 WaitLSNType
+WaitLSNTypeDesc
 WaitPMResult
 WaitStmt
 WalCloseMethod
-- 
2.51.0

v10-0001-Extend-xlogwait-infrastructure-with-write-and-fl.patch (application/octet-stream)
From 668956d0d0794c489167912d54d4c9c7bb237754 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v10 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b7ef589fc0..fdb92deac57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..2d81bb1a9a7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..5f4ff50cf38 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..e62054585cb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0

#77Álvaro Herrera
alvherre@kurilemu.de
In reply to: Xuneng Zhou (#74)
Re: Implement waiting for wal lsn replay: reloaded

On 2025-Dec-27, Xuneng Zhou wrote:

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

The reason not to do that (and also put WAIT in a separate string) is so
that the message is identical to other messages and thus requires no
separate translation, specifically
errmsg("unrecognized value for %s option \"%s\": \"%s\"", ...)

See commit 502e256f2262. Please use that form.
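
As a rough illustration of the point above (this is not code from the patch
or from commit 502e256f2262): a minimal standalone C sketch in which the
macro UNRECOGNIZED_OPTION_VALUE_FMT and the helper
format_unrecognized_option() are hypothetical names, and plain snprintf()
stands in for errmsg(). It only shows how a single format string can serve
both WAIT and EXPLAIN, so that translators would see one message instead of
one per command.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * One shared format string.  The command name is passed as an argument
 * rather than written into the string, so every command reuses the same
 * translatable message.
 */
#define UNRECOGNIZED_OPTION_VALUE_FMT \
	"unrecognized value for %s option \"%s\": \"%s\""

/*
 * Hypothetical helper (not a PostgreSQL function): format the message into
 * buf.  Returns what snprintf() returns, i.e. the length the full message
 * would have had.
 */
static int
format_unrecognized_option(char *buf, size_t buflen,
						   const char *command, const char *option,
						   const char *value)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, UNRECOGNIZED_OPTION_VALUE_FMT,
					command, option, value);
}
```

Calling the helper with ("WAIT", "MODE", mode_str) and with
("EXPLAIN", opt->defname, p) goes through the same format string; only the
arguments differ.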

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).

#78Chao Li
li.evan.chao@gmail.com
In reply to: Álvaro Herrera (#77)
Re: Implement waiting for wal lsn replay: reloaded

On Dec 30, 2025, at 11:14, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Dec-27, Xuneng Zhou wrote:

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

The reason not to do that (and also put WAIT in a separate string) is so
that the message is identical to other messages and thus requires no
separate translation, specifically
errmsg("unrecognized value for %s option \"%s\": \"%s\"", ...)

See commit 502e256f2262. Please use that form.

To follow 502e256f2262, it should use “%s” for “WAIT” as well. I raised the comment because I saw that “WAIT” was in the format string, so “MODE” could be there as well.

So we should make a change similar to this one:
```
-                                                errmsg("unrecognized value for EXPLAIN option \"%s\": \"%s\"",
-                                                               opt->defname, p),
+                                                errmsg("unrecognized value for %s option \"%s\": \"%s\"",
+                                                               "EXPLAIN", opt->defname, p),
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#79Xuneng Zhou
xunengzhou@gmail.com
In reply to: Chao Li (#78)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Dec 30, 2025 at 11:25 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 30, 2025, at 11:14, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Dec-27, Xuneng Zhou wrote:

On Fri, Dec 26, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

2 - 0002
```
+                       else
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+                                                               "MODE", mode_str),
```

I wonder why we don’t put MODE directly into the error message?

Yeah, putting MODE into the error message is cleaner. It's done in v8.

The reason not to do that (and also put WAIT in a separate string) is so
that the message is identical to other messages and thus requires no
separate translation, specifically
errmsg("unrecognized value for %s option \"%s\": \"%s\"", ...)

See commit 502e256f2262. Please use that form.

To follow 502e256f2262, it should use “%s” for “WAIT” as well. I raised the comment because I saw that “WAIT” was in the format string, so “MODE” could be there as well.

So we should make a change similar to this one:
```
-                                                errmsg("unrecognized value for EXPLAIN option \"%s\": \"%s\"",
-                                                               opt->defname, p),
+                                                errmsg("unrecognized value for %s option \"%s\": \"%s\"",
+                                                               "EXPLAIN", opt->defname, p),
```

Thanks for raising this and clarifying the rationale. I've made the
modification per your input.

--
Best,
Xuneng

Attachments:

v11-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 6b15fa268bb26bdd81879da267f0d1ccab2c8093 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v11 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN
- 'standby_write': Wait for WAL to be written (received) to the specified LSN
- 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN
- 'primary_flush': Wait for WAL to be flushed to disk on the primary server

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server)
- Primary mode ('primary_flush') can only be used on a primary server

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

Also includes:
- Documentation updates for the new syntax and mode descriptions
- Test coverage for all four modes including error cases and concurrent waiters
- Wakeup logic in walreceiver for standby write/flush waiters
- Wakeup logic in WAL writer for primary flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             |  96 +++++-
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 src/tools/pgindent/typedefs.list        |   1 +
 6 files changed, 674 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <parameter>NO_THROW</parameter> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written to disk on a
+            standby server, but not yet flushed. This is faster than
+            <literal>standby_flush</literal> but provides weaker durability
+            guarantees since the data may still be in operating system
+            buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will return
+   <literal>not in recovery</literal> (or throw an error if
+   <literal>NO_THROW</literal> is not specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdb92deac57..da96b627228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..016e948eb77 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -28,18 +29,39 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+	const char *noun;			/* Mode name: "standby_replay",
+								 * "standby_write", "standby_flush",
+								 * "primary_flush" */
+	const char *verb;			/* Past participle: "replayed", "written",
+								 * "flushed" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = {"standby_replay", "replayed"},
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = {"standby_write", "written"},
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = {"standby_flush", "flushed"},
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = {"primary_flush", "flushed"},
+};
+
 void
 ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 {
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +69,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for %s option \"%s\": \"%s\"",
+								"WAIT", defel->defname, mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +154,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  First, make sure that we don't hold a
+	 * snapshot and, correspondingly, that our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +187,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +216,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
+			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 				ereport(ERROR,
 						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
 							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}
 			else
 				result = "timeout";
 			break;
@@ -166,20 +235,27 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
 			if (throw)
 			{
+				const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+
 				if (PromoteIsTriggered())
 				{
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+							errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
 									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+									  desc->verb,
+									  desc->noun,
+									  LSN_FORMAT_ARGS(currentLSN)));
 				}
 				else
 					ereport(ERROR,
 							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+							errhint("Waiting for the %s LSN can only be executed during recovery.",
+									desc->noun));
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..404d348da37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..b767b475ff7 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "mode": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We make 5
+# background psql sessions each waiting for a corresponding insertion.  When
+# waiting is finished, a stored function logs whether as many rows are
+# visible as there should be.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..ab7149c5e62 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3305,6 +3305,7 @@ WaitLSNProcInfo
 WaitLSNResult
 WaitLSNState
 WaitLSNType
+WaitLSNTypeDesc
 WaitPMResult
 WaitStmt
 WalCloseMethod
-- 
2.51.0
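
For readers following the thread, the notification pattern that the patch above repeats in XLogFlush(), XLogBackgroundFlush(), XLogWalRcvWrite() and XLogWalRcvFlush() can be illustrated with a small standalone sketch. This is not the patch's code: every demo_* name is hypothetical, the pairing heap and latches are left out, and it assumes that the cached minimum is the largest 64-bit value when nobody is waiting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLSN;		/* stand-in for XLogRecPtr */

/* Hypothetical wait types mirroring the four modes named in the patch. */
typedef enum DemoWaitType
{
	DEMO_STANDBY_REPLAY,
	DEMO_STANDBY_WRITE,
	DEMO_STANDBY_FLUSH,
	DEMO_PRIMARY_FLUSH,
	DEMO_WAIT_TYPE_COUNT
} DemoWaitType;

/* Cached least-awaited LSN per type; UINT64_MAX is assumed to mean "no waiters". */
static _Atomic DemoLSN demo_min_waited[DEMO_WAIT_TYPE_COUNT];

/* Counts how often the (expensive) wakeup path would have been entered. */
static int	demo_wakeup_calls = 0;

void
demo_init(void)
{
	for (int i = 0; i < DEMO_WAIT_TYPE_COUNT; i++)
		atomic_store(&demo_min_waited[i], UINT64_MAX);
	demo_wakeup_calls = 0;
}

/* A waiter publishes its target by lowering the cached minimum if needed. */
void
demo_register_waiter(DemoWaitType type, DemoLSN target)
{
	DemoLSN		cur = atomic_load(&demo_min_waited[type]);

	/* On CAS failure, cur is refreshed with the current value and we retry. */
	while (target < cur &&
		   !atomic_compare_exchange_weak(&demo_min_waited[type], &cur, target))
		;
}

/*
 * Called by the process that advances the LSN.  The cheap atomic read is
 * the fast path: the wakeup routine is entered only if at least one waiter
 * could be satisfied.  Here the wakeup is merely counted.
 */
bool
demo_notify(DemoWaitType type, DemoLSN reached)
{
	if (reached < atomic_load(&demo_min_waited[type]))
		return false;			/* nobody is waiting for this LSN yet */

	demo_wakeup_calls++;
	return true;
}
```

The point of the design is that the WAL-advancing process pays only one atomic read per call in the common case where nobody waits for the LSN just reached.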

v11-0001-Extend-xlogwait-infrastructure-with-write-and-fl.patch (application/octet-stream)
From 668956d0d0794c489167912d54d4c9c7bb237754 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v11 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with a MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)
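
As a reading aid for the key changes listed above, the idea of a per-wait-type "current LSN" lookup can be sketched as below. This is only an illustration under stated assumptions: the sketch_* and Sketch* names are hypothetical, the three positions are plain struct fields rather than values read from shared memory, and the real GetCurrentLSNForWaitType() may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLSN;		/* stand-in for XLogRecPtr */

/* Hypothetical subset of wait types: replay, write and flush on a standby. */
typedef enum SketchWaitType
{
	SKETCH_WAIT_REPLAY,
	SKETCH_WAIT_WRITE,
	SKETCH_WAIT_FLUSH
} SketchWaitType;

/* Toy snapshot of how far the standby has progressed. */
typedef struct SketchWalPositions
{
	SketchLSN	replayed;
	SketchLSN	written;
	SketchLSN	flushed;
} SketchWalPositions;

/* Pick the position that corresponds to the requested wait type. */
SketchLSN
sketch_current_lsn(const SketchWalPositions *pos, SketchWaitType type)
{
	switch (type)
	{
		case SKETCH_WAIT_REPLAY:
			return pos->replayed;
		case SKETCH_WAIT_WRITE:
			return pos->written;
		case SKETCH_WAIT_FLUSH:
			return pos->flushed;
	}
	return 0;					/* not reached for valid types */
}

/* A wait can return at once if the target is not ahead of the current LSN. */
bool
sketch_target_reached(const SketchWalPositions *pos, SketchWaitType type,
					  SketchLSN target)
{
	return target <= sketch_current_lsn(pos, type);
}
```

With a single lookup keyed by wait type, the same wait loop and the same error-reporting code can serve every mode, which appears to be what lets the timeout message report the current LSN for the mode being waited on.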

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b7ef589fc0..fdb92deac57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..2d81bb1a9a7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..5f4ff50cf38 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..e62054585cb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
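
The fast-path check described in the xlogwait.c header comment above can be sketched in isolation. This is a minimal standalone illustration, not the patch's code: it uses C11 atomics in place of pg_atomic_read_u64(), and the names min_waited_lsn, reset_min_waited, set_min_waited, needs_wakeup, and the NO_WAITERS sentinel are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Same ordering as the WaitLSNType enum in the patch; redeclared locally. */
typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;

#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)

/* Assumed sentinel meaning "nobody is waiting on this type". */
#define NO_WAITERS UINT64_MAX

/* Hypothetical stand-in for waitLSNState->minWaitedLSN[]. */
static _Atomic uint64_t min_waited_lsn[WAIT_LSN_TYPE_COUNT];

/* Mark every wait type as having no waiters. */
static void
reset_min_waited(void)
{
	for (int i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
		atomic_store(&min_waited_lsn[i], NO_WAITERS);
}

/* Publish the least awaited LSN for one wait type. */
static void
set_min_waited(WaitLSNType type, XLogRecPtr lsn)
{
	assert((int) type >= 0 && (int) type < WAIT_LSN_TYPE_COUNT);
	atomic_store(&min_waited_lsn[type], lsn);
}

/*
 * Fast path: a single atomic read decides whether the heap for this wait
 * type needs to be examined at all.
 */
static bool
needs_wakeup(WaitLSNType type, XLogRecPtr current_lsn)
{
	assert((int) type >= 0 && (int) type < WAIT_LSN_TYPE_COUNT);
	return current_lsn >= atomic_load(&min_waited_lsn[type]);
}
```

Because each wait type has its own slot, a waiter registered for replay does not cause the walreceiver's write or flush checks to leave the fast path.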

v11-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From b2c1fac6ec41ec52d96628294c2d1c7f5c1191f3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v11 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option of the
WAIT FOR LSN command. After an LSN value, completion offers WITH; inside
the parenthesized option list it now offers mode in addition to timeout
and no_throw. The MODE option controls whether the wait is evaluated
from the standby or primary perspective.

When MODE is specified, completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 75a101c6ab5..62d87561169 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5371,25 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.
+	 *
+	 * 'mode' takes a string value ('standby_replay', 'standby_write',
+	 * 'standby_flush', 'primary_flush'). 'timeout' takes a string value, and
+	 * 'no_throw' takes no value. We do not offer completions for the *values*
+	 * of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v11-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait.patch (application/octet-stream)
From 7134802e488caf98ae9ef33c09a10e21a5fa0fc3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v11 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..51e5324bff3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
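
The choice between WAIT FOR LSN and polling in the 0004 patch above reduces to a small decision rule. The sketch below restates that rule in C purely for illustration; the actual logic is the Perl code in wait_for_catchup(), and the function name wait_mode_for_catchup and its boolean parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the WAIT FOR LSN mode name to use on the standby, or NULL when the
 * caller should fall back to polling pg_stat_replication on the upstream.
 */
static const char *
wait_mode_for_catchup(bool standby_is_cluster_object,
					  bool standby_in_recovery,
					  const char *mode)
{
	/* A plain name string (e.g. a subscription name) gives nothing to connect to. */
	if (!standby_is_cluster_object)
		return NULL;

	/* A promoted standby can no longer wait in the standby modes. */
	if (!standby_in_recovery)
		return NULL;

	if (strcmp(mode, "replay") == 0)
		return "standby_replay";
	if (strcmp(mode, "write") == 0)
		return "standby_write";
	if (strcmp(mode, "flush") == 0)
		return "standby_flush";

	/* 'sent' has no corresponding WAIT FOR LSN mode. */
	return NULL;
}
```

A NULL result corresponds to the three fallback cases listed in the patch: 'sent' mode, a standby passed as a string, and a standby that has been promoted.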

#80 Álvaro Herrera
alvherre@kurilemu.de
In reply to: Xuneng Zhou (#79)
Re: Implement waiting for wal lsn replay: reloaded

In 0002 you have this kind of thing:

ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+						errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+							   desc->verb,
+							   desc->noun,
+							   LSN_FORMAT_ARGS(currentLSN)));
+			}

I'm afraid this technique doesn't work, for translatability reasons.
Your whole design of having a struct with ->verb and ->noun is not
workable (which is a pity, but you can't really fight this.) You need to
spell out the whole messages for each case, something like

if (lsntype == replay)
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
else if (lsntype == flush)
ereport( ... )

and so on. This means four separate messages for translation for each
message your patch is adding, which is IMO the correct approach.
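
A minimal standalone sketch of the per-case approach follows. It is an illustration under stated assumptions, not the patch: snprintf() stands in for ereport()/errmsg(), where each case would be a separate call carrying its own literal string so that message extraction sees a whole sentence. The function name format_timeout_message and the LSN_ARGS macro are hypothetical, and the wording of the write and flush messages is assumed by analogy with the replay message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;

/* Split an LSN into the two 32-bit halves printed by "%X/%08X". */
#define LSN_ARGS(lsn) (unsigned int) ((lsn) >> 32), (unsigned int) (lsn)

/*
 * Every case spells out a complete sentence.  Nothing is assembled from
 * verb/noun fragments, so each message can be translated as a unit.
 */
static int
format_timeout_message(char *buf, size_t len, WaitLSNType type,
					   XLogRecPtr target, XLogRecPtr current)
{
	switch (type)
	{
		case WAIT_LSN_TYPE_STANDBY_REPLAY:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
							LSN_ARGS(target), LSN_ARGS(current));
		case WAIT_LSN_TYPE_STANDBY_WRITE:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be written; current standby_write LSN %X/%08X",
							LSN_ARGS(target), LSN_ARGS(current));
		case WAIT_LSN_TYPE_STANDBY_FLUSH:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be flushed; current standby_flush LSN %X/%08X",
							LSN_ARGS(target), LSN_ARGS(current));
		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be flushed; current primary_flush LSN %X/%08X",
							LSN_ARGS(target), LSN_ARGS(current));
	}
	return -1;					/* unknown wait type */
}
```

The repetition is the point: a translator can reorder or inflect each sentence independently, which is not possible when a verb and a noun are substituted through %s.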

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"... In accounting terms this makes perfect sense. To rational humans, it
is insane. Welcome to IBM." (Robert X. Cringely)
https://www.cringely.com/2015/06/03/autodesks-john-walker-explained-hp-and-ibm-in-1991/

#81 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Álvaro Herrera (#80)
Re: Implement waiting for wal lsn replay: reloaded

On Thu, Jan 1, 2026 at 7:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

> In 0002 you have this kind of thing:
>
> ereport(ERROR,
> errcode(ERRCODE_QUERY_CANCELED),
> -                                             errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
> +                                             errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
> LSN_FORMAT_ARGS(lsn),
> -                                                        LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
> +                                                        desc->verb,
> +                                                        desc->noun,
> +                                                        LSN_FORMAT_ARGS(currentLSN)));
> +                     }
>
> I'm afraid this technique doesn't work, for translatability reasons.
> Your whole design of having a struct with ->verb and ->noun is not
> workable (which is a pity, but you can't really fight this.) You need to
> spell out the whole messages for each case, something like
>
> if (lsntype == replay)
> ereport(ERROR,
> errcode(ERRCODE_QUERY_CANCELED),
> errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
> else if (lsntype == flush)
> ereport( ... )
>
> and so on. This means four separate messages for translation for each
> message your patch is adding, which is IMO the correct approach.

+1
Thank you for catching this, Alvaro. Yes, I think we need to get rid
of WaitLSNTypeDesc. It's a nice idea, but we support too many languages
to have something like this.

------
Regards,
Alexander Korotkov
Supabase

#82 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#81)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alvaro, Alexander,

On Fri, Jan 2, 2026 at 7:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

> On Thu, Jan 1, 2026 at 7:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
>> In 0002 you have this kind of thing:
>>
>> ereport(ERROR,
>> errcode(ERRCODE_QUERY_CANCELED),
>> -                                             errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
>> +                                             errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
>> LSN_FORMAT_ARGS(lsn),
>> -                                                        LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
>> +                                                        desc->verb,
>> +                                                        desc->noun,
>> +                                                        LSN_FORMAT_ARGS(currentLSN)));
>> +                     }
>>
>> I'm afraid this technique doesn't work, for translatability reasons.
>> Your whole design of having a struct with ->verb and ->noun is not
>> workable (which is a pity, but you can't really fight this.) You need to
>> spell out the whole messages for each case, something like
>>
>> if (lsntype == replay)
>> ereport(ERROR,
>> errcode(ERRCODE_QUERY_CANCELED),
>> errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
>> else if (lsntype == flush)
>> ereport( ... )
>>
>> and so on. This means four separate messages for translation for each
>> message your patch is adding, which is IMO the correct approach.
>
> +1
> Thank you for catching this, Alvaro. Yes, I think we need to get rid
> of WaitLSNTypeDesc. It's a nice idea, but we support too many languages
> to have something like this.

Thanks for pointing this out. The verb/noun approach doesn’t scale to
multiple languages. Switch statements with fully spelled-out messages
are more verbose, but that cost is justified to preserve proper
internationalization. Please check the updated v12.

--
Best,
Xuneng

Attachments:

v12-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch (application/octet-stream)
From fa5cb59b09cbaff43ffd8b3c3d0b98ebf671bebc Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v12 3/4] Add tab completion for WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option of the
WAIT FOR LSN command. After an LSN value, completion offers WITH; inside
the parenthesized option list it now offers mode in addition to timeout
and no_throw. The MODE option controls whether the wait is evaluated
from the standby or primary perspective.

When MODE is specified, completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 75a101c6ab5..62d87561169 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,10 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be: standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5371,25 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the above test is
+	 * correct.
+	 *
+	 * 'mode' takes a string value ('standby_replay', 'standby_write',
+	 * 'standby_flush', 'primary_flush'). 'timeout' takes a string value, and
+	 * 'no_throw' takes no value. We do not offer completions for the *values*
+	 * of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.51.0

v12-0001-Extend-xlogwait-infrastructure-with-write-and-fl.patch (application/octet-stream)
From 668956d0d0794c489167912d54d4c9c7bb237754 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v12 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with a MODE parameter.

Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add the wait event WAIT_EVENT_WAIT_FOR_WAL_WRITE and extend
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH to standbys for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b7ef589fc0..fdb92deac57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 38b594d2170..2d81bb1a9a7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..5f4ff50cf38 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..e62054585cb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (woken by the walreceiver or startup process) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (woken by the WAL writer or backends) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.51.0
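
The enum-indexed table and compile-time length check introduced in 0003 can be tried outside the tree. Below is a minimal standalone sketch, not the patch's code. The WaitEvent enum and the function names wait_event_for() and wait_type_requires_recovery() are stand-ins invented for illustration; only the WaitLSNType values mirror xlogwait.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors WaitLSNType from the patch's xlogwait.h. */
typedef enum WaitLSNType
{
	WAIT_LSN_TYPE_STANDBY_REPLAY,
	WAIT_LSN_TYPE_STANDBY_WRITE,
	WAIT_LSN_TYPE_STANDBY_FLUSH,
	WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;

#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)

/* Hypothetical stand-in for the WAIT_EVENT_WAIT_FOR_WAL_* constants. */
typedef enum WaitEvent
{
	EV_WAL_REPLAY,
	EV_WAL_WRITE,
	EV_WAL_FLUSH,
} WaitEvent;

/* One entry per wait type; both flush types share one wait event. */
static const WaitEvent wait_events[] = {
	[WAIT_LSN_TYPE_STANDBY_REPLAY] = EV_WAL_REPLAY,
	[WAIT_LSN_TYPE_STANDBY_WRITE] = EV_WAL_WRITE,
	[WAIT_LSN_TYPE_STANDBY_FLUSH] = EV_WAL_FLUSH,
	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = EV_WAL_FLUSH,
};

/* Fails to compile if the table is shorter or longer than the enum. */
_Static_assert(sizeof(wait_events) / sizeof(wait_events[0]) == WAIT_LSN_TYPE_COUNT,
			   "wait_events must match WaitLSNType enum");

/* Look up the wait event to report for a given wait type. */
WaitEvent
wait_event_for(WaitLSNType t)
{
	assert((int) t >= 0 && (int) t < WAIT_LSN_TYPE_COUNT);
	return wait_events[t];
}

/* Standby wait types need recovery to be in progress; primary ones do not. */
bool
wait_type_requires_recovery(WaitLSNType t)
{
	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
}
```

One caveat of this pattern: with designated initializers the array length is set by the highest index used, so the static assertion catches a wait type appended to the enum without a table entry, but an entry skipped in the middle of the table would silently read as zero (EV_WAL_REPLAY in this sketch).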

v12-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait.patch (application/octet-stream)
From cf6bea8a139e492281664e524a69be0e2cca17af Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v12 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..51e5324bff3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
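
The decision that 0004 adds to wait_for_catchup() (use WAIT FOR LSN only when a node object was passed, the mode has an equivalent, and the standby is in recovery; otherwise poll) can be restated compactly. The following is a rough C sketch of that logic for illustration only; the real code is the Perl above, and wait_mode_for() and build_wait_query() are hypothetical names that do not exist in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Map a wait_for_catchup() mode to a WAIT FOR LSN mode name.
 * Returns NULL for modes with no equivalent, such as 'sent'.
 */
const char *
wait_mode_for(const char *mode)
{
	if (strcmp(mode, "replay") == 0)
		return "standby_replay";
	if (strcmp(mode, "write") == 0)
		return "standby_write";
	if (strcmp(mode, "flush") == 0)
		return "standby_flush";
	return NULL;
}

/*
 * Build the WAIT FOR LSN query into buf.  Returns false when the caller
 * should fall back to polling pg_stat_replication: no node object was
 * passed, the standby is not in recovery, or the mode has no equivalent.
 * Also returns false if buf is too small to hold the query.
 */
bool
build_wait_query(char *buf, size_t buflen,
				 bool have_standby_node, bool standby_in_recovery,
				 const char *mode, const char *target_lsn,
				 int timeout_secs)
{
	const char *wait_mode = wait_mode_for(mode);
	int			n;

	if (!have_standby_node || !standby_in_recovery || wait_mode == NULL)
		return false;

	n = snprintf(buf, buflen,
				 "WAIT FOR LSN '%s' WITH (MODE '%s', timeout '%ds', no_throw);",
				 target_lsn, wait_mode, timeout_secs);
	return n >= 0 && (size_t) n < buflen;
}
```

With no_throw in the query, the caller compares the returned status against 'success' instead of relying on an error, which is what lets the Perl code print pg_stat_replication details before failing.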

v12-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 3276142d426310850877eadbabe041d6d51e2c76 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v12 2/4] Add MODE option to WAIT FOR LSN command

Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
- 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN
- 'standby_write': Wait for WAL to be written (received) to the specified LSN
- 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN
- 'primary_flush': Wait for WAL to be flushed to disk on the primary server

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server)
- Primary mode ('primary_flush') can only be used on a primary server

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

Also includes:
- Documentation updates for the new syntax and mode descriptions
- Test coverage for all four modes including error cases and concurrent waiters
- Wakeup logic in walreceiver for standby write/flush waiters
- Wakeup logic in WAL writer for primary flush waiters
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             | 174 ++++++++--
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 5 files changed, 741 insertions(+), 97 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <parameter>NO_THROW</parameter> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written to the operating
+            system on a standby server, without waiting for it to be flushed
+            to disk. This is faster than <literal>standby_flush</literal> but
+            provides weaker durability guarantees since the data may still be
+            in operating system buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will return
+   <literal>not in recovery</literal> (or throw an error if
+   <literal>NO_THROW</literal> is not specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use <command>WAIT FOR</command> command to wait for
+   the <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on primary server to get the <acronym>lsn</acronym> given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdb92deac57..da96b627228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all standby-mode waiters.  They need to report an error that
+	 * recovery ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..54f2df2425f 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -34,12 +35,14 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +50,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for %s option \"%s\": \"%s\"",
+								"WAIT", defel->defname, mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +135,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We must first make sure we don't
+	 * hold a snapshot and, correspondingly, that MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +168,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +197,48 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
-				ereport(ERROR,
-						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
-							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			{
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
+				switch (lsnType)
+				{
+					case WAIT_LSN_TYPE_STANDBY_REPLAY:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_STANDBY_WRITE:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be written; current standby_write LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_STANDBY_FLUSH:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be flushed; current standby_flush LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be flushed; current primary_flush LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					default:
+						elog(ERROR, "unexpected wait LSN type %d", lsnType);
+						pg_unreachable();
+				}
+			}
 			else
 				result = "timeout";
 			break;
@@ -168,18 +248,72 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 			{
 				if (PromoteIsTriggered())
 				{
-					ereport(ERROR,
-							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
-									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
+					switch (lsnType)
+					{
+						case WAIT_LSN_TYPE_STANDBY_REPLAY:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was replayed; last standby_replay LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_WRITE:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was written; last standby_write LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_FLUSH:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was flushed; last standby_flush LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						default:
+							elog(ERROR, "unexpected wait LSN type %d", lsnType);
+							pg_unreachable();
+					}
 				}
 				else
-					ereport(ERROR,
-							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+				{
+					switch (lsnType)
+					{
+						case WAIT_LSN_TYPE_STANDBY_REPLAY:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_replay LSN can only be executed during recovery."));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_WRITE:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_write LSN can only be executed during recovery."));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_FLUSH:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_flush LSN can only be executed during recovery."));
+							break;
+
+						default:
+							elog(ERROR, "unexpected wait LSN type %d", lsnType);
+							pg_unreachable();
+					}
+				}
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..404d348da37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..b767b475ff7 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "mode": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We make 5
+# background psql sessions each waiting for a corresponding insertion.  When
+# waiting is finished, stored procedures logs if there are visible as many
+# rows as should be.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.51.0

#83Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#82)
4 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi, Xuneng!

On Fri, Jan 2, 2026 at 11:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Jan 2, 2026 at 7:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Jan 1, 2026 at 7:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

In 0002 you have this kind of thing:

ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
-                                             errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+                                             errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
-                                                        LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+                                                        desc->verb,
+                                                        desc->noun,
+                                                        LSN_FORMAT_ARGS(currentLSN)));
+                     }

I'm afraid this technique doesn't work, for translatability reasons.
Your whole design of having a struct with ->verb and ->noun is not
workable (which is a pity, but you can't really fight this.) You need to
spell out the whole messages for each case, something like

if (lsntype == replay)
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
else if (lsntype == flush)
ereport( ... )

and so on. This means four separate messages for translation for each
message your patch is adding, which is IMO the correct approach.

+1
Thank you for catching this, Alvaro. Yes, I think we need to get rid
of WaitLSNTypeDesc. It's a nice idea, but we support too many languages
to have something like this.

Thanks for pointing this out. This approach doesn’t scale to multiple
languages. While switch statements are more verbose, the extra clarity
is justified to preserve proper internationalization. Please check the
updated v12.
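
The per-case approach discussed above can be sketched outside the server as follows. This is only an illustration under stated assumptions, not the code from the patch: WaitType, translate(), timeout_message_format() and format_timeout_message() are hypothetical names, and translate() stands in for a gettext-style lookup. The point it tries to show is that each wait type owns one complete sentence, so a translator never has to assemble a message from a separately translated verb and noun.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the patch's WaitLSNType enum. */
typedef enum
{
	WAIT_STANDBY_REPLAY,
	WAIT_STANDBY_WRITE,
	WAIT_STANDBY_FLUSH,
	WAIT_PRIMARY_FLUSH
} WaitType;

/* Placeholder for a gettext-style lookup; here it returns the msgid as is. */
static const char *
translate(const char *msgid)
{
	return msgid;
}

/*
 * Return the full format string for a timeout on the given wait type.
 * Every case spells out the whole sentence, so each one is a separate
 * translatable message.  Returns NULL for an unknown type.
 */
static const char *
timeout_message_format(WaitType type)
{
	switch (type)
	{
		case WAIT_STANDBY_REPLAY:
			return translate("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X");
		case WAIT_STANDBY_WRITE:
			return translate("timed out while waiting for target LSN %X/%08X to be written; current standby_write LSN %X/%08X");
		case WAIT_STANDBY_FLUSH:
			return translate("timed out while waiting for target LSN %X/%08X to be flushed; current standby_flush LSN %X/%08X");
		case WAIT_PRIMARY_FLUSH:
			return translate("timed out while waiting for target LSN %X/%08X to be flushed; current primary_flush LSN %X/%08X");
	}
	return NULL;
}

/*
 * Format the timeout message into buf, printing each 64-bit LSN as its
 * high and low 32-bit halves.  Returns the snprintf() result, or -1 if
 * the wait type is not recognized.
 */
static int
format_timeout_message(char *buf, size_t len, WaitType type,
					   uint64_t target, uint64_t current)
{
	const char *fmt = timeout_message_format(type);

	if (fmt == NULL)
		return -1;

	return snprintf(buf, len, fmt,
					(unsigned int) (target >> 32), (unsigned int) target,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

A caller would pass a buffer, a wait type, and the two LSNs to format_timeout_message(). The switch is longer than a table of verbs and nouns, but word order and grammatical agreement stay entirely within each translated string.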

I've corrected the patchset. I mostly changed just comments, formatting,
etc. I'm going to push it if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v13-0001-Extend-xlogwait-infrastructure-with-write-and-fl.patch (application/octet-stream)
From 3f7b1deaae59f45d8e049cf3b95ac7716ab38471 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v13 1/4] Extend xlogwait infrastructure with write and flush
 wait types

Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes are as follows.
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to
  WaitLSNType.
- Add GetCurrentLSNForWaitType() to retrieve the current LSN for each wait
  type.
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility.
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 src/backend/access/transam/xlog.c             |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  4 +-
 src/backend/access/transam/xlogwait.c         | 96 +++++++++++++++----
 src/backend/commands/wait.c                   |  2 +-
 .../utils/activity/wait_event_names.txt       |  3 +-
 src/include/access/xlogwait.h                 | 14 ++-
 6 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e71b6e21123..05ac7c5f7f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6280,7 +6280,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a21ac48c9fe..0b5f871abe7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1856,8 +1856,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
-				WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+				WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6c2bda763e2..5020ae1e52d 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
  *		This file implements waiting for WAL operations to reach specific LSNs
  *		on both physical standby and primary servers. The core idea is simple:
  *		every process that wants to wait publishes the LSN it needs to the
- *		shared memory, and the appropriate process (startup on standby, or
- *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *		shared memory, and the appropriate process (startup on standby,
+ *		walreceiver on standby, or WAL writer/backend on primary) wakes it
+ *		once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
  *		of the backend processes.  The elements of that array are organized
- *		into a pairing heap waitersHeap, which allows for very fast finding
- *		of the least awaited LSN.
+ *		into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ *		allows for very fast finding of the least awaited LSN for each type.
  *
- *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
- *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch until it is woken up by the appropriate
- *		process, standby is promoted, or the postmaster	dies.  Then, it cleans
- *		information about itself in the shared memory.
+ *		In addition, the least-awaited LSN for each type is cached in the
+ *		minWaitedLSN array.  The waiter process publishes information about
+ *		itself to the shared memory and waits on the latch until it is woken
+ *		up by the appropriate process, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		On standby servers: After replaying a WAL record, the startup process
- *		first performs a fast path check minWaitedLSN > replayLSN.  If this
- *		check is negative, it checks waitersHeap and wakes up the backend
- *		whose awaited LSNs are reached.
+ *		On standby servers:
+ *		- After replaying a WAL record, the startup process performs a fast
+ *		  path check minWaitedLSN[REPLAY] > replayLSN.  If this check is
+ *		  negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ *		  whose awaited LSNs are reached.
+ *		- After receiving WAL, the walreceiver process performs similar checks
+ *		  against the flush and write LSNs, waking up waiters in the FLUSH
+ *		  and WRITE heaps, respectively.
  *
  *		On primary servers: After flushing WAL, the WAL writer or backend
  *		process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
 #include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/latch.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
@@ -62,6 +68,47 @@ static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
 
 struct WaitLSNState *waitLSNState = NULL;
 
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+	[WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+	[WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+	[WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+	[WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+				 "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+	Assert(lsnType >= 0 && lsnType < WAIT_LSN_TYPE_COUNT);
+
+	switch (lsnType)
+	{
+		case WAIT_LSN_TYPE_STANDBY_REPLAY:
+			return GetXLogReplayRecPtr(NULL);
+
+		case WAIT_LSN_TYPE_STANDBY_WRITE:
+			return GetWalRcvWriteRecPtr();
+
+		case WAIT_LSN_TYPE_STANDBY_FLUSH:
+			return GetWalRcvFlushRecPtr(NULL, NULL);
+
+		case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+			return GetFlushRecPtr(NULL);
+	}
+
+	elog(ERROR, "invalid LSN wait type: %d", lsnType);
+	pg_unreachable();
+}
+
 /* Report the amount of shared memory space needed for WaitLSNState. */
 Size
 WaitLSNShmemSize(void)
@@ -302,6 +349,19 @@ WaitLSNCleanup(void)
 	}
 }
 
+/*
+ * Check if the given LSN type requires recovery to be in progress.
+ * Standby wait types (replay, write, flush) require recovery;
+ * primary wait types (flush) do not.
+ */
+static inline bool
+WaitLSNTypeRequiresRecovery(WaitLSNType t)
+{
+	return t == WAIT_LSN_TYPE_STANDBY_REPLAY ||
+		t == WAIT_LSN_TYPE_STANDBY_WRITE ||
+		t == WAIT_LSN_TYPE_STANDBY_FLUSH;
+}
+
 /*
  * Wait using MyLatch till the given LSN is reached, the replica gets
  * promoted, or the postmaster dies.
@@ -341,13 +401,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = -1;
 
-		if (lsnType == WAIT_LSN_TYPE_REPLAY)
-			currentLSN = GetXLogReplayRecPtr(NULL);
-		else
-			currentLSN = GetFlushRecPtr(NULL);
+		/* Get current LSN for the wait type */
+		currentLSN = GetCurrentLSNForWaitType(lsnType);
 
 		/* Check that recovery is still in-progress */
-		if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+		if (WaitLSNTypeRequiresRecovery(lsnType) && !RecoveryInProgress())
 		{
 			/*
 			 * Recovery was ended, but check if target LSN was already
@@ -376,7 +434,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		rc = WaitLatch(MyLatch, wake_events, delay_ms,
-					   (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+					   WaitLSNWaitEvents[lsnType]);
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index d43dfd642d6..4867f59691e 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 43d870dbcf1..3299de23bb3 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary or standby."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE	"Waiting for WAL write to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index b5fd3e74f1c..d12531d32b8 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.h
- *	  Declarations for LSN replay waiting routines.
+ *	  Declarations for WAL flush, write, and replay waiting routines.
  *
  * Copyright (c) 2025-2026, PostgreSQL Global Development Group
  *
@@ -35,11 +35,16 @@ typedef enum
  */
 typedef enum WaitLSNType
 {
-	WAIT_LSN_TYPE_REPLAY,		/* Waiting for replay on standby */
-	WAIT_LSN_TYPE_FLUSH,		/* Waiting for flush on primary */
+	/* Standby wait types (walreceiver/startup wakes) */
+	WAIT_LSN_TYPE_STANDBY_REPLAY,
+	WAIT_LSN_TYPE_STANDBY_WRITE,
+	WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+	/* Primary wait types (WAL writer/backends wake) */
+	WAIT_LSN_TYPE_PRIMARY_FLUSH,
 } WaitLSNType;
 
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
 
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
 extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
-- 
2.39.5 (Apple Git-154)
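
The header comment in the patch above describes a per-type fast path: each wait type caches its least-awaited LSN, and the waking process only inspects the waiters when the current LSN has reached that minimum. A minimal standalone sketch of that idea follows. It is not the patch's code: the real implementation keeps waiters in a pairing heap in shared memory and reads the minimum atomically, whereas this model uses a small fixed array, and every name in it (WaitState, wait_state_add, wait_state_wakeup, LSN_NONE) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

#define LSN_NONE	UINT64_MAX	/* hypothetical "nobody is waiting" sentinel */
#define MAX_WAITERS 8

/* Hypothetical stand-in for the four wait types named in the patch. */
typedef enum
{
	WAIT_STANDBY_REPLAY,
	WAIT_STANDBY_WRITE,
	WAIT_STANDBY_FLUSH,
	WAIT_PRIMARY_FLUSH,
	WAIT_TYPE_COUNT
} WaitType;

typedef struct
{
	Lsn			target;
	bool		waiting;
} Waiter;

typedef struct
{
	Lsn			min_waited[WAIT_TYPE_COUNT];	/* cached minimum per type */
	Waiter		waiters[WAIT_TYPE_COUNT][MAX_WAITERS];
} WaitState;

static void
wait_state_init(WaitState *s)
{
	for (int t = 0; t < WAIT_TYPE_COUNT; t++)
	{
		s->min_waited[t] = LSN_NONE;
		for (int i = 0; i < MAX_WAITERS; i++)
			s->waiters[t][i] = (Waiter) {0, false};
	}
}

/* Register a waiter for the given type; returns false if no slot is free. */
static bool
wait_state_add(WaitState *s, WaitType t, Lsn target)
{
	assert((int) t >= 0 && (int) t < WAIT_TYPE_COUNT);

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (!s->waiters[t][i].waiting)
		{
			s->waiters[t][i] = (Waiter) {target, true};
			if (target < s->min_waited[t])
				s->min_waited[t] = target;
			return true;
		}
	}
	return false;
}

/* Wake waiters of type t whose target is <= current; returns how many. */
static int
wait_state_wakeup(WaitState *s, WaitType t, Lsn current)
{
	int			woken = 0;
	Lsn			new_min = LSN_NONE;

	assert((int) t >= 0 && (int) t < WAIT_TYPE_COUNT);

	/* Fast path: the least-awaited LSN has not been reached yet. */
	if (s->min_waited[t] > current)
		return 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		Waiter	   *w = &s->waiters[t][i];

		if (!w->waiting)
			continue;
		if (w->target <= current)
		{
			w->waiting = false;	/* the real code would set the latch here */
			woken++;
		}
		else if (w->target < new_min)
			new_min = w->target;
	}

	/* Recompute the cached minimum from the waiters that remain. */
	s->min_waited[t] = new_min;
	return woken;
}
```

Because each type has its own minimum, a flush notification never scans replay waiters, which is the property the per-type minWaitedLSN array appears to be after.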

v13-0004-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait.patch (application/octet-stream)
From 5827bd2d978757e910f2bf00f8e2006abc563d24 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 3 Jan 2026 00:49:10 +0200
Subject: [PATCH v13 4/4] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 59 +++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 955dfc0e7f8..a28ea89aa10 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,53 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write' => 'standby_write',
+				'flush' => 'standby_flush',);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+			my $output = $standby_node->safe_psql('postgres', $wait_query);
+			chomp($output);
+
+			if ($output ne 'success')
+			{
+				# Fetch additional detail for debugging purposes
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN failed with status:
+	${output});
+				diag qq(Last pg_stat_replication contents:
+	${details});
+				croak "failed waiting for catchup";
+			}
+			print "done\n";
+			return;
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.39.5 (Apple Git-154)
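
The decision made by wait_for_catchup() in the patch above (use WAIT FOR LSN only when a node object is available, the mode is not 'sent', and the standby is in recovery; otherwise fall back to polling) can be restated compactly. The sketch below does so in C purely for illustration; the patch itself is Perl, and the function names wait_mode_for and build_wait_query, as well as the have_node and in_recovery flags, are hypothetical. The query text follows the format string used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Map a wait_for_catchup() mode to the corresponding WAIT FOR LSN mode.
 * Returns NULL when there is no counterpart (for example 'sent').
 */
static const char *
wait_mode_for(const char *mode)
{
	if (strcmp(mode, "replay") == 0)
		return "standby_replay";
	if (strcmp(mode, "write") == 0)
		return "standby_write";
	if (strcmp(mode, "flush") == 0)
		return "standby_flush";
	return NULL;
}

/*
 * Build the WAIT FOR LSN query if the fast path applies.  Returns false
 * when the caller should fall back to polling pg_stat_replication, or
 * when the buffer is too small to hold the query.
 */
static bool
build_wait_query(char *buf, size_t buflen, const char *mode,
				 const char *target_lsn, int timeout_secs,
				 bool have_node, bool in_recovery)
{
	const char *wait_mode;
	int			n;

	assert(buf != NULL && mode != NULL && target_lsn != NULL);

	wait_mode = wait_mode_for(mode);
	if (!have_node || !in_recovery || wait_mode == NULL)
		return false;

	n = snprintf(buf, buflen,
				 "WAIT FOR LSN '%s' WITH (MODE '%s', timeout '%ds', no_throw);",
				 target_lsn, wait_mode, timeout_secs);
	return n >= 0 && (size_t) n < buflen;
}
```

With no_throw in the option list, the caller compares the returned status string against 'success' rather than relying on an error, which matches how the Perl code checks the output.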

v13-0003-Add-tab-completion-for-the-WAIT-FOR-LSN-MODE-opt.patch (application/octet-stream)
From 1d1ee0be7975a94e0f062a96e77be11322f335fa Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 3 Jan 2026 00:42:32 +0200
Subject: [PATCH v13 3/4] Add tab completion for the WAIT FOR LSN MODE option

Update psql tab completion to support the optional MODE option in the
WAIT FOR LSN command.  Inside the WITH ( ... ) option list, completion now
offers mode alongside timeout and no_throw.  The MODE option specifies which
LSN type to wait for.  In particular, it controls whether the wait is
evaluated from the standby or primary perspective.

When MODE is specified, the completion suggests the valid mode values:
standby_replay, standby_write, standby_flush, and primary_flush.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 src/bin/psql/tab-complete.in.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index d81f2fcdbe6..06edea98f06 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5355,8 +5355,11 @@ match_previous_words(int pattern_id,
 /*
  * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
  * where option can be:
+ *   MODE '<mode>'
  *   TIMEOUT '<timeout>'
  *   NO_THROW
+ * and mode can be:
+ *   standby_replay | standby_write | standby_flush | primary_flush
  */
 	else if (Matches("WAIT"))
 		COMPLETE_WITH("FOR");
@@ -5369,21 +5372,24 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH");
 	else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
 		COMPLETE_WITH("(");
+
+	/*
+	 * Handle parenthesized option list.  This fires when we're in an
+	 * unfinished parenthesized option list.  get_previous_words treats a
+	 * completed parenthesized option list as one word, so the test below is
+	 * correct.
+	 *
+	 * 'mode' takes a string value (one of those listed above), 'timeout'
+	 * takes a string value, and 'no_throw' takes no value.  We do not offer
+	 * completions for the *values* of 'timeout' or 'no_throw'.
+	 */
 	else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
 			 !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
 	{
-		/*
-		 * This fires if we're in an unfinished parenthesized option list.
-		 * get_previous_words treats a completed parenthesized option list as
-		 * one word, so the above test is correct.
-		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("timeout", "no_throw");
-
-		/*
-		 * timeout takes a string value, no_throw takes no value. We don't
-		 * offer completions for these values.
-		 */
+			COMPLETE_WITH("mode", "timeout", "no_throw");
+		else if (TailMatches("mode"))
+			COMPLETE_WITH("'standby_replay'", "'standby_write'", "'standby_flush'", "'primary_flush'");
 	}
 
 /* WITH [RECURSIVE] */
-- 
2.39.5 (Apple Git-154)
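
The four mode values offered by the completion above split into standby modes, which need recovery to be in progress, and one primary mode, which needs the opposite. A small self-contained sketch of parsing a mode name case-insensitively and checking it against the server state follows. It is an illustration under assumptions, not the patch's code: the WaitMode enum, ci_equal, parse_wait_mode and mode_allowed are hypothetical names, and the real command reports a mismatch as an error or a status string rather than a boolean.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enum mirroring the four documented mode values. */
typedef enum
{
	MODE_STANDBY_REPLAY,
	MODE_STANDBY_WRITE,
	MODE_STANDBY_FLUSH,
	MODE_PRIMARY_FLUSH,
	MODE_COUNT
} WaitMode;

/* ASCII case-insensitive string equality. */
static bool
ci_equal(const char *a, const char *b)
{
	while (*a && *b)
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == *b;
}

/* Parse a mode name; returns false for anything unrecognized. */
static bool
parse_wait_mode(const char *s, WaitMode *out)
{
	static const char *const names[MODE_COUNT] = {
		[MODE_STANDBY_REPLAY] = "standby_replay",
		[MODE_STANDBY_WRITE] = "standby_write",
		[MODE_STANDBY_FLUSH] = "standby_flush",
		[MODE_PRIMARY_FLUSH] = "primary_flush",
	};

	assert(s != NULL && out != NULL);

	for (int i = 0; i < MODE_COUNT; i++)
	{
		if (ci_equal(s, names[i]))
		{
			*out = (WaitMode) i;
			return true;
		}
	}
	return false;
}

/* Standby modes require recovery; the primary mode does not. */
static bool
mode_requires_recovery(WaitMode m)
{
	return m != MODE_PRIMARY_FLUSH;
}

/* True if the mode matches the current server state. */
static bool
mode_allowed(WaitMode m, bool in_recovery)
{
	return mode_requires_recovery(m) == in_recovery;
}
```

Keeping the name table indexed by the enum means a newly added mode only needs one extra row, in the same spirit as the WaitLSNWaitEvents table guarded by a static assertion in the first patch.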

v13-0002-Add-the-MODE-option-to-the-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 4ff3736d2becbff9931ec571098f8ca44081b18c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 3 Jan 2026 00:38:47 +0200
Subject: [PATCH v13 2/4] Add the MODE option to the WAIT FOR LSN command

This commit extends the WAIT FOR LSN command with an optional MODE option in
the WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
 - 'standby_replay' (default): Wait for WAL to be replayed to the specified
   LSN,
 - 'standby_write': Wait for WAL to be written (received) to the specified
   LSN,
 - 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN,
 - 'primary_flush': Wait for WAL to be flushed to disk on the primary server.

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands, where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby
operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server),
- Primary mode ('primary_flush') can only be used on a primary server.

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

This commit also includes:
- Documentation updates for the new syntax and mode descriptions,
- Test coverage for all four modes, including error cases and concurrent
  waiters,
- Wakeup logic in walreceiver for standby write/flush waiters,
- Wakeup logic in WAL writer for primary flush waiters.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 doc/src/sgml/ref/wait_for.sgml          | 213 +++++++++---
 src/backend/access/transam/xlog.c       |  22 +-
 src/backend/commands/wait.c             | 174 ++++++++--
 src/backend/replication/walreceiver.c   |  18 ++
 src/test/recovery/t/049_wait_for_lsn.pl | 411 ++++++++++++++++++++++--
 5 files changed, 741 insertions(+), 97 deletions(-)

diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..df72b3327c8 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
 
  <refnamediv>
   <refname>WAIT FOR</refname>
-  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+  <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym></refpurpose>
  </refnamediv>
 
  <refsynopsisdiv>
 <synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+    [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    MODE '<replaceable class="parameter">mode</replaceable>'
     TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
     NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+    standby_replay | standby_write | standby_flush | primary_flush
 </synopsis>
  </refsynopsisdiv>
 
@@ -34,20 +40,27 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Description</title>
 
   <para>
-    Waits until recovery replays <parameter>lsn</parameter>.
-    If no <parameter>timeout</parameter> is specified or it is set to
-    zero, this command waits indefinitely for the
-    <parameter>lsn</parameter>.
-    On timeout, or if the server is promoted before
-    <parameter>lsn</parameter> is reached, an error is emitted,
-    unless <literal>NO_THROW</literal> is specified in the WITH clause.
-    If <parameter>NO_THROW</parameter> is specified, then the command
-    doesn't throw errors.
+   Waits until the specified <parameter>lsn</parameter> is reached
+   according to the specified <parameter>mode</parameter>,
+   which determines whether to wait for WAL to be written, flushed, or replayed.
+   If no <parameter>timeout</parameter> is specified or it is set to
+   zero, this command waits indefinitely for the
+   <parameter>lsn</parameter>.
+  </para>
+
+  <para>
+   On timeout, an error is emitted unless <literal>NO_THROW</literal>
+   is specified in the WITH clause. For standby modes
+   (<literal>standby_replay</literal>, <literal>standby_write</literal>,
+   <literal>standby_flush</literal>), an error is also emitted if the
+   server is promoted before the <parameter>lsn</parameter> is reached.
+   If <parameter>NO_THROW</parameter> is specified, the command returns
+   a status string instead of throwing errors.
   </para>
 
   <para>
-    The possible return values are <literal>success</literal>,
-    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+   The possible return values are <literal>success</literal>,
+   <literal>timeout</literal>, and <literal>not in recovery</literal>.
   </para>
  </refsect1>
 
@@ -72,6 +85,65 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
       The following parameters are supported:
 
       <variablelist>
+       <varlistentry>
+        <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+        <listitem>
+         <para>
+          Specifies the type of LSN processing to wait for. If not specified,
+          the default is <literal>standby_replay</literal>. The valid modes are:
+         </para>
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>standby_replay</literal>: Wait for the LSN to be replayed
+            (applied to the database) on a standby server. After successful
+            completion, <function>pg_last_wal_replay_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_write</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and written to disk on a
+            standby server, but not yet flushed. This is faster than
+            <literal>standby_flush</literal> but provides weaker durability
+            guarantees since the data may still be in operating system
+            buffers. After successful completion, the
+            <structfield>written_lsn</structfield> column in
+            <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link> will show
+            a value greater than or equal to the target LSN. This mode can
+            only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>standby_flush</literal>: Wait for the WAL containing the
+            LSN to be received from the primary and flushed to disk on a
+            standby server. This provides a durability guarantee without
+            waiting for the WAL to be applied. After successful completion,
+            <function>pg_last_wal_receive_lsn()</function> will return a
+            value greater than or equal to the target LSN. This value is
+            also available as the <structfield>flushed_lsn</structfield>
+            column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+            <structname>pg_stat_wal_receiver</structname></link>. This mode
+            can only be used during recovery.
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>primary_flush</literal>: Wait for the WAL containing the
+            LSN to be flushed to disk on a primary server. After successful
+            completion, <function>pg_current_wal_flush_lsn()</function> will
+            return a value greater than or equal to the target LSN. This mode
+            can only be used on a primary server (not during recovery).
+           </para>
+          </listitem>
+         </itemizedlist>
+        </listitem>
+       </varlistentry>
+
        <varlistentry>
         <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
         <listitem>
@@ -135,9 +207,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
     <listitem>
      <para>
       This return value denotes that the database server is not in a recovery
-      state.  This might mean either the database server was not in recovery
-      at the moment of receiving the command, or it was promoted before
-      reaching the target <parameter>lsn</parameter>.
+      state. This might mean either the database server was not in recovery
+      at the moment of receiving the command (i.e., executed on a primary),
+      or it was promoted before reaching the target <parameter>lsn</parameter>.
+      In the promotion case, this status indicates a timeline change occurred,
+      and the application should re-evaluate whether the target LSN is still
+      relevant.
      </para>
     </listitem>
    </varlistentry>
@@ -148,25 +223,34 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Notes</title>
 
   <para>
-    <command>WAIT FOR</command> command waits till
-    <parameter>lsn</parameter> to be replayed on standby.
-    That is, after this command execution, the value returned by
-    <function>pg_last_wal_replay_lsn</function> should be greater or equal
-    to the <parameter>lsn</parameter> value.  This is useful to achieve
-    read-your-writes-consistency, while using async replica for reads and
-    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
-    modification should be stored on the client application side or the
-    connection pooler side.
+   <command>WAIT FOR</command> waits until the specified
+   <parameter>lsn</parameter> is reached according to the specified
+   <parameter>mode</parameter>. The <literal>standby_replay</literal> mode
+   waits for the LSN to be replayed (applied to the database), which is
+   useful to achieve read-your-writes consistency while using an async
+   replica for reads and the primary for writes. The
+   <literal>standby_flush</literal> mode waits for the WAL to be flushed
+   to durable storage on the replica, providing a durability guarantee
+   without waiting for replay. The <literal>standby_write</literal> mode
+   waits for the WAL to be written to the operating system, which is
+   faster than flush but provides weaker durability guarantees. The
+   <literal>primary_flush</literal> mode waits for WAL to be flushed on
+   a primary server. In all cases, the <acronym>LSN</acronym> of the last
+   modification should be stored on the client application side or the
+   connection pooler side.
   </para>
 
   <para>
-    <command>WAIT FOR</command> command should be called on standby.
-    If a user runs <command>WAIT FOR</command> on primary, it
-    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
-    However, if <command>WAIT FOR</command> is
-    called on primary promoted from standby and <literal>lsn</literal>
-    was already replayed, then the <command>WAIT FOR</command> command just
-    exits immediately.
+   The standby modes (<literal>standby_replay</literal>,
+   <literal>standby_write</literal>, <literal>standby_flush</literal>)
+   can only be used during recovery, and <literal>primary_flush</literal>
+   can only be used on a primary server. Using the wrong mode for the
+   current server state will result in an error. If a standby is promoted
+   while waiting with a standby mode, the command will return
+   <literal>not in recovery</literal> (or throw an error if
+   <literal>NO_THROW</literal> is not specified). Promotion creates a new
+   timeline, and the LSN being waited for may refer to WAL from the old
+   timeline.
   </para>
 
 </refsect1>
@@ -175,21 +259,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
   <title>Examples</title>
 
   <para>
-    You can use <command>WAIT FOR</command> command to wait for
-    the <type>pg_lsn</type> value.  For example, an application could update
-    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
-    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
-    on primary server to get the <acronym>lsn</acronym> given that
-    <varname>synchronous_commit</varname> could be set to
-    <literal>off</literal>.
+   You can use the <command>WAIT FOR</command> command to wait for
+   a <type>pg_lsn</type> value.  For example, an application could update
+   the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+   the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   on the primary server to get the <acronym>LSN</acronym>, given that
+   <varname>synchronous_commit</varname> could be set to
+   <literal>off</literal>.
 
    <programlisting>
 postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
 UPDATE 100
 postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
 (1 row)
 </programlisting>
 
@@ -200,7 +284,7 @@ pg_current_wal_insert_lsn
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20';
  status
---------
+---------
  success
 (1 row)
 postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +295,43 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
   </para>
 
   <para>
-    If the target LSN is not reached before the timeout, the error is thrown.
+   Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'standby_write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   Wait for flush on primary:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'primary_flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+  </para>
+
+  <para>
+   If the target LSN is not reached before the timeout, an error is thrown:
 
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +341,12 @@ ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current
 
   <para>
    The same example uses <command>WAIT FOR</command> with
-   <parameter>NO_THROW</parameter> option.
+   <parameter>NO_THROW</parameter> option:
+
 <programlisting>
 postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
  status
---------
+---------
  timeout
 (1 row)
 </programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 05ac7c5f7f8..81dc86847c0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,6 +2918,14 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3100,6 +3108,14 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_PRIMARY_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_PRIMARY_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6277,10 +6293,12 @@ StartupXLOG(void)
 	WakeupCheckpointer();
 
 	/*
-	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before reaching the target LSN.
+	 * Wake up all waiters.  They need to report an error that recovery was
+	 * ended before reaching the target LSN.
 	 */
 	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+	WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 4867f59691e..264f81571d4 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
  *
  * wait.c
  *	  Implements WAIT FOR, which allows waiting for events such as
- *	  time passing or LSN having been replayed on replica.
+ *	  time passing or LSN having been replayed, flushed, or written.
  *
  * Portions Copyright (c) 2025-2026, PostgreSQL Global Development Group
  *
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
 #include "commands/defrem.h"
@@ -34,12 +35,14 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	XLogRecPtr	lsn;
 	int64		timeout = 0;
 	WaitLSNResult waitLSNResult;
+	WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
 	bool		throw = true;
 	TupleDesc	tupdesc;
 	TupOutputState *tstate;
 	const char *result = "<unset>";
 	bool		timeout_specified = false;
 	bool		no_throw_specified = false;
+	bool		mode_specified = false;
 
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +50,32 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 	foreach_node(DefElem, defel, stmt->options)
 	{
-		if (strcmp(defel->defname, "timeout") == 0)
+		if (strcmp(defel->defname, "mode") == 0)
+		{
+			char	   *mode_str;
+
+			if (mode_specified)
+				errorConflictingDefElem(defel, pstate);
+			mode_specified = true;
+
+			mode_str = defGetString(defel);
+
+			if (pg_strcasecmp(mode_str, "standby_replay") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+			else if (pg_strcasecmp(mode_str, "standby_write") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+			else if (pg_strcasecmp(mode_str, "standby_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+			else if (pg_strcasecmp(mode_str, "primary_flush") == 0)
+				lsnType = WAIT_LSN_TYPE_PRIMARY_FLUSH;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("unrecognized value for %s option \"%s\": \"%s\"",
+								"WAIT", defel->defname, mode_str),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "timeout") == 0)
 		{
 			char	   *timeout_str;
 			const char *hintmsg;
@@ -107,8 +135,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	}
 
 	/*
-	 * We are going to wait for the LSN replay.  We should first care that we
-	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
 	 * Otherwise, our snapshot could prevent the replay of WAL records
 	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
 	 * command, not a procedure or function.
@@ -140,7 +168,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	 */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
-	waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+	/*
+	 * Validate that the requested mode matches the current server state.
+	 * Primary modes can only be used on a primary.
+	 */
+	if (lsnType == WAIT_LSN_TYPE_PRIMARY_FLUSH)
+	{
+		if (RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is in progress"),
+					 errhint("Waiting for primary_flush can only be done on a primary server. "
+							 "Use standby_flush mode on a standby server.")));
+	}
+
+	/* Now wait for the LSN */
+	waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
 
 	/*
 	 * Process the result of WaitForLSN().  Throw appropriate error if needed.
@@ -154,11 +197,48 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 
 		case WAIT_LSN_RESULT_TIMEOUT:
 			if (throw)
-				ereport(ERROR,
-						errcode(ERRCODE_QUERY_CANCELED),
-						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
-							   LSN_FORMAT_ARGS(lsn),
-							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			{
+				XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
+				switch (lsnType)
+				{
+					case WAIT_LSN_TYPE_STANDBY_REPLAY:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_STANDBY_WRITE:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be written; current standby_write LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_STANDBY_FLUSH:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be flushed; current standby_flush LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+						ereport(ERROR,
+								errcode(ERRCODE_QUERY_CANCELED),
+								errmsg("timed out while waiting for target LSN %X/%08X to be flushed; current primary_flush LSN %X/%08X",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(currentLSN)));
+						break;
+
+					default:
+						elog(ERROR, "unexpected wait LSN type %d", lsnType);
+						pg_unreachable();
+				}
+			}
 			else
 				result = "timeout";
 			break;
@@ -168,18 +248,72 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 			{
 				if (PromoteIsTriggered())
 				{
-					ereport(ERROR,
-							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-							errmsg("recovery is not in progress"),
-							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
-									  LSN_FORMAT_ARGS(lsn),
-									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+					XLogRecPtr	currentLSN = GetCurrentLSNForWaitType(lsnType);
+
+					switch (lsnType)
+					{
+						case WAIT_LSN_TYPE_STANDBY_REPLAY:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was replayed; last standby_replay LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_WRITE:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was written; last standby_write LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_FLUSH:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errdetail("Recovery ended before target LSN %X/%08X was flushed; last standby_flush LSN %X/%08X.",
+											  LSN_FORMAT_ARGS(lsn),
+											  LSN_FORMAT_ARGS(currentLSN)));
+							break;
+
+						default:
+							elog(ERROR, "unexpected wait LSN type %d", lsnType);
+							pg_unreachable();
+					}
 				}
 				else
-					ereport(ERROR,
-							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-							errmsg("recovery is not in progress"),
-							errhint("Waiting for the replay LSN can only be executed during recovery."));
+				{
+					switch (lsnType)
+					{
+						case WAIT_LSN_TYPE_STANDBY_REPLAY:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_replay LSN can only be executed during recovery."));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_WRITE:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_write LSN can only be executed during recovery."));
+							break;
+
+						case WAIT_LSN_TYPE_STANDBY_FLUSH:
+							ereport(ERROR,
+									errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+									errmsg("recovery is not in progress"),
+									errhint("Waiting for the standby_flush LSN can only be executed during recovery."));
+							break;
+
+						default:
+							elog(ERROR, "unexpected wait LSN type %d", lsnType);
+							pg_unreachable();
+					}
+				}
 			}
 			else
 				result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac002f730c3..a41453530a1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "catalog/pg_authid.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
@@ -965,6 +966,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 
+	/*
+	 * If we wrote an LSN that someone was waiting for, notify the waiters.
+	 */
+	if (waitLSNState &&
+		(LogstreamResult.Write >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
 	/*
 	 * Close the current segment if it's fully written up in the last cycle of
 	 * the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1013,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/*
+		 * If we flushed an LSN that someone was waiting for, notify the
+		 * waiters.
+		 */
+		if (waitLSNState &&
+			(LogstreamResult.Flush >=
+			 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+			WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..b767b475ff7 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,5 +1,6 @@
-# Checks waiting for the LSN replay on standby using
-# the WAIT FOR command.
+# Checks waiting for the LSN using the WAIT FOR command.
+# Tests standby modes (standby_replay/standby_write/standby_flush) on standby
+# and primary_flush mode on primary.
 use strict;
 use warnings FATAL => 'all';
 
@@ -7,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+	my ($node) = @_;
+	$saved_primary_conninfo = $node->safe_psql(
+		'postgres', qq[
+		SELECT pg_catalog.quote_literal(setting)
+		FROM pg_settings
+		WHERE name = 'primary_conninfo';
+	]);
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+	my ($node) = @_;
+	$node->safe_psql(
+		'postgres', qq[
+		ALTER SYSTEM SET primary_conninfo = $saved_primary_conninfo;
+		SELECT pg_reload_conf();
+	]);
+
+	$node->poll_query_until('postgres',
+		"SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
@@ -62,7 +99,52 @@ $output = $node_standby->safe_psql(
 ok((split("\n", $output))[-1] eq 30,
 	"standby reached the same LSN as primary");
 
-# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# 3. Check that WAIT FOR works with standby_write, standby_flush, and
+# primary_flush modes.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_write}' WITH (MODE 'standby_write', timeout '1d');
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby wrote WAL up to target LSN after WAIT FOR with MODE 'standby_write'"
+);
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_flush}' WITH (MODE 'standby_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"standby flushed WAL up to target LSN after WAIT FOR with MODE 'standby_flush'"
+);
+
+# Check primary_flush mode on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn_primary_flush =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_primary->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn_primary_flush}' WITH (MODE 'primary_flush', timeout '1d');
+	SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '${lsn_primary_flush}'::pg_lsn);
+]);
+
+ok( (split("\n", $output))[-1] >= 0,
+	"primary flushed WAL up to target LSN after WAIT FOR with MODE 'primary_flush'"
+);
+
+# 4. Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
 my $lsn3 =
@@ -88,14 +170,26 @@ $output = $node_standby->safe_psql(
 	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
 ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
 
-# 4. Check that WAIT FOR triggers an error if called on primary,
-# within another function, or inside a transaction with an isolation level
-# higher than READ COMMITTED.
+# 5. Check mode validation: standby modes error on primary, primary mode errors
+# on standby, and primary_flush works on primary.  Also check that WAIT FOR
+# triggers an error if called within another function or inside a transaction
+# with an isolation level higher than READ COMMITTED.
+
+# Test standby_flush on primary - should error
+$node_primary->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'standby_flush');",
+	stderr => \$stderr);
+ok($stderr =~ /recovery is not in progress/,
+	"get an error when running standby_flush on the primary");
 
-$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+# Test primary_flush on standby - should error
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (MODE 'primary_flush');",
 	stderr => \$stderr);
-ok( $stderr =~ /recovery is not in progress/,
-	"get an error when running on the primary");
+ok($stderr =~ /recovery is in progress/,
+	"get an error when running primary_flush on the standby");
 
 $node_standby->psql(
 	'postgres',
@@ -125,7 +219,7 @@ ok( $stderr =~
 	  /WAIT FOR must be only called without an active or registered snapshot/,
 	"get an error when running within another function");
 
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 
@@ -208,10 +302,26 @@ $node_standby->psql(
 ok( $stderr =~ /option "invalid_option" not recognized/,
 	"get error for invalid WITH clause option");
 
-# 6. Check the scenario of multiple LSN waiters.  We make 5 background
-# psql sessions each waiting for a corresponding insertion.  When waiting is
-# finished, stored procedures logs if there are visible as many rows as
-# should be.
+# Test invalid MODE value
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "mode": "invalid"/,
+	"get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (MODE 'standby_replay', MODE 'standby_write');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple standby_replay waiters.  We make 5
+# background psql sessions each waiting for a corresponding insertion.  When
+# waiting is finished, stored procedures logs if there are visible as many
+# rows as should be.
 $node_primary->safe_psql(
 	'postgres', qq[
 CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
@@ -225,8 +335,17 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
   END
 \$\$
 LANGUAGE plpgsql;
+
+CREATE FUNCTION log_wait_done(prefix text, i int) RETURNS void AS \$\$
+  BEGIN
+    RAISE LOG '% %', prefix, i;
+  END
+\$\$
+LANGUAGE plpgsql;
 ]);
+
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
@@ -243,6 +362,7 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
+
 my $log_offset = -s $node_standby->logfile;
 $node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
@@ -251,23 +371,246 @@ for (my $i = 0; $i < 5; $i++)
 	$psql_sessions[$i]->quit;
 }
 
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple standby_replay waiters reported consistent data');
+
+# 7b. Check the scenario of multiple standby_write waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (100 + ${i});");
+	$write_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_write waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$write_sessions[$i] = $node_standby->background_psql('postgres');
+	$write_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'standby_write', timeout '1d');
+		SELECT log_wait_done('write_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("write_done $i", $write_log_offset);
+	$write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_write waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple standby_flush waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (200 + ${i});");
+	$flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start standby_flush waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$flush_sessions[$i] = $node_standby->background_psql('postgres');
+	$flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'standby_flush', timeout '1d');
+		SELECT log_wait_done('flush_done', $i);
+	]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+	$flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple standby_flush waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed standby mode waiters (standby_replay,
+# standby_write, standby_flush) running concurrently.  We start 6 sessions:
+# 2 for each mode, all waiting for the same target LSN.  We stop the
+# walreceiver and pause replay to ensure all waiters block.  Then we resume
+# replay and restart the walreceiver to verify they unblock and complete
+# correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('standby_replay', 'standby_write', 'standby_flush');
+for (my $i = 0; $i < 6; $i++)
+{
+	$mixed_sessions[$i] = $node_standby->background_psql('postgres');
+	$mixed_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+		SELECT log_wait_done('mixed_done', $i);
+	]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+	"SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
 
-# 7. Check that the standby promotion terminates the wait on LSN.  Start
-# waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.  Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+	$node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+	$mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+	       pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+	"mixed mode waiters: all modes completed and reached target LSN");
+
+# 7e. Check the scenario of multiple primary_flush waiters on primary.
+# We start 5 background sessions waiting for different LSNs with primary_flush
+# mode.  Each waiter logs when done.
+my @primary_flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (400 + ${i});");
+	$primary_flush_lsns[$i] =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+}
+
+my $primary_flush_log_offset = -s $node_primary->logfile;
+
+# Start primary_flush waiters
+my @primary_flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$primary_flush_sessions[$i] = $node_primary->background_psql('postgres');
+	$primary_flush_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '$primary_flush_lsns[$i]' WITH (MODE 'primary_flush', timeout '1d');
+		SELECT log_wait_done('primary_flush_done', $i);
+	]);
+}
+
+# The WAL should already be flushed, so waiters should complete quickly
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->wait_for_log("primary_flush_done $i",
+		$primary_flush_log_offset);
+	$primary_flush_sessions[$i]->quit;
+}
+
+# Verify on primary that WAL was flushed up to the target LSN
+$output = $node_primary->safe_psql('postgres',
+	"SELECT pg_lsn_cmp(pg_current_wal_flush_lsn(), '$primary_flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+	"multiple primary_flush waiters: primary flushed WAL up to target LSN");
+
+# 8. Check that the standby promotion terminates all standby wait modes.  Start
+# waiting for unreachable LSNs with standby_replay, standby_write, and
+# standby_flush modes, then promote.  Check the log for the relevant error
+# messages.  Also, check that waiting for already replayed LSN doesn't cause
+# an error even after promotion.
 my $lsn4 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
-	qr/start/, qq[
-	\\echo start
-	WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('standby_replay', 'standby_write', 'standby_flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i] = $node_standby->background_psql('postgres');
+	$wait_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+	]);
+}
 
 # Make sure standby will be promoted at least at the primary insert LSN we
 # have just observed.  Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +620,16 @@ $node_primary->wait_for_catchup($node_standby);
 
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+	$log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+	$log_offset);
+$node_standby->wait_for_log(
+	qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
 
 $node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
 
@@ -295,8 +645,11 @@ ok($output eq "not in recovery",
 $node_standby->stop;
 $node_primary->stop;
 
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
 # already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+	$wait_sessions[$i]->{run}->finish;
+}
 
 done_testing();
-- 
2.39.5 (Apple Git-154)
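
The wake-up hunks in the patch above share one pattern: read the smallest LSN that any waiter is blocked on with a single atomic load, and enter the wake-up routine only when the newly reached position is at or past it. Below is a minimal standalone sketch of that pattern. It uses C11 atomics instead of PostgreSQL's pg_atomic API, and the names Lsn, min_waited_lsn, wakeup_calls, wake_waiters, maybe_wake_waiters, and register_waiter are hypothetical, introduced only for illustration. Using UINT64_MAX to mean "no waiters" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/* Hypothetical shared state: smallest LSN any waiter is blocked on. */
_Atomic Lsn min_waited_lsn = UINT64_MAX;	/* assumed to mean "no waiters" */
int			wakeup_calls = 0;	/* counts how often the slow path ran */

/* Stand-in for the real wake-up routine; here it only counts calls. */
void
wake_waiters(Lsn reached)
{
	(void) reached;
	wakeup_calls++;
}

/*
 * Called after the WAL position has advanced to "reached".  The common case
 * (nobody is waiting for an LSN this low) costs one atomic read.  Returns
 * true if the wake-up routine was called.
 */
bool
maybe_wake_waiters(Lsn reached)
{
	if (reached >= atomic_load(&min_waited_lsn))
	{
		wake_waiters(reached);
		return true;
	}
	return false;
}

/* Single-threaded sketch: lower the minimum if this target is smaller. */
void
register_waiter(Lsn target)
{
	assert(target != UINT64_MAX);
	if (target < atomic_load(&min_waited_lsn))
		atomic_store(&min_waited_lsn, target);
}
```

For example, after register_waiter(500), a call to maybe_wake_waiters(499) returns false without touching the slow path, while maybe_wake_waiters(500) calls wake_waiters() once. The sketch never raises the minimum again after a waiter is released; a real implementation would have to do that as well.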

#84Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#83)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

On Sat, Jan 3, 2026 at 6:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Xuneng!

On Fri, Jan 2, 2026 at 11:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Jan 2, 2026 at 7:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Jan 1, 2026 at 7:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

In 0002 you have this kind of thing:

ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
-                                             errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+                                             errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
-                                                        LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+                                                        desc->verb,
+                                                        desc->noun,
+                                                        LSN_FORMAT_ARGS(currentLSN)));
+                     }

I'm afraid this technique doesn't work, for translatability reasons.
Your whole design of having a struct with ->verb and ->noun is not
workable (which is a pity, but you can't really fight this.) You need to
spell out the whole messages for each case, something like

if (lsntype == replay)
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
else if (lsntype == flush)
ereport( ... )

and so on. This means four separate messages for translation for each
message your patch is adding, which is IMO the correct approach.
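
A minimal sketch of the per-case approach described above, assuming plain snprintf() as a stand-in for ereport()/errmsg(). The names WaitKind, LSN_HI, LSN_LO, and format_timeout_message are hypothetical and only illustrate the shape. The point is that every case carries one complete sentence as a string literal, so a translator sees the whole message rather than a template with a verb and a noun spliced in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
	WAIT_REPLAY,
	WAIT_WRITE,
	WAIT_FLUSH
} WaitKind;

/* Split a 64-bit LSN into the two halves printed as %X/%08X. */
#define LSN_HI(lsn) ((unsigned int) ((lsn) >> 32))
#define LSN_LO(lsn) ((unsigned int) ((lsn) & 0xFFFFFFFFu))

/*
 * Format the timeout message for the given wait kind into buf.  Each case
 * uses its own complete format string.  Returns the snprintf() result, or
 * -1 for an unknown kind.
 */
int
format_timeout_message(char *buf, size_t len, WaitKind kind,
					   uint64_t target, uint64_t current)
{
	assert(buf != NULL);

	switch (kind)
	{
		case WAIT_REPLAY:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be replayed; current standby_replay LSN %X/%08X",
							LSN_HI(target), LSN_LO(target),
							LSN_HI(current), LSN_LO(current));
		case WAIT_WRITE:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be written; current standby_write LSN %X/%08X",
							LSN_HI(target), LSN_LO(target),
							LSN_HI(current), LSN_LO(current));
		case WAIT_FLUSH:
			return snprintf(buf, len,
							"timed out while waiting for target LSN %X/%08X to be flushed; current standby_flush LSN %X/%08X",
							LSN_HI(target), LSN_LO(target),
							LSN_HI(current), LSN_LO(current));
	}
	return -1;
}
```

The repetition is deliberate: word order and inflection differ between languages, so a format string with %s placeholders for the verb and noun cannot be translated reliably, while three full sentences can.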

+1
Thank you for catching this, Alvaro. Yes, I think we need to get rid
of WaitLSNTypeDesc. It's a nice idea, but we support too many languages
to have something like this.

Thanks for pointing this out. This approach doesn’t scale to multiple
languages. While switch statements are more verbose, the extra clarity
is justified to preserve proper internationalization. Please check the
updated v12.

I've corrected the patchset. I mostly changed just comments, formatting,
etc. I'm going to push it if there are no objections.

Thanks for updating the patchset. LGTM.

--
Best,
Xuneng

#85Thomas Munro
thomas.munro@gmail.com
In reply to: Xuneng Zhou (#84)
Re: Implement waiting for wal lsn replay: reloaded

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

     task_id      | substring |  status
------------------+-----------+-----------
 6460882231754752 | c970bdc0  | FAILED
 5771274900733952 | 6ca8506e  | FAILED
 6217757068361728 | 63ed3bc7  | FAILED
 5980650261446656 | ae283736  | FAILED
 6585898394976256 | 5f13999a  | COMPLETED
 4527474786172928 | 7f9acc9b  | COMPLETED
 4826100842364928 | e8d4e94a  | COMPLETED
 4540563027918848 | b9ee5f2d  | FAILED
 6358528648019968 | c5af141c  | FAILED
 5998005284765696 | e212a0f8  | COMPLETED
 6488580526178304 | b85d5dc0  | FAILED
 5034091344560128 | 7dc95cc3  | ABORTED
 5688692477526016 | bb048e31  | COMPLETED
 5481187977723904 | d351063e  | COMPLETED
 5101831568752640 | f30848cb  | COMPLETED <-- the change
 6395317408497664 | 3f33b63d  | COMPLETED
 6741325208354816 | 877ae5db  | COMPLETED
 4594007789010944 | de746e0d  | COMPLETED
 6497208998035456 | 461b8cc9  | COMPLETED

#86Xuneng Zhou
xunengzhou@gmail.com
In reply to: Thomas Munro (#85)
Re: Implement waiting for wal lsn replay: reloaded

Hi Thomas,

On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Thanks for raising this issue. I think it is related to f30848cb after
some analysis. I'll prepare a follow-up patch to fix it.

--
Best,
Xuneng

#87Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#86)
Re: Implement waiting for wal lsn replay: reloaded

On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Thanks for raising this issue. I think it is related to f30848cb after
some analysis. I'll prepare a follow-up patch to fix it.

Sorry, I've mistakenly referenced this report from commit [1]. I
thought it was related, but it appears not to be. [1] is related to
the report I got from Ruikai Peng off-list.

Regarding the present failure, could it happen before ExecWaitStmt()
calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
should make an effort to release these snapshots earlier.

1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f

------
Regards,
Alexander Korotkov
Supabase

#88Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#87)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Sorry, I've mistakenly referenced this report from commit [1]. I
thought it was related, but it appears not to be. [1] is related to
the report I got from Ruikai Peng off-list.

Regarding the present failure, could it happen before ExecWaitStmt()
calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
should make an effort to release these snapshots earlier.

1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f

I agree that moving PopActiveSnapshot() and
InvalidateCatalogSnapshot() to the very beginning of ExecWaitStmt()
appears to be a sensible optimization. However, in this particular
failure scenario, it may not address the issue.

For tablespace conflicts, recovery conflict resolution uses
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), which
returns all active backends, regardless of their snapshot state. As a
result, even if all snapshots are released at the start of
ExecWaitStmt(), the session would still be canceled during replay of
DROP TABLESPACE.
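
As a toy illustration of the behavior described above, the difference
between a conflict check with a valid xmin horizon and one called with
InvalidTransactionId might be modeled as follows. This is Python, not
PostgreSQL source: `Backend`, `conflicting_pids`, and the plain integer
xid comparison are simplifying assumptions made for this sketch, and
database filtering and xid wraparound are ignored.

```python
from dataclasses import dataclass

INVALID_XID = 0  # stand-in for InvalidTransactionId (assumption of this sketch)


@dataclass
class Backend:
    pid: int
    xmin: int  # INVALID_XID when the backend holds no snapshot


def conflicting_pids(backends, limit_xmin):
    """Toy model of a recovery-conflict scan over active backends."""
    if limit_xmin == INVALID_XID:
        # No horizon supplied: every backend is treated as conflicting,
        # whether or not it currently holds a snapshot.
        return [b.pid for b in backends]
    # Horizon supplied: only backends whose snapshot is old enough conflict.
    # Backends with no snapshot (invalid xmin) are skipped.
    return [
        b.pid
        for b in backends
        if b.xmin != INVALID_XID and b.xmin <= limit_xmin
    ]


if __name__ == "__main__":
    backends = [Backend(pid=101, xmin=INVALID_XID), Backend(pid=102, xmin=750)]
    print(conflicting_pids(backends, 800))
    print(conflicting_pids(backends, INVALID_XID))
```

Under this model, a backend that has released all of its snapshots
escapes a horizon-based conflict but is still selected when no horizon
is given.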

Given this, I am considering handling this conflict class explicitly:
if the WAIT FOR statement is terminated and the error indicates a
recovery conflict, we fall back to the existing polling-based
approach.

* Ask everybody to cancel their queries immediately so we can ensure no
* temp files remain and we can remove the tablespace. Nuke the entire
* site from orbit, it's the only way to be sure.
*
* XXX: We could work out the pids of active backends using this
* tablespace by examining the temp filenames in the directory. We would
* then convert the pids into VirtualXIDs before attempting to cancel
* them.

I am also wondering whether this optimization would be helpful.

--
Best,
Xuneng

Attachments:

v1-0001-Fix-wait_for_catchup-failure-when-standby-session.patch (application/octet-stream)
From a9d8230639118d932883c51cc4b6ecf214840022 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 6 Jan 2026 20:55:43 +0800
Subject: [PATCH v1] Fix wait_for_catchup() failure when standby session is
 killed by recovery conflict

Commit f30848cb optimized wait_for_catchup() to use WAIT FOR LSN on the standby instead of polling pg_stat_replication on the primary. However, this introduced a failure mode: the WAIT FOR LSN session can be killed by recovery conflicts on the standby, causing the test helper to die unexpectedly.

This manifests as flapping failures in tests like 031_recovery_conflict, where DROP TABLESPACE on the primary triggers ResolveRecoveryConflictWithTablespace() on the standby. That function kills all backends indiscriminately, including the innocent WAIT FOR LSN session that happens to be connected at that moment.

Fix by wrapping the WAIT FOR LSN call in an eval block and falling back to the original polling approach when the session is killed by a recovery conflict. The fallback is selective:

- If WAIT FOR LSN succeeds with 'success': return immediately

- If WAIT FOR LSN returns non-success (timeout, not_in_recovery): fail immediately with diagnostics

- If the session is killed by a recovery conflict (error contains "conflict with recovery"): fall back to polling on the primary

- For any other error: fail immediately to avoid masking real problems

The polling fallback is immune to standby-side conflicts because it queries pg_stat_replication on the primary, not the standby.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 53 +++++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a28ea89aa10..08379aeb8fb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3401,22 +3401,52 @@ sub wait_for_catchup
 			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
 			my $wait_query =
 			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
-			my $output = $standby_node->safe_psql('postgres', $wait_query);
-			chomp($output);
 
-			if ($output ne 'success')
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
 			{
-				# Fetch additional detail for debugging purposes
 				my $details = $self->safe_psql('postgres',
 					"SELECT * FROM pg_catalog.pg_stat_replication");
-				diag qq(WAIT FOR LSN failed with status:
-	${output});
-				diag qq(Last pg_stat_replication contents:
-	${details});
-				croak "failed waiting for catchup";
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
 			}
-			print "done\n";
-			return;
 		}
 	}
 
@@ -3424,6 +3454,7 @@ sub wait_for_catchup
 	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
 	# - When standby_name is a string (e.g., subscription name)
 	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0
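
The four-way decision described in the commit message above can be
summarized in a small sketch. The Python below only illustrates the
control flow and is not part of the patch; `classify_wait_result` and
its return labels are hypothetical names chosen for this example.

```python
import re


def classify_wait_result(ran_ok, output="", error=""):
    """Decide what a caller should do after attempting WAIT FOR LSN.

    ran_ok -- True if the statement ran to completion (no exception raised)
    output -- the status text the statement returned, if it completed
    error  -- the error text, if the session failed or was canceled
    Returns one of 'done', 'fail', 'fallback_to_polling'.
    """
    if ran_ok:
        # The statement completed: only 'success' means the target LSN
        # was reached. Any other status (timeout, not_in_recovery) is a
        # real failure, so polling afterwards would only waste time.
        status = output.rstrip("\n")  # roughly what chomp() does in the patch
        return "done" if status == "success" else "fail"

    # The session was interrupted. Fall back only when the error text
    # looks like a recovery conflict; anything else should surface.
    if re.search(r"conflict with recovery", error, re.IGNORECASE):
        return "fallback_to_polling"
    return "fail"


if __name__ == "__main__":
    print(classify_wait_result(True, "success\n"))
    print(classify_wait_result(
        False,
        error="ERROR: canceling statement due to conflict with recovery",
    ))
```

Keeping the match narrow (a single known error substring) is what
prevents unrelated errors from being hidden behind the fallback.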

#89Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#88)
Re: Implement waiting for wal lsn replay: reloaded

On Tue, Jan 6, 2026 at 3:12 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

I agree that moving PopActiveSnapshot() and
InvalidateCatalogSnapshot() to the very beginning of ExecWaitStmt()
appears to be a sensible optimization. However, in this particular
failure scenario, it may not address the issue.

For tablespace conflicts, recovery conflict resolution uses
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), which
returns all active backends, regardless of their snapshot state. As a
result, even if all snapshots are released at the start of
ExecWaitStmt(), the session would still be canceled during replay of
DROP TABLESPACE.

GetConflictingVirtualXIDs() uses proc->xmin to detect the conflicts.
ExecWaitStmt() asserts MyProc->xmin == InvalidTransactionId after
releasing all the snapshots. I still think this happens because
conflict handling happens before ExecWaitStmt() manages to release the
snapshots.

------
Regards,
Alexander Korotkov
Supabase

#90Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#88)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

This version just reformats the commit message.

--
Best,
Xuneng

Attachments:

v2-0001-Fix-wait_for_catchup-failure-when-standby-session.patch (application/octet-stream)
From 1eaf36cbfafb75c91734615529dcc8f0ed7d7999 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 6 Jan 2026 20:55:43 +0800
Subject: [PATCH v2] Fix wait_for_catchup() failure when standby session is
 killed by recovery conflict

Commit f30848cb optimized wait_for_catchup() to use WAIT FOR LSN on
the standby instead of polling pg_stat_replication on the primary.
However, this introduced a failure mode: the WAIT FOR LSN session
can be killed by recovery conflicts on the standby, causing the
test helper to die unexpectedly.

This manifests as flapping failures in tests like 031_recovery_conflict,
where DROP TABLESPACE on the primary triggers
ResolveRecoveryConflictWithTablespace() on the standby. That function
kills all backends indiscriminately, including the innocent WAIT FOR
LSN session that happens to be connected at that moment.

Fix by wrapping the WAIT FOR LSN call in an eval block and falling
back to the original polling approach when the session is killed by
a recovery conflict. The fallback is selective:

- If WAIT FOR LSN succeeds with 'success': return immediately
- If WAIT FOR LSN returns non-success (timeout, not_in_recovery):
  fail immediately with diagnostics
- If the session is killed by a recovery conflict (error contains
  "conflict with recovery"): fall back to polling on the primary
- For any other error: fail immediately to avoid masking real problems

The polling fallback is immune to standby-side conflicts because it
queries pg_stat_replication on the primary, not the standby.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 53 +++++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a28ea89aa10..08379aeb8fb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3401,22 +3401,52 @@ sub wait_for_catchup
 			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
 			my $wait_query =
 			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
-			my $output = $standby_node->safe_psql('postgres', $wait_query);
-			chomp($output);
 
-			if ($output ne 'success')
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
 			{
-				# Fetch additional detail for debugging purposes
 				my $details = $self->safe_psql('postgres',
 					"SELECT * FROM pg_catalog.pg_stat_replication");
-				diag qq(WAIT FOR LSN failed with status:
-	${output});
-				diag qq(Last pg_stat_replication contents:
-	${details});
-				croak "failed waiting for catchup";
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
 			}
-			print "done\n";
-			return;
 		}
 	}
 
@@ -3424,6 +3454,7 @@ sub wait_for_catchup
 	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
 	# - When standby_name is a string (e.g., subscription name)
 	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#91Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#89)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Jan 6, 2026 at 11:34 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

GetConflictingVirtualXIDs() uses proc->xmin to detect the conflicts.
ExecWaitStmt() asserts MyProc->xmin == InvalidTransactionId after
releasing all the snapshots. I still think this happens because
conflict handling happens before ExecWaitStmt() manages to release the
snapshots.

I did not notice this message before. I'll look more closely at this case.

--
Best,
Xuneng

#92Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#91)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Tue, Jan 6, 2026 at 11:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Jan 6, 2026 at 11:34 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Jan 6, 2026 at 3:12 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

Thanks for raising this issue. After some analysis, I think it is
related to f30848cb. I'll prepare a follow-up patch to fix it.

Sorry, I mistakenly referenced this report from commit [1]. I
thought it was related, but it appears not to be. [1] is related to
a report I got from Ruikai Peng off-list.

Regarding the present failure, could it happen before ExecWaitStmt()
calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
should make an effort to release these snapshots earlier.

1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f

I agree that moving PopActiveSnapshot() and
InvalidateCatalogSnapshot() to the very beginning of ExecWaitStmt()
appears to be a sensible optimization. However, in this particular
failure scenario, it may not address the issue.

For tablespace conflicts, recovery conflict resolution uses
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), which
returns all active backends, regardless of their snapshot state. As a
result, even if all snapshots are released at the start of
ExecWaitStmt(), the session would still be canceled during replay of
DROP TABLESPACE.

GetConflictingVirtualXIDs() uses proc->xmin to detect the conflicts.
ExecWaitStmt() asserts MyProc->xmin == InvalidTransactionId after
releasing all the snapshots. I still think this happens because
conflict handling happens before ExecWaitStmt() manages to release the
snapshots.

I did not notice this message before. I'll look more closely at this case.

# VACUUM FREEZE, pruning those dead tuples
$node_primary->safe_psql($test_db, qq[VACUUM FREEZE $table1;]);

# Wait for attempted replay of PRUNE records
$node_primary->wait_for_replay_catchup($node_standby);

check_conflict_log(
"User query might have needed to see row versions that must be removed");
$psql_standby->reconnect_and_clear();
check_conflict_stat("snapshot");

Yeah, this code path could be problematic for the conflict type
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT. I created a patch to narrow the
window for false conflict detection, as you suggested. Please check it too.
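
A toy model of why the ordering matters, as a minimal sketch only: it
assumes the command starts out holding a snapshot and counts how many
steps begin while that snapshot is still held. The ToyStep names and
toy_exposed_steps() are hypothetical and do not exist in wait.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases of a WAIT FOR LSN execution (illustration only). */
typedef enum ToyStep
{
    TOY_PARSE_OPTIONS,
    TOY_RELEASE_SNAPSHOT,
    TOY_VALIDATE_MODE,
    TOY_WAIT
} ToyStep;

/*
 * Count the steps that start while the session still holds a snapshot,
 * i.e. while its xmin is set and a snapshot-based recovery conflict
 * could pick it as a target.  Assumption: the session holds a snapshot
 * on entry.
 */
static size_t
toy_exposed_steps(const ToyStep *steps, size_t nsteps)
{
    bool    holds_snapshot = true;
    size_t  exposed = 0;

    for (size_t i = 0; i < nsteps; i++)
    {
        if (steps[i] == TOY_RELEASE_SNAPSHOT)
        {
            holds_snapshot = false;
            continue;
        }
        if (holds_snapshot)
            exposed++;
    }
    return exposed;
}
```

With the release step moved first, no later step runs with a snapshot
held in this model. It says nothing about conflicts that ignore xmin.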

--
Best,
Xuneng

Attachments:

v1-0001-Move-snapshot-release-to-the-beginning-of-ExecWai.patch (application/octet-stream)
From fef38ad31b4bd0c4ac968a93420a3bb4513e6b15 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 7 Jan 2026 00:57:40 +0800
Subject: [PATCH v1] Move snapshot release to the beginning of ExecWaitStmt()

Move the snapshot handling code (PopActiveSnapshot, InvalidateCatalogSnapshot,
and the HaveRegisteredOrActiveSnapshot check) from after option parsing to
the very beginning of ExecWaitStmt().  This reduces the window during which
the WAIT FOR LSN session could be killed by snapshot-based recovery conflicts.

When a snapshot-based recovery conflict is processed on a hot standby,
GetConflictingVirtualXIDs() targets backends whose xmin <= limitXmin.
By releasing our snapshot and clearing xmin before option parsing, we
become immune to such conflicts during that phase.

This is a pure code movement with no functional change to the snapshot
handling logic itself.  All original comments are preserved.
---
 src/backend/commands/wait.c | 68 ++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 97f1e778488..dd1daa89623 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -44,6 +44,40 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 	bool		no_throw_specified = false;
 	bool		mode_specified = false;
 
+	/*
+	 * We are going to wait for the LSN.  We should first care that we don't
+	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
+	 * command, not a procedure or function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must be called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
 	/* Parse and validate the mandatory LSN */
 	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
 										  CStringGetDatum(stmt->lsn_literal)));
@@ -134,40 +168,6 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 		}
 	}
 
-	/*
-	 * We are going to wait for the LSN.  We should first care that we don't
-	 * hold a snapshot and correspondingly our MyProc->xmin is invalid.
-	 * Otherwise, our snapshot could prevent the replay of WAL records
-	 * implying a kind of self-deadlock.  This is the reason why WAIT FOR is a
-	 * command, not a procedure or function.
-	 *
-	 * At first, we should check there is no active snapshot.  According to
-	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
-	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
-	 * because PortalRunUtility() can tolerate this.
-	 */
-	if (ActiveSnapshotSet())
-		PopActiveSnapshot();
-
-	/*
-	 * At second, invalidate a catalog snapshot if any.  And we should be done
-	 * with the preparation.
-	 */
-	InvalidateCatalogSnapshot();
-
-	/* Give up if there is still an active or registered snapshot. */
-	if (HaveRegisteredOrActiveSnapshot())
-		ereport(ERROR,
-				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				errmsg("WAIT FOR must be called without an active or registered snapshot"),
-				errdetail("WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED."));
-
-	/*
-	 * As the result we should hold no snapshot, and correspondingly our xmin
-	 * should be unset.
-	 */
-	Assert(MyProc->xmin == InvalidTransactionId);
-
 	/*
 	 * Validate that the requested mode matches the current server state.
 	 * Primary modes can only be used on a primary.
-- 
2.51.0

#93Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#85)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

The failure rates of this are very high - the majority of the CI runs on the
postgres/postgres repos failed since the change went in. Which then also means
cfbot has a very high spurious failure rate. I think we need to revert this
change until the problem has been verified as fixed.

Greetings,

Andres Freund

#94Xuneng Zhou
xunengzhou@gmail.com
In reply to: Andres Freund (#93)
2 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

The failure rates of this are very high - the majority of the CI runs on the
postgres/postgres repos failed since the change went in. Which then also means
cfbot has a very high spurious failure rate. I think we need to revert this
change until the problem has been verified as fixed.

This specific failure can be reproduced with this patch v1.

I guess the potential race condition is: when
wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
tablespace conflict fires during that wait, the WAIT FOR LSN session
is killed even though it doesn't use the tablespace.

In my testing, the failure no longer occurs after applying the v2 patch.
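
The decision made by the v2 patch can be sketched as follows. This is
only an illustration in C of the rules listed in the patch's commit
message; the patch itself is Perl, and WaitAction,
contains_ignore_case() and classify_wait_result() are hypothetical
names.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum WaitAction
{
    WAIT_DONE,                  /* WAIT FOR LSN reported 'success' */
    WAIT_FAIL,                  /* report the failure immediately */
    WAIT_FALLBACK_TO_POLLING    /* poll pg_stat_replication on the primary */
} WaitAction;

/* Case-insensitive substring search using only standard C. */
static bool
contains_ignore_case(const char *haystack, const char *needle)
{
    size_t  nlen = strlen(needle);

    if (nlen == 0)
        return true;
    for (; *haystack != '\0'; haystack++)
    {
        size_t  i = 0;

        while (i < nlen && haystack[i] != '\0' &&
               tolower((unsigned char) haystack[i]) ==
               tolower((unsigned char) needle[i]))
            i++;
        if (i == nlen)
            return true;
    }
    return false;
}

/*
 * command_ran: the statement completed and produced a status string.
 * output:      that status string (e.g. "success", "timeout").
 * error_text:  the error message when the statement did not complete.
 */
static WaitAction
classify_wait_result(bool command_ran, const char *output,
                     const char *error_text)
{
    if (command_ran)
        return (output != NULL && strcmp(output, "success") == 0)
            ? WAIT_DONE : WAIT_FAIL;

    /* Only a recovery conflict justifies the silent fallback. */
    if (error_text != NULL &&
        contains_ignore_case(error_text, "conflict with recovery"))
        return WAIT_FALLBACK_TO_POLLING;

    return WAIT_FAIL;
}
```

Any error other than a recovery conflict still fails right away, so
real problems are not masked by the fallback.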

--
Best,
Xuneng

Attachments:

v1-0001-reproduce-the-failure-in-031_recovery_conflict.pl.patch (application/octet-stream)
From ca73929687f9bf7d4aaa258f8e413ff2c3eea6aa Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 7 Jan 2026 11:39:41 +0800
Subject: [PATCH v1] reproduce the failure in 031_recovery_conflict.pl

---
 src/test/recovery/t/031_recovery_conflict.pl | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 7a740f69806..39061fcc0a8 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -198,10 +198,25 @@ like($res, qr/^6000$/m,
 	"$sect: cursor with conflicting temp file established");
 
 # Drop the tablespace currently containing spill files for the query on the
-# standby
+# standby.  We pause replay before the DROP, then resume it via a background
+# session.  This forces wait_for_replay_catchup's internal WAIT FOR LSN to be
+# running when the conflict fires, exercising the recovery conflict handling
+# in Cluster.pm.
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
 $node_primary->safe_psql($test_db, qq[DROP TABLESPACE $tablespace1;]);
 
+# Start a background session that waits 1 second then resumes replay.
+# This triggers the conflict while wait_for_replay_catchup is running.
+my $resume_session = $node_standby->background_psql('postgres');
+$resume_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	SELECT pg_sleep(1);
+	SELECT pg_wal_replay_resume();
+]);
+
 $node_primary->wait_for_replay_catchup($node_standby);
+$resume_session->quit;
 
 check_conflict_log(
 	"User was or might have been using tablespace that must be dropped");
-- 
2.51.0

v2-0001-Fix-wait_for_catchup-failure-when-standby-session.patch (application/octet-stream)
From 1eaf36cbfafb75c91734615529dcc8f0ed7d7999 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 6 Jan 2026 20:55:43 +0800
Subject: [PATCH v2] Fix wait_for_catchup() failure when standby session is
 killed by recovery conflict

Commit f30848cb optimized wait_for_catchup() to use WAIT FOR LSN on
the standby instead of polling pg_stat_replication on the primary.
However, this introduced a failure mode: the WAIT FOR LSN session
can be killed by recovery conflicts on the standby, causing the
test helper to die unexpectedly.

This manifests as flapping failures in tests like 031_recovery_conflict,
where DROP TABLESPACE on the primary triggers
ResolveRecoveryConflictWithTablespace() on the standby. That function
kills all backends indiscriminately, including the innocent WAIT FOR
LSN session that happens to be connected at that moment.

Fix by wrapping the WAIT FOR LSN call in an eval block and falling
back to the original polling approach when the session is killed by
a recovery conflict. The fallback is selective:

- If WAIT FOR LSN succeeds with 'success': return immediately
- If WAIT FOR LSN returns non-success (timeout, not_in_recovery):
  fail immediately with diagnostics
- If the session is killed by a recovery conflict (error contains
  "conflict with recovery"): fall back to polling on the primary
- For any other error: fail immediately to avoid masking real problems

The polling fallback is immune to standby-side conflicts because it
queries pg_stat_replication on the primary, not the standby.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 53 +++++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a28ea89aa10..08379aeb8fb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3401,22 +3401,52 @@ sub wait_for_catchup
 			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
 			my $wait_query =
 			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
-			my $output = $standby_node->safe_psql('postgres', $wait_query);
-			chomp($output);
 
-			if ($output ne 'success')
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
 			{
-				# Fetch additional detail for debugging purposes
 				my $details = $self->safe_psql('postgres',
 					"SELECT * FROM pg_catalog.pg_stat_replication");
-				diag qq(WAIT FOR LSN failed with status:
-	${output});
-				diag qq(Last pg_stat_replication contents:
-	${details});
-				croak "failed waiting for catchup";
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
 			}
-			print "done\n";
-			return;
 		}
 	}
 
@@ -3424,6 +3454,7 @@ sub wait_for_catchup
 	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
 	# - When standby_name is a string (e.g., subscription name)
 	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#95Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#93)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, Jan 7, 2026, 02:32 Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

The failure rates of this are very high - the majority of the CI runs on the
postgres/postgres repos failed since the change went in. Which then also means
cfbot has a very high spurious failure rate. I think we need to revert this
change until the problem has been verified as fixed.

This is fair. I will revert the commit causing the failures in the next few
hours.

------
Regards,
Alexander Korotkov

#96Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#94)
Re: Implement waiting for wal lsn replay: reloaded

On Wed, Jan 7, 2026 at 6:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <andres@anarazel.de> wrote:

On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

The failure rates of this are very high - the majority of the CI runs on the
postgres/postgres repos failed since the change went in. Which then also means
cfbot has a very high spurious failure rate. I think we need to revert this
change until the problem has been verified as fixed.

This specific failure can be reproduced with this patch v1.

I guess the potential race condition is: when
wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
tablespace conflict fires during that wait, the WAIT FOR LSN session
is killed even though it doesn't use the tablespace.

In my testing, the failure no longer occurs after applying the v2 patch.

I see, you were right. This is not related to MyProc->xmin.
ResolveRecoveryConflictWithTablespace() calls
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid). That
would kill the WAIT FOR LSN query regardless of its xmin. I guess your
patch is the only way to go. It's clumsy to wrap the WAIT FOR LSN call
in a retry loop, but it would still consume fewer resources than
polling.
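
To illustrate why xmin does not matter for this conflict type, here is
a minimal sketch of the selection rule. It is a simplified model, not
the server code: the Toy* names are hypothetical, XIDs are compared as
plain integers instead of with wraparound-aware logic, and the
database filter is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;

#define TOY_INVALID_XID ((ToyXid) 0)

/* Toy stand-in for a backend; only its xmin matters here. */
typedef struct ToyBackend
{
    int     pid;
    ToyXid  xmin;       /* TOY_INVALID_XID means "holds no snapshot" */
} ToyBackend;

/*
 * Is this backend a conflict candidate for the given limit?
 * An invalid limit selects every backend, whatever its xmin is.
 * A valid limit selects only backends whose xmin is set and is
 * not newer than the limit.
 */
static bool
toy_backend_conflicts(const ToyBackend *backend, ToyXid limit_xmin)
{
    if (limit_xmin == TOY_INVALID_XID)
        return true;
    return backend->xmin != TOY_INVALID_XID && backend->xmin <= limit_xmin;
}

/* Count the conflict candidates among n backends. */
static size_t
toy_count_conflicts(const ToyBackend *backends, size_t n, ToyXid limit_xmin)
{
    size_t  count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (toy_backend_conflicts(&backends[i], limit_xmin))
            count++;
    }
    return count;
}
```

In this model a waiter with no snapshot is skipped for any valid
limit, but it is still selected when the limit is invalid, which is
the tablespace case.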

------
Regards,
Alexander Korotkov
Supabase

#97Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#96)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Thu, Jan 8, 2026 at 10:19 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Jan 7, 2026 at 6:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <andres@anarazel.de> wrote:

On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:

Could this be causing the recent flapping failures on CI/macOS in
recovery/031_recovery_conflict? I didn't have time to dig personally
but f30848cb looks relevant:

Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
conflict with recovery
DETAIL: User was or might have been using tablespace that must be dropped.'
while running 'psql --no-psqlrc --no-align --tuples-only --quiet
--dbname port=25195
host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 2300.

https://cirrus-ci.com/task/5771274900733952

The master branch in time-descending order, macOS tasks only:

task_id | substring | status
------------------+-----------+-----------
6460882231754752 | c970bdc0 | FAILED
5771274900733952 | 6ca8506e | FAILED
6217757068361728 | 63ed3bc7 | FAILED
5980650261446656 | ae283736 | FAILED
6585898394976256 | 5f13999a | COMPLETED
4527474786172928 | 7f9acc9b | COMPLETED
4826100842364928 | e8d4e94a | COMPLETED
4540563027918848 | b9ee5f2d | FAILED
6358528648019968 | c5af141c | FAILED
5998005284765696 | e212a0f8 | COMPLETED
6488580526178304 | b85d5dc0 | FAILED
5034091344560128 | 7dc95cc3 | ABORTED
5688692477526016 | bb048e31 | COMPLETED
5481187977723904 | d351063e | COMPLETED
5101831568752640 | f30848cb | COMPLETED <-- the change
6395317408497664 | 3f33b63d | COMPLETED
6741325208354816 | 877ae5db | COMPLETED
4594007789010944 | de746e0d | COMPLETED
6497208998035456 | 461b8cc9 | COMPLETED

The failure rates of this are very high - the majority of the CI runs on the
postgres/postgres repos failed since the change went in. Which then also means
cfbot has a very high spurious failure rate. I think we need to revert this
change until the problem has been verified as fixed.

This specific failure can be reproduced with this patch v1.

I guess the potential race condition is: when
wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
tablespace conflict fires during that wait, the WAIT FOR LSN session
is killed even though it doesn't use the tablespace.

In my testing, the failure no longer occurs after applying the v2 patch.

I see, you were right. This is not related to MyProc->xmin.
ResolveRecoveryConflictWithTablespace() calls
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid). That
would kill the WAIT FOR LSN query regardless of its xmin.

I think the concern is valid --- conflicts like
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT could occur and terminate the
backend if the timing is unlucky. It's more difficult to reproduce
though. A check for the log containing "conflict with recovery" would
likely catch these conflicts as well.

I guess your
patch is the only way to go. It's clumsy to wrap the WAIT FOR LSN call
in a retry loop, but it would still consume fewer resources than
polling.

Recovery conflicts should be relatively rare in TAP tests, apart from
tests explicitly designed for them, such as 031_recovery_conflict. They
also matter only in the narrow window when WAIT FOR is invoked before
the standby has caught up. Given that, a simple fallback seems
appropriate to me.

--
Best,
Xuneng

#98Alexander Korotkov
aekorotkov@gmail.com
In reply to: Xuneng Zhou (#97)
Re: Implement waiting for wal lsn replay: reloaded

On Thu, Jan 8, 2026 at 6:29 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Jan 8, 2026 at 10:19 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I see, you were right. This is not related to MyProc->xmin.
ResolveRecoveryConflictWithTablespace() calls
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid). That
would kill the WAIT FOR LSN query regardless of its xmin.

I think the concern is valid --- conflicts like
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT could occur and terminate the
backend if the timing is unlucky. It's more difficult to reproduce
though. A check for the log containing "conflict with recovery" would
likely catch these conflicts as well.

Yes, I found multiple reasons why xmin gets temporarily set during
processing of a WAIT FOR LSN query. I'll soon post a draft patch to fix
that.

I guess your
patch is the only way to go. It's clumsy to wrap the WAIT FOR LSN call
in a retry loop, but it would still consume fewer resources than
polling.

Recovery conflicts should be relatively rare in TAP tests, apart from
tests explicitly designed for them, such as 031_recovery_conflict. They
also matter only in the narrow window when WAIT FOR is invoked before
the standby has caught up. Given that, a simple fallback seems
appropriate to me.

Yes, I see. That seems acceptable, given that it appears to be the only
feasible way to go.

------
Regards,
Alexander Korotkov
Supabase

#99Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Korotkov (#98)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi,

On Fri, Jan 9, 2026 at 4:42 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Jan 8, 2026 at 6:29 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Thu, Jan 8, 2026 at 10:19 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I see, you were right. This is not related to MyProc->xmin.
ResolveRecoveryConflictWithTablespace() calls
GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid). That
would kill the WAIT FOR LSN query regardless of its xmin.

I think the concern is valid --- conflicts like
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT could occur and terminate the
backend if the timing is unlucky. It's more difficult to reproduce
though. A check for the log containing "conflict with recovery" would
likely catch these conflicts as well.

Yes, I found multiple reasons why xmin gets temporarily set during
processing of a WAIT FOR LSN query. I'll soon post a draft patch to fix
that.

I guess your
patch is the only way to go. It's clumsy to wrap the WAIT FOR LSN call
in a retry loop, but it would still consume fewer resources than
polling.

Recovery conflicts should be relatively rare in TAP tests, apart from
tests explicitly designed for them, such as 031_recovery_conflict. They
also matter only in the narrow window when WAIT FOR is invoked before
the standby has caught up. Given that, a simple fallback seems
appropriate to me.

Yes, I see. That seems acceptable, given that it appears to be the only
feasible way to go.

Here is the updated patch with recovery conflicts handled.

--
Best,
Xuneng

Attachments:

v1-0001-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patch (application/octet-stream)
From 1b1aa652aff6681e5f43eba4f4690b174052d478 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 9 Jan 2026 21:32:12 +0800
Subject: [PATCH v1] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.

Additionally, if the WAIT FOR LSN session is killed by a recovery
conflict (e.g., DROP TABLESPACE killing all backends indiscriminately),
the function catches this error and falls back to polling.  This makes
the test infrastructure robust against the timing-dependent conflicts
that can occur in tests like 031_recovery_conflict.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 91 +++++++++++++++++++++++-
 1 file changed, 90 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 955dfc0e7f8..87c3d2750cb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,85 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write'  => 'standby_write',
+				'flush'  => 'standby_flush',
+			);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
+			{
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' to '$target_lsn' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
+			}
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#100Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#99)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

V2 corrects the commit message to state "if the WAIT FOR LSN session
is interrupted by a recovery conflict (e.g., DROP TABLESPACE
triggering conflicts on all backends)". In this case, the statement
is canceled when possible; in some states (idle in transaction or
subtransaction) the session may be terminated.

--
Best,
Xuneng

Attachments:

v2-0001-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patch (application/octet-stream)
From 8d92735d473b974bfe53183615c448792ad209dc Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 9 Jan 2026 21:32:12 +0800
Subject: [PATCH v2] Use WAIT FOR LSN in
 PostgreSQL::Test::Cluster::wait_for_catchup()

When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
- The standby is in recovery

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), or when the standby has
been promoted, the function falls back to the original polling-based
approach using pg_stat_replication on the upstream.

Additionally, if the WAIT FOR LSN session is interrupted by a recovery
conflict (e.g., DROP TABLESPACE triggering conflicts on all backends),
the function catches this error and falls back to polling.  This makes
the test infrastructure robust against the timing-dependent conflicts
that can occur in tests like 031_recovery_conflict.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 91 +++++++++++++++++++++++-
 1 file changed, 90 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 955dfc0e7f8..87c3d2750cb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3320,6 +3320,13 @@ If you pass an explicit value of target_lsn, it should almost always be
 the primary's write LSN; so this parameter is seldom needed except when
 querying some intermediate replication node rather than the primary.
 
+When the standby is passed as a PostgreSQL::Test::Cluster instance and is
+in recovery, this function uses the WAIT FOR LSN command on the standby
+for modes replay, write, and flush.  This is more efficient than polling
+pg_stat_replication on the upstream, as WAIT FOR LSN uses a latch-based
+wakeup mechanism.  For 'sent' mode, or when the standby is passed as a
+string (e.g., a subscription name), it falls back to polling.
+
 If there is no active replication connection from this peer, waits until
 poll_query_until timeout.
 
@@ -3339,10 +3346,13 @@ sub wait_for_catchup
 	  . join(', ', keys(%valid_modes))
 	  unless exists($valid_modes{$mode});
 
-	# Allow passing of a PostgreSQL::Test::Cluster instance as shorthand
+	# Keep a reference to the standby node if passed as an object, so we can
+	# use WAIT FOR LSN on it later.
+	my $standby_node;
 	if (blessed($standby_name)
 		&& $standby_name->isa("PostgreSQL::Test::Cluster"))
 	{
+		$standby_node = $standby_name;
 		$standby_name = $standby_name->name;
 	}
 	if (!defined($target_lsn))
@@ -3367,6 +3377,85 @@ sub wait_for_catchup
 	  . $self->name . "\n";
 	# Before release 12 walreceiver just set the application name to
 	# "walreceiver"
+
+	# Use WAIT FOR LSN on the standby when:
+	# - The standby was passed as a Cluster object (so we can connect to it)
+	# - The mode is replay, write, or flush (not 'sent')
+	# - The standby is in recovery
+	# This is more efficient than polling pg_stat_replication on the upstream,
+	# as WAIT FOR LSN uses a latch-based wakeup mechanism.
+	if (defined($standby_node) && ($mode ne 'sent'))
+	{
+		my $standby_in_recovery =
+		  $standby_node->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+		chomp($standby_in_recovery);
+
+		if ($standby_in_recovery eq 't')
+		{
+			# Map mode names to WAIT FOR LSN mode names
+			my %mode_map = (
+				'replay' => 'standby_replay',
+				'write'  => 'standby_write',
+				'flush'  => 'standby_flush',
+			);
+			my $wait_mode = $mode_map{$mode};
+			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+			my $wait_query =
+			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
+
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
+			{
+				my $details = $self->safe_psql('postgres',
+					"SELECT * FROM pg_catalog.pg_stat_replication");
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' to '$target_lsn' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
+			}
+		}
+	}
+
+	# Fall back to polling pg_stat_replication on the upstream for:
+	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
+	# - When standby_name is a string (e.g., subscription name)
+	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

#101Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#100)
1 attachment(s)
Re: Implement waiting for wal lsn replay: reloaded

Hi Alexander,

The attached patch avoids a syscache lookup while constructing the
tuple descriptor for WAIT FOR LSN, so that a catalog snapshot is not
re-established after the wait finishes.

The standard output path (printtup) may still briefly establish a
catalog snapshot during result emission, but this seems acceptable:
the snapshot window is narrow, lasting only as long as it takes to
emit a single row. A fully catalog-free output path would require
either bypassing the DestReceiver lifecycle (breaking layering) or
adding a custom receiver (added complexity for marginal benefit). The
current approach is simpler and is probably sufficient unless
output-phase conflicts turn out to be frequent in practice. Does this
make sense to you?

--
Best,
Xuneng

Attachments:

v1-0001-Avoid-syscache-lookup-in-WAIT-FOR-LSN-tuple-descr.patch (application/octet-stream)
From 50beec4de4e078020b03667453458cd440c26267 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 12 Jan 2026 14:36:27 +0800
Subject: [PATCH v1] Avoid syscache lookup in WAIT FOR LSN tuple descriptor

Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when
building the result tuple descriptor for WAIT FOR LSN. This avoids
a syscache access that could re-establish a catalog snapshot after
we've explicitly released all snapshots before the wait.
---
 src/backend/commands/wait.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 97f1e778488..191c1877125 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -15,6 +15,7 @@
 
 #include <math.h>
 
+#include "access/tupdesc.h"
 #include "access/xlog.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogwait.h"
@@ -320,7 +321,17 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
 			break;
 	}
 
-	/* need a tuple descriptor representing a single TEXT column */
+	/*
+	 * Output the result.
+	 *
+	 * We use TupleDescInitBuiltinEntry in WaitStmtResultDesc to avoid
+	 * syscache access when building the tuple descriptor. The standard output
+	 * path may briefly establish a catalog snapshot during output, but this
+	 * is acceptable since: 1. The snapshot window is very brief (just
+	 * emitting one row) 2. The critical section (the wait itself) is already
+	 * snapshot-free 3. Using the standard path respects receiver lifecycle
+	 * and semantics
+	 */
 	tupdesc = WaitStmtResultDesc(stmt);
 
 	/* prepare for projection of tuples */
@@ -337,9 +348,16 @@ WaitStmtResultDesc(WaitStmt *stmt)
 {
 	TupleDesc	tupdesc;
 
-	/* Need a tuple descriptor representing a single TEXT  column */
+	/*
+	 * Need a tuple descriptor representing a single TEXT column.
+	 *
+	 * We use TupleDescInitBuiltinEntry instead of TupleDescInitEntry to avoid
+	 * syscache access. This is important because WaitStmtResultDesc may be
+	 * called after snapshots have been released, and we must not re-establish
+	 * a catalog snapshot which could cause recovery conflicts on a standby.
+	 */
 	tupdesc = CreateTemplateTupleDesc(1);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
-					   TEXTOID, -1, 0);
+	TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 1, "status",
+							  TEXTOID, -1, 0);
 	return tupdesc;
 }
-- 
2.51.0